
Text Deduplication

A process that removes identical or near-duplicate text samples from a dataset to improve training and evaluation quality.

Text deduplication is a quiet but high-impact quality step in large-scale corpus preparation. Repeated copies of the same or nearly identical text over-weight that content during training, encourage memorization, and can inflate evaluation scores when test examples also appear in the training data. It is especially critical in LLM pretraining, retrieval indexing, and test-set hygiene.
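As a rough illustration of the two cases in the definition (identical and near-duplicate samples), the sketch below drops exact duplicates by hashing normalized text and drops near-duplicates by Jaccard similarity over word trigrams. The function names, the normalization rules, the trigram size, and the 0.8 threshold are all assumptions chosen for illustration, not a reference method. The pairwise comparison is quadratic in the number of kept samples, so it is only suitable for small datasets; large corpora typically rely on approximate techniques such as MinHash with locality-sensitive hashing.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams (shingles) of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        # Text shorter than n words: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(texts, threshold: float = 0.8, n: int = 3) -> list:
    """Keep the first occurrence of each text; drop exact and near duplicates.

    The threshold and shingle size are illustrative assumptions.
    """
    seen_hashes = set()
    kept = []  # list of (original text, shingle set)
    for text in texts:
        # Exact-duplicate check on the normalized form.
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Near-duplicate check against every sample kept so far (O(n^2)).
        current = shingles(text, n)
        if any(jaccard(current, other) >= threshold for _, other in kept):
            continue
        seen_hashes.add(digest)
        kept.append((text, current))
    return [text for text, _ in kept]


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "The quick  brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "an entirely different sentence about corpus preparation and evaluation",
]
unique_docs = deduplicate(docs)
```

In this example the second sample differs from the first only in case and spacing, so it is removed as an exact duplicate after normalization. The third shares 11 of 12 distinct trigrams with the first, so it is removed as a near-duplicate. The same idea applies to test-set hygiene: comparing evaluation samples against the training set, rather than a dataset against itself, flags likely leakage.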