Technical Glossary · Natural Language Processing

Pretraining Corpus

The large collection of text a language model is trained on to acquire general linguistic and world knowledge.

The pretraining corpus strongly determines which linguistic patterns, domain knowledge, and cultural assumptions a model will learn. Beyond sheer data volume, diversity, cleanliness, licensing, and language distribution are all critical. The behavior of large models is often shaped as much by the character of the corpus as by the architecture. For that reason, data selection is an inseparable part of model design.
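Corpus "cleanliness" work typically includes deduplication and filtering of low-value documents. As a minimal sketch (the helper `clean_corpus` and its thresholds are hypothetical, not from the source), exact deduplication can be done by hashing normalized text, combined with a simple length filter:

```python
import hashlib

def clean_corpus(docs, min_words=5):
    """Drop very short documents and exact duplicates.

    Duplicates are detected via an MD5 hash of whitespace-normalized,
    lowercased text; real pipelines also use near-duplicate methods
    such as MinHash, which this sketch omits.
    """
    seen = set()
    kept = []
    for text in docs:
        norm = " ".join(text.split()).lower()
        if len(norm.split()) < min_words:
            continue  # too short to carry useful training signal
        digest = hashlib.md5(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick  brown FOX jumps over the lazy dog.",  # duplicate once normalized
    "Too short.",                                     # below the word threshold
]
print(len(clean_corpus(docs)))  # → 1
```

Even this toy filter illustrates why such choices matter: the normalization rules and thresholds directly decide which text the model ever sees.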