Technical Glossary · Natural Language Processing

Subword Tokenization

A tokenization approach that splits rare or unseen words into smaller, meaningful units, balancing vocabulary size against coverage.

Subword tokenization has become standard in modern NLP and large language models. It reduces the rare-word problem of word-level approaches while avoiding the extreme fragmentation of character-level methods. It is especially advantageous for agglutinative languages such as Turkish and for multilingual systems. It is one of the key design choices shaping how a model behaves when it encounters unfamiliar words.
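Byte-pair encoding (BPE) is one widely used subword algorithm: starting from characters, it repeatedly merges the most frequent adjacent symbol pair, so common words stay whole while rare words split into reusable pieces. A minimal sketch (illustrative only, not a production tokenizer; the function names `learn_bpe` and `segment` are my own) might look like:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn a list of BPE merge rules from a corpus of words."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the merges in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))  # → ['low', 'est']
```

Even though "lowest" never appears in the toy corpus, it is segmented into the familiar pieces "low" and "est", which is exactly the rare-word behavior the paragraph above describes.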