Subword Tokenization
An approach that splits rare words into smaller meaningful pieces to balance vocabulary size and coverage.
Subword tokenization has become standard in modern NLP and large language models. It reduces the rare-word problem of word-level approaches while avoiding the extreme fragmentation of character-level methods. It is especially advantageous for agglutinative languages such as Turkish and for multilingual systems, and it is one of the key design choices shaping how a model behaves when it encounters unfamiliar words.
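A minimal sketch of byte-pair encoding (BPE), the most widely used subword algorithm, illustrates the idea: starting from characters, repeatedly merge the most frequent adjacent symbol pair so that common words stay whole while rare words split into pieces. The toy corpus and function names below are illustrative, not from any particular library.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    """Learn an ordered list of merge rules from a frequency-annotated corpus."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus: words pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, final_vocab = learn_bpe(corpus, 10)
```

Because "es" appears in both "newest" and "widest", the pair `("e", "s")` is the first merge learned; frequent substrings become single tokens while rarer words remain split into several subwords. Production tokenizers apply the learned merge list in order to segment new text.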
