Technical Glossary › Natural Language Processing

Tokenization

The core language processing step that splits raw text into units a model can process.

Tokenization is one of the foundational decisions that determine how an NLP system sees text. The split can happen at the word, subword, or character level, or use special symbols. This choice affects not only the model's input representation, but also vocabulary size, robustness to unseen or misspelled words, and multilingual behavior. Although it may look like a technical detail, tokenization is a structural component at the core of model behavior.