
Unigram Language Model Tokenization

A method that learns a subword vocabulary probabilistically, so that token segmentation better reflects the statistics of the training data.

Unigram tokenization treats each subword as an independent unit and scores a segmentation of a word by the product of the unigram probabilities of its tokens. Unlike BPE, which builds a vocabulary bottom-up through a fixed sequence of merges, unigram training starts from a large candidate vocabulary and iteratively prunes the units whose removal least increases the loss on the corpus. It is the default algorithm in the SentencePiece library, and because every segmentation of a word has a well-defined probability, it also supports sampling alternative segmentations (subword regularization) rather than always emitting a single deterministic split.
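Given a trained vocabulary of unigram probabilities, the best segmentation of a word can be found with a Viterbi-style dynamic program that maximizes the sum of log-probabilities. A minimal sketch, using a hypothetical toy vocabulary (in a real model the probabilities come from EM training over a large corpus):

```python
import math

# Hypothetical toy vocabulary of subword units with unigram probabilities.
# In a trained unigram model these are estimated with EM, not set by hand.
vocab = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.15, "ug": 0.15, "gs": 0.05,
    "hug": 0.30, "hugs": 0.15,
}

def segment(word):
    """Viterbi search: best[i] holds (score, tokens) for the best
    segmentation of word[:i], maximizing the sum of log-probabilities."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start][1] is not None:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(segment("hugs"))  # → ['hugs']: the single unit beats any multi-token split
```

Here the whole-word token wins because log(0.15) exceeds, say, log(0.30) + log(0.05) for "hug" + "s"; shifting the probability mass in the vocabulary changes which segmentation the search prefers, which is exactly the lever unigram training optimizes.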