Unigram Language Model Tokenization
A method that learns a probabilistic subword vocabulary so that token segmentation better reflects the statistics of the training data.
Unigram tokenization selects a subword vocabulary by modeling each unit's probabilistic contribution to the likelihood of the corpus. Unlike BPE, which builds its vocabulary through a fixed sequence of merges, the unigram approach starts from a large candidate vocabulary and prunes the pieces whose removal least reduces the corpus likelihood under a unigram language model. It is widely used through the SentencePiece library and enables more flexible vocabulary design, since a word can be segmented in multiple ways and each segmentation scored by its probability.
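At inference time, a trained unigram model segments a word by choosing the split that maximizes the sum of the log-probabilities of its pieces, typically via a Viterbi-style dynamic program. The sketch below illustrates only that decoding step; the vocabulary and its probabilities are made-up illustrative values, not a trained model.

```python
import math

# Illustrative subword vocabulary with unigram probabilities (made-up values).
vocab = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.10, "ug": 0.15, "gs": 0.05, "hug": 0.20, "ugs": 0.05,
}

def segment(word):
    """Viterbi search: pick the segmentation maximizing total log-probability."""
    n = len(word)
    # best[i] = (best log-prob of word[:i], start index of the last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Follow backpointers to recover the winning token sequence.
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(word[start:pos])
        pos = start
    return tokens[::-1]

print(segment("hugs"))  # the single piece "hug" plus "s" beats finer splits
```

Here "hug" + "s" wins because log(0.20) + log(0.05) exceeds the total for any segmentation into smaller, lower-probability pieces; training the vocabulary itself (EM-style likelihood estimation and pruning) is a separate procedure not shown here.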