
Unigram Language Model Tokenization

A method that learns a subword vocabulary probabilistically, so that token segmentation better reflects the statistics of the training data.

Unigram tokenization treats each subword as an independent unit and scores a segmentation of a word by the product of the unigram probabilities of its tokens. Unlike BPE, which builds a vocabulary bottom-up through a fixed sequence of merges, unigram training starts from a large candidate vocabulary and iteratively prunes the units whose removal least increases the loss on the corpus. It is the default algorithm in the SentencePiece library, and because every segmentation of a word has a well-defined probability, it also supports sampling alternative segmentations (subword regularization) rather than always emitting a single deterministic split.
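Given a trained vocabulary of unigram probabilities, the best segmentation of a word can be found with a Viterbi-style dynamic program that maximizes the sum of log-probabilities. A minimal sketch, using a hypothetical toy vocabulary (in a real model the probabilities come from EM training over a large corpus):

```python
import math

# Hypothetical toy vocabulary of subword units with unigram probabilities.
# In a trained unigram model these are estimated with EM, not set by hand.
vocab = {
    "h": 0.05, "u": 0.05, "g": 0.05, "s": 0.05,
    "hu": 0.15, "ug": 0.15, "gs": 0.05,
    "hug": 0.30, "hugs": 0.15,
}

def segment(word):
    """Viterbi search: best[i] holds (score, tokens) for the best
    segmentation of word[:i], maximizing the sum of log-probabilities."""
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start][1] is not None:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

print(segment("hugs"))  # → ['hugs']: the single unit beats any multi-token split
```

Here the whole-word token wins because log(0.15) exceeds, say, log(0.30) + log(0.05) for "hug" + "s"; shifting the probability mass in the vocabulary changes which segmentation the search prefers, which is exactly the lever unigram training optimizes.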