
SentencePiece

A tokenization framework that can learn subword vocabularies from raw text without relying on whitespace segmentation.

SentencePiece is especially useful for languages and multilingual systems in which whitespace-based word segmentation is unreliable, such as Chinese or Japanese, which are written without spaces between words. Because it operates directly on raw text and treats whitespace as an ordinary symbol, it needs no language-specific pre-tokenizer, which brings it closer to language independence. This makes token vocabulary construction flexible and reproducible in large-scale pretraining pipelines.
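The idea of learning subwords from raw text can be illustrated with a minimal sketch. It is a toy byte-pair-encoding learner written for this entry, not the SentencePiece library or its actual algorithm implementation. All function names (`escape`, `learn_merges`, `encode`, `decode`) are hypothetical. The one detail borrowed from SentencePiece is the use of "▁" (U+2581) as a visible stand-in for a space, so that whitespace becomes just another symbol and the text never has to be split into words first. The sketch assumes the input does not already contain "▁".

```python
from collections import Counter

WS = "\u2581"  # "▁", the visible marker SentencePiece uses for a space


def escape(text):
    """Turn spaces into an ordinary symbol so no word splitting is needed."""
    # The leading marker is a simplification chosen for this sketch.
    return WS + text.replace(" ", WS)


def apply_merge(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    seqs = [list(escape(line)) for line in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        # Break frequency ties by the pair itself so training is reproducible.
        pair = max(counts, key=lambda p: (counts[p], p))
        merges.append(pair)
        seqs = [apply_merge(seq, pair) for seq in seqs]
    return merges


def encode(text, merges):
    """Segment raw text by replaying the learned merges in order."""
    symbols = list(escape(text))
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def decode(pieces):
    """Concatenate pieces, restore spaces, and drop the leading marker."""
    return "".join(pieces).replace(WS, " ")[1:]


# Mixed corpus: one line has no spaces at all, yet needs no special handling.
corpus = ["hello world", "hello there", "東京都に住む"]
merges = learn_merges(corpus, num_merges=10)
pieces = encode("hello world", merges)
print(pieces)
print(decode(pieces) == "hello world")
```

Because spaces are kept as symbols inside the pieces, decoding is plain concatenation followed by mapping the marker back to a space, and the same code path handles the unspaced Japanese line and the English lines alike.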