# SentencePiece + Unigram LM (Kudo 2018): Probabilistic Tokenization and Subword Regularization

> Source: https://sukruyusufkaya.com/en/learn/llm-muhendisligi/sentencepiece-unigram-lm-kudo
> Updated: 2026-05-13T13:00:26.342Z
> Category: LLM Engineering
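> Module: Module 6: Tokenization Microsurgery

**TLDR:** SentencePiece is a tokenization framework; the Unigram language model is one of the algorithms it implements. Kudo's 2018 probabilistic approach starts from a large seed vocabulary and prunes it with EM. Encoding picks the most probable segmentation with the Viterbi algorithm, subword regularization samples alternative segmentations during training, and whitespace is treated as an ordinary character, written `▁`. T5 uses SentencePiece with Unigram; Llama 1/2 and Mistral 7B ship SentencePiece models trained in BPE mode. The approach has particular advantages for Turkish and multilingual text.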
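To make the Viterbi encoding step concrete, here is a minimal sketch in Python. It assumes a hand-written toy vocabulary (`LOGP`, with made-up probabilities rather than values from a trained model) and a hypothetical helper `viterbi_segment`; it is not SentencePiece's actual implementation, which also handles normalization, unknown characters, and much larger vocabularies.

```python
import math

# Toy unigram vocabulary: piece -> log-probability.
# Illustrative values only; a real model learns these with EM.
LOGP = {
    "▁token": math.log(0.10),
    "▁to": math.log(0.08),
    "ization": math.log(0.06),
    "ation": math.log(0.05),
    "ken": math.log(0.04),
    "iz": math.log(0.02),
}


def viterbi_segment(text, logp):
    """Return the highest-scoring segmentation of `text` and its log-probability."""
    # Whitespace becomes a visible character; a leading marker opens the first word.
    s = "▁" + text.replace(" ", "▁")
    n = len(s)
    max_len = max(len(piece) for piece in logp)

    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix s[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that path
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = s[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(s[start:end])
        end = start
    return pieces[::-1], best[n]


pieces, score = viterbi_segment("tokenization", LOGP)
print(pieces, score)
```

With this toy vocabulary, `▁token` + `ization` beats both `▁to` + `ken` + `ization` and `▁token` + `iz` + `ation`, because the product of its piece probabilities is the largest. Subword regularization replaces this single best path with a segmentation sampled from the same distribution.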

