# BPE / SentencePiece / Unigram: The Math of Tokenizer Algorithms and Training a TR-Aware Tokenizer from Scratch

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-bpe-sentencepiece-unigram-mathematics
> Updated: 2026-05-14T14:42:50.176Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part II — Tokenizer & Data Engineering

**TLDR:** BPE's merge table, SentencePiece's language-agnostic byte/char model, and Unigram's EM training; why each yields different token efficiency. Training a 50K-vocab BPE on a 1.5 GB Turkish corpus on an RTX 4090 (~12 min). Mathematical proof of why a TR-aware tokenizer beats Llama-3's default by 1.6x.
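
As a quick orientation before the full walkthrough, here is a minimal sketch of the kind of training run the TLDR describes: a 50K-vocab byte-level BPE trained with the Hugging Face `tokenizers` library. The corpus path, special tokens, and output filename are illustrative assumptions, not the article's exact configuration.

```python
# Minimal sketch: train a 50K-vocab byte-level BPE tokenizer on a Turkish corpus.
# Paths, special tokens, and filenames below are placeholders for illustration.
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Byte-level BPE: every byte is in the base alphabet, so no <unk> token is needed.
tokenizer = Tokenizer(models.BPE(unk_token=None))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                                   # target vocabulary size
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # seed with all 256 bytes
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],  # hypothetical specials
)

# Hypothetical path to the ~1.5 GB Turkish corpus, one document per line.
tokenizer.train(files=["tr_corpus.txt"], trainer=trainer)
tokenizer.save("tr_bpe_50k.json")
```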

