BPE Algorithm: Sennrich 2016 Line by Line — Pseudocode, Complexity, Edge Cases
BPE mathematical anatomy. Sennrich 2016 paper line by line: pre-tokenization, byte-pair merge counting, greedy merge selection, vocabulary construction, encoding logic, complexity analysis (O(N·V)), edge cases (Unicode, whitespace, special tokens). Full understanding before implementation in Module 6.3.
Şükrü Yusuf KAYA
55 min read
Intermediate 🔬 The tokenization engine of modern LLMs
The BPE algorithm sits inside the tokenizer of GPT-2, GPT-3, GPT-4, Llama 1-4, Mistral, DeepSeek-V3 — all of them. The Sennrich 2016 NLP paper laid it out. This lesson treats the algorithm as mathematical anatomy: pseudo-code line by line, complexity proof, edge-case enumeration. After 55 minutes: you use BPE as a 'transparent box' rather than a 'black box', and can make informed decisions about your own tokenizer.
Lesson Map#
- The history of BPE: Gage 1994 → Sennrich 2016
- The algorithm at a high level: merge-by-frequency
- Pre-tokenization: word boundaries
- Training pseudo-code, line by line
- Merge rules: how they are stored
- Encoding pseudo-code: training in reverse
- Complexity analysis: O(N·V) training, O(L²) encoding
- Edge cases: Unicode, whitespace, special tokens
- Byte-level BPE (the GPT-2 variant), a preview
- Bridge to Module 6.3
1. The History of BPE#
Philip Gage 1994 — Original BPE#
Compression algorithm:
"Replace most frequent pair of bytes with a new byte (token). Repeat."
```
Input:  "aaabdaaabac"
Pair frequencies: aa=4, ab=2, bd=1, da=1, ac=1, ba=1
Most frequent: aa → replace with X
Result: "XabdXabac"
Continue...
```
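Gage's replacement loop can be sketched in a few lines of Python (a simplified illustration of the idea, not Gage's original C code):

```python
from collections import Counter

def most_frequent_pair(data: str):
    """Count all adjacent (overlapping) pairs and return the most frequent one."""
    pairs = Counter(data[i:i+2] for i in range(len(data) - 1))
    return pairs.most_common(1)[0] if pairs else None

data = "aaabdaaabac"
pair, freq = most_frequent_pair(data)
print(pair, freq)               # 'aa' 4
print(data.replace(pair, "X"))  # 'XabdXabac'
```

One round of "find the most frequent pair, replace it with a fresh symbol" — Gage's algorithm simply repeats this until no pair occurs more than once.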
Goal: data compression. Published openly in 1994; the technique is in the public domain.
Sennrich 2016 — NLP adaptation#
Rico Sennrich, Barry Haddow, Alexandra Birch (Edinburgh) — "Neural Machine Translation of Rare Words with Subword Units", ACL 2016.
Adaptation:
- Operate on words (not bytes)
- Learn sub-word units
- Solve the out-of-vocabulary problem
Motivation: NMT models couldn't handle rare/morphologically rich words. BPE solves this.
GPT-2 byte-level adaptation (2019)#
OpenAI: apply BPE at the byte level. Start the vocabulary from the 256 bytes and never produce sub-byte units. Full UTF-8 coverage.
Modern modifications#
- GPT-4 cl100k: improved digit handling
- GPT-4o o200k: better multilingual coverage
- Llama 3 tiktoken-style: similar to OpenAI
The core BPE algorithm in all of them is Sennrich 2016; the modifications are in the training corpus and pre-tokenization details.
2. The BPE Algorithm at a High Level#
An iterative algorithm: find the most frequent adjacent pair, merge it, repeat.
Pseudo-code (high-level)#
```
1. Initialize vocab = {all characters/bytes}
2. Pre-tokenize text into words
3. For each word, split into characters:
   "hello" → ["h", "e", "l", "l", "o"]
4. Repeat until vocab_size hit:
   a. Count all adjacent pairs across corpus
   b. Find most frequent pair (e.g., "lo" appears 1000 times)
   c. Add this pair as new token to vocab
   d. Replace all occurrences in corpus:
      "hello" → ["h", "e", "l", "lo"]
5. Save vocab + merge rules
```
Result#
- Vocabulary: characters + the learned merged tokens
- Merge rules: ordered pair → new-token mappings
Encoding (tokenizing new text)#
- Pre-tokenize text → words
- Split each word into characters
- Apply merge rules in training order
- Final tokens
The inverse of training; fully deterministic.
3. Pre-tokenization — Word Boundaries#
BPE operates at the word level (original Sennrich). So we first split the text into words.
Why pre-tokenization?#
Two reasons:
1. Prevent cross-word merges
Without pre-tokenization, BPE could merge "i k" in the sentence "kedi köpek" — pieces belonging to two different words. The merge is meaningless.
2. Compute speedup
Per-word processing → a smaller working set → faster.
Pre-tokenization stratejileri#
GPT-2 regex
```python
import regex as re

pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
```
This pattern:
- Splits off English contractions ('s, 't, etc.)
- Groups letters (Unicode-aware)
- Groups numbers
- Groups punctuation
- Handles whitespace
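A simplified, ASCII-only approximation of this behavior works with the standard `re` module (the real GPT-2 pattern needs the third-party `regex` package for `\p{L}`/`\p{N}`; this sketch substitutes ASCII classes):

```python
import re

# ASCII approximation of the GPT-2 pre-tokenization pattern:
# contractions, optionally-space-prefixed letters/digits/punctuation, whitespace
pat = re.compile(r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

print(pat.findall("Hello world, it's 2024!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Note how the leading space sticks to the following word (" world") — this is exactly the whitespace behavior discussed in the edge-case section below.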
GPT-4 cl100k regex (improved)
```python
pat = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""")
```
Major change: \p{N}{1,3} — numbers are limited to 3 digits at a time, an arithmetic improvement (we saw it in Module 4.2).
For Turkish#
Turkish Unicode letters are already captured by \p{L}. But "İstanbul'da" (apostrophe + Turkish suffix) is tricky: the GPT-4 regex treats the apostrophe as an English contraction — wrong for Turkish.
Alternative: minimal pre-tokenization#
SentencePiece (Module 6.5) does not even pre-tokenize on whitespace — it is language-agnostic, and whitespace is just another character inside BPE.
Practical#
Pre-tokenization is a critical design decision. Llama 3, GPT-4, DeepSeek — each has its own pattern. A Turkish-specific tokenizer calls for a Turkish-aware pattern.
4. BPE Training Pseudo-code — Line by Line#
```
Algorithm: BPE Training
Input:  corpus (text), vocab_size
Output: vocab, merge_rules

# Step 1: Pre-tokenize
words = pre_tokenize(corpus)        # ["hello", "world", "hello", ...]

# Step 2: Initialize
vocab = unique_characters(corpus)   # {h, e, l, o, w, r, d, ...}
word_freq = count(words)            # {"hello": 5, "world": 3, ...}

# Step 3: Represent each word as a list of characters
word_repr = {}
for word, freq in word_freq.items():
    word_repr[word] = list(word)    # "hello" → ["h", "e", "l", "l", "o"]

# Step 4: Iterative merge
merge_rules = []
while len(vocab) < vocab_size:
    # 4a. Count all adjacent pairs
    pair_freq = {}
    for word, freq in word_freq.items():
        chars = word_repr[word]
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i+1])
            pair_freq[pair] = pair_freq.get(pair, 0) + freq

    # 4b. Find most frequent pair
    if not pair_freq:
        break  # nothing to merge
    best_pair = max(pair_freq, key=pair_freq.get)

    # 4c. Add to vocab
    new_token = best_pair[0] + best_pair[1]  # e.g., "h" + "e" → "he"
    vocab.add(new_token)
    merge_rules.append(best_pair)

    # 4d. Update word_repr (replace pair with merged token)
    for word in word_repr:
        chars = word_repr[word]
        new_chars = []
        i = 0
        while i < len(chars):
            if i < len(chars) - 1 and (chars[i], chars[i+1]) == best_pair:
                new_chars.append(new_token)
                i += 2
            else:
                new_chars.append(chars[i])
                i += 1
        word_repr[word] = new_chars

return vocab, merge_rules
```
Step analysis#
Step 1 (pre_tokenize): O(N) — corpus size
Step 2 (initialize): O(N) — scan for unique characters
Step 3 (word_repr): O(W) — number of unique words
Step 4 (loop):
- Each iteration: 4a + 4b + 4c + 4d
- 4a: O(N) — scan all of word_repr
- 4b: O(P) — number of unique pairs, at most O(N)
- 4d: O(N) — replace across the corpus
Total iterations = V_target - V_init ≈ V_target
Total: O(V_target × N).
Llama 3 vocab 128K, 10 GB corpus: tokenizer training takes ~hours. Highly parallelizable (HuggingFace tokenizers, in Rust).
```python
# BPE training walkthrough — tiny corpus
from collections import Counter

corpus = "low low low low low lower lower newer newer newer newest widest"
words = corpus.split()
word_freq = Counter(words)
print("Word frequencies:", word_freq)

# Initialize: each word as a list of characters
word_repr = {word: list(word) for word in word_freq}
print("Initial word repr:", {k: v for k, v in list(word_repr.items())[:3]})

# Iterative merge
merge_rules = []
vocab = set(c for word in word_freq for c in word)
print(f"Initial vocab size: {len(vocab)}")

for step in range(10):
    # 1. Count pairs
    pair_freq = Counter()
    for word, freq in word_freq.items():
        chars = word_repr[word]
        for i in range(len(chars) - 1):
            pair_freq[(chars[i], chars[i+1])] += freq
    if not pair_freq:
        break

    # 2. Find best pair
    best_pair = max(pair_freq, key=pair_freq.get)
    new_token = "".join(best_pair)
    print(f"Step {step}: best_pair={best_pair} freq={pair_freq[best_pair]} → new_token='{new_token}'")
    vocab.add(new_token)
    merge_rules.append(best_pair)

    # 3. Update word_repr
    for word in word_repr:
        chars = word_repr[word]
        new_chars = []
        i = 0
        while i < len(chars):
            if i < len(chars) - 1 and (chars[i], chars[i+1]) == best_pair:
                new_chars.append(new_token)
                i += 2
            else:
                new_chars.append(chars[i])
                i += 1
        word_repr[word] = new_chars

print("Final vocab:", sorted(vocab))
print("Merge rules:", merge_rules)
```
BPE training algorithm — step by step on a tiny corpus.
5. Merge Rules — How They Are Stored#
BPE's state is two things:
- Vocabulary: token → ID mapping (e.g. {"hello": 5, "the": 10})
- Merge rules: an ordered list of pairs (e.g. [("h","e"), ("he","l"), ("hel","lo")])
File format (HuggingFace style)#
```
{
  "vocab": {
    "<unk>": 0, "<s>": 1, "</s>": 2,
    "a": 3, "b": 4, "ab": 5, "the": 6, ...
  },
  "merges": ["a b", "t h", "th e", ...]
}
```
The merges are ordered — during encoding they are applied in this order.
Sennrich format (original)#
Two separate files:
- vocab.txt: one token per line
- merges.txt: one pair per line, in order
Modern (tiktoken)#
OpenAI tiktoken uses a single file:
<binary blob>
Optimized for fast loading.
Encoding consistency#
The same text → the same tokenization, deterministically. Because:
- Pre-tokenization is deterministic
- Merge rules are applied in a fixed order
- Tie-breaking is deterministic (first match wins)
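The storage format round-trips easily. A minimal sketch of HuggingFace-style serialization — merges stored as "left right" strings whose array position encodes training order (toy vocab for illustration, not a real tokenizer file):

```python
import json

# Toy state: a tiny vocab and one learned merge
vocab = {"<unk>": 0, "a": 1, "b": 2, "ab": 3}
merge_rules = [("a", "b")]

# Serialize: JSON arrays preserve order, so merge priority survives the round trip
blob = json.dumps({"vocab": vocab,
                   "merges": [f"{a} {b}" for a, b in merge_rules]},
                  ensure_ascii=False)

# Deserialize: rebuild the ordered pair list
loaded = json.loads(blob)
merges = [tuple(m.split(" ")) for m in loaded["merges"]]
print(merges)  # [('a', 'b')]
```

The "left right" string encoding works because BPE tokens never contain a bare space separator in this toy form; real formats handle space-containing tokens with their own escaping conventions.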
6. BPE Encoding Pseudo-code#
The reverse of training: text → token IDs.
```
Algorithm: BPE Encoding
Input:  text, vocab, merge_rules
Output: token_ids

# Step 1: Pre-tokenize
words = pre_tokenize(text)

# Step 2: For each word
result = []
for word in words:
    # 2a. Split into characters
    tokens = list(word)

    # 2b. Apply merge rules in order
    for rule in merge_rules:
        i = 0
        new_tokens = []
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == rule:
                new_tokens.append(tokens[i] + tokens[i+1])
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

    # 2c. Convert to IDs
    ids = [vocab[t] for t in tokens]
    result.extend(ids)

return result
```
Walkthrough#
Text: "hello"
Pre-tokenize: ["hello"]
Initial split: ['h', 'e', 'l', 'l', 'o']
Apply rule 1: ("e", "l") → "el"
After: ['h', 'el', 'l', 'o']
Apply rule 2: ("l", "o") → "lo"
After: ['h', 'el', 'lo']
Apply rule 3: ("el", "lo") → "ello"
After: ['h', 'ello']
Apply rule 4: ("h", "ello") → "hello"
After: ['hello']
Token IDs: vocab["hello"] → e.g., 7392
Optimization: priority queue#
The naive version above is O(L × R), where L is the word length and R is the number of merge rules. Modern implementations:
Use a priority queue of the pairs in the word. Find the highest-priority pair (earliest in merge_rules). Merge it. Update the queue. Repeat.
This brings encoding down to roughly O(L log L) per word; tiktoken and HuggingFace tokenizers reach it with Rust implementations.
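The rank idea can be sketched directly: assign each merge rule a rank (its position in merge_rules) and, at each step, merge whichever pair currently in the word has the lowest rank — GPT-2 style. For clarity this sketch rescans the token list rather than maintaining a real heap:

```python
def bpe_encode_word(word, merge_rules):
    """Encode one word by repeatedly merging the highest-priority
    (lowest-rank) pair present, instead of replaying every rule."""
    ranks = {pair: i for i, pair in enumerate(merge_rules)}
    tokens = list(word)
    while len(tokens) > 1:
        # Find the pair with the best (lowest) rank currently in the word
        pairs = [(tokens[i], tokens[i+1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no mergeable pair left
        merged, new_tokens, i = "".join(best), [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == best:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens

# Toy merge rules matching the "hello" walkthrough above
rules = [("e", "l"), ("l", "o"), ("el", "lo"), ("h", "ello")]
print(bpe_encode_word("hello", rules))  # ['hello']
```

The output matches the rule-by-rule walkthrough; a production implementation replaces the linear `min` scan with a heap to get the O(L log L) bound.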
Turkish encoding example#
"yürüyorum" (with a Turkish-trained BPE tokenizer):
- Initial: ['y', 'ü', 'r', 'ü', 'y', 'o', 'r', 'u', 'm']
- Apply rules → ['yü', 'rü', 'yor', 'um'] → ['yürü', 'yorum']
2-4 tokens, depending on how many merges fire. A standard multilingual tokenizer: 6-8 tokens. That is the compression from Turkish tuning.
7. Complexity Analysis#
Training complexity#
- N = corpus size (bytes or characters)
- V = target vocabulary size
- W = unique word count
Per iteration:
- Pair counting: O(N) — scan corpus
- Find max: O(P) where P is unique pairs (≤ N)
- Update: O(N) — scan + replace
Total iterations: V (vocab grows from initial to target).
Total training: O(V × N).
Practical:
- N = 10 GB Turkish corpus = 10^10 bytes
- V = 64K
- 10^10 × 64K = 6.4 × 10^14 operations
Single CPU: weeks. Rust + parallel (HuggingFace tokenizers): hours.
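A back-of-envelope check of these numbers (assuming a single core sustains roughly 10^9 pair operations per second — a rough, assumed figure):

```python
N = 10**10           # 10 GB corpus, in bytes
V = 64_000           # target vocab size
ops = N * V          # naive O(V·N) training cost
ops_per_sec = 10**9  # assumed single-core throughput

days = ops / ops_per_sec / 86_400
print(f"{ops:.1e} ops ≈ {days:.0f} days on one core")  # on the order of a week
```

Hence the single-CPU estimate; Rust parallelism plus the practical shortcuts real trainers use (incremental pair counts instead of full rescans) bring it down to hours.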
Encoding complexity#
Per word of length L:
- Apply rules: O(L × R) naive, O(L log L) optimized
For production inference, encoding each input takes on the order of a few milliseconds.
Memory complexity#
Training memory:
- Corpus representation: O(N)
- Word_freq: O(W)
- Pair counts: O(P) — up to O(N) but usually much less
Production tokenizer (after training): vocab + merge rules. Llama 3 tokenizer file ~5 MB.
Production patterns#
Modern tokenizer libraries (tiktoken, HF tokenizers):
- Pre-compiled C/Rust binary
- Memory-mapped file (no full load)
- Multi-threaded encoding
- BPE algorithm with priority queue
GPT-4 cl100k production encoding: ~1M tokens/sec single thread.
8. Edge Cases#
The BPE algorithm looks "simple", but there are practical edge cases:
1. Unicode normalization#
"é" can be a single codepoint (U+00E9) or "e" plus a combining accent (U+0301). The two look identical but tokenize differently. Solution: NFC normalization before tokenization.
```python
import unicodedata

text = unicodedata.normalize("NFC", text)
```
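To see why this matters — the composed and decomposed forms of "é" are distinct strings until normalized:

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single codepoint
decomposed = "e\u0301"  # 'e' + combining acute accent

print(composed == decomposed)                                # False
print(len(composed), len(decomposed))                        # 1 2
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Without NFC, the same visible word would split into different character sequences and therefore different tokens.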
2. Whitespace handling#
" the" and "the" are different tokens. GPT-2 pattern: " ?\p{L}+" — the leading space is optional, so during encoding the leading space becomes part of the word.
3. Special tokens#
Special tokens such as <s>, </s>, <pad>, <unk>, <|user|> are injected into the vocabulary by hand — BPE never learns them:
```python
vocab = {
    "<s>": 0, "</s>": 1, "<pad>": 2,
    # BPE-learned tokens
    "a": 3, "the": 4, ...
}
```
4. Numbers#
In GPT-3's tokenizer, "123456789" can be a single token. Fatal on math benchmarks — the model cannot operate on the individual digits.
GPT-4 cl100k: at most 3 digits at a time → "1234567" pre-tokenizes as ["123", "456", "7"]. Better arithmetic.
GPT-4o o200k: possibly per-digit — debated. Details in Module 4.2.
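The 3-digit chunking can be reproduced with the ASCII `\d` class (the production pattern uses Unicode `\p{N}` via the `regex` package):

```python
import re

# cl100k-style digit chunking: at most 3 digits per pre-token,
# consumed greedily left to right
print(re.findall(r"\d{1,3}", "1234567"))    # ['123', '456', '7']
print(re.findall(r"\d{1,3}", "123456789"))  # ['123', '456', '789']
```

BPE then runs inside each chunk, so no learned token can ever span more than 3 digits.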
5. Multilingual tokenization#
An English-heavy corpus → English well-tokenized, Turkish inefficient.
Modern strategy: a balanced corpus + a per-language minimum vocabulary.
6. Code tokenization#
In code there is indentation (whitespace), operators, identifiers — how does BPE handle them?
```python
def hello():
    print("Hello")
```
BPE may produce: ["def", " hello", "(", ")", ":", "\n", "    ", "print", "(", "\"", "Hello", "\"", ")"]. Indentation is represented at the token level.
7. Emoji + symbols#
"🌍" is 4 bytes in UTF-8. Byte-level BPE handles it. Common emojis may be single tokens in the vocab.
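Checking the byte count directly:

```python
emoji = "🌍"  # U+1F30D, outside the Basic Multilingual Plane
raw = emoji.encode("utf-8")
print(len(emoji), len(raw))  # 1 codepoint, 4 bytes
print(list(raw))             # four byte values — byte-level BPE starts from these
```

A byte-level tokenizer first sees those four bytes as four separate units; only a learned merge chain can collapse a frequent emoji back into one token.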
8. Whitespace-only sequences#
Triple newlines, tab indentation — BPE learns these as patterns of their own.
Practical#
Modern tokenizers (tiktoken, HF) handle these edge cases correctly. When training a custom tokenizer, test each of them.
9. Byte-level BPE Preview#
GPT-2's (2019) innovation: apply BPE at the byte level.
Classic BPE (Sennrich)#
- Initial vocab: unique characters (Unicode codepoints)
- No sub-character units
- Multilingual: requires a vocab covering every language
Byte-level BPE (GPT-2)#
- Initial vocab: 256 bytes (the UTF-8 alphabet)
- No sub-byte units
- Every input is encodable (any Unicode)
Avantajlar#
- No OOV ever
- Smaller initial vocab (256 vs 100K+)
- Robust to typos / new chars / unseen scripts
- Compact byte-level representation
Disadvantages#
- More tokens per Turkish/Chinese character (UTF-8 uses 2-4 bytes per character)
- Longer initial sequences
Modern Llama, GPT, DeepSeek — all use byte-level BPE.
Practical difference#
"hello" is pure ASCII: character-level and byte-level both start from 5 units. "Şükrü" contains multi-byte UTF-8 characters (Ş and ü are 2 bytes each), so byte-level starts from 8 bytes instead of 5 characters. Module 6.6 covers this in detail.
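The difference is easy to measure — character count vs UTF-8 byte count:

```python
for word in ["hello", "Şükrü"]:
    chars = len(word)                   # initial units for classic (char-level) BPE
    bytes_ = len(word.encode("utf-8"))  # initial units for byte-level BPE
    print(f"{word!r}: {chars} chars, {bytes_} bytes")
# 'hello': 5 chars, 5 bytes
# 'Şükrü': 5 chars, 8 bytes
```

For ASCII the two levels coincide; for Turkish (and even more so for Chinese) byte-level BPE must learn extra merges just to reassemble single characters.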
10. Bridge to Module 6.3#
This lesson explained BPE mathematically. Module 6.3:
- Implement it from scratch (200 lines of Python)
- Training + encoding + decoding, all of it
- Experiments on a Turkish corpus
- Performance: pure Python vs optimized Rust comparison
- Visuals: vocab + merge rules export
Preparation#
Before starting Module 6.3:
- Have I grasped the training pseudo-code above?
- Do I understand how the encoding pseudo-code works?
- Have I learned the edge cases?
- Keep Karpathy's "minbpe" repo open (for reference)
Karpathy minbpe#
In 2024 Karpathy published the minbpe GitHub repo: a ~200-line BPE implementation, written to be pedagogical. Sennrich's algorithm plus GPT-style byte-level. In Module 6.3 we extend this pattern into a Turkish adaptation.
11. Mini Exercises#
1. Manual BPE trace: on the corpus "ab ab cd cd cd", what is the vocab after 3 merges?
2. Complexity: V=128K, N=10GB. Estimate the training time on CPU vs GPU vs Rust.
3. Encoding trace: "the cat" — merge rules: [("t","h"), ("th","e"), (" ","c"), ("c","a")]. What is the token sequence?
4. Edge case: how does the GPT-4 (cl100k) regex pattern handle the tokenization of "İstanbul'da"?
5. Multilingual corpus: BPE training on an 80% English + 20% Turkish corpus. Does the result give fair Turkish coverage?
What Did We Learn in This Lesson?#
✓ BPE history: Gage 1994 compression → Sennrich 2016 NLP → GPT-2 byte-level
✓ High-level algorithm: iterative merge-by-frequency
✓ Pre-tokenization strategies (GPT-2 regex, GPT-4 cl100k improvements)
✓ Training pseudo-code line by line + O(V·N) complexity
✓ Encoding pseudo-code + typical O(L²)
✓ Merge-rule storage patterns
✓ 8 edge cases: Unicode normalization, whitespace, special tokens, numbers, multilingual, code, emoji
✓ Byte-level BPE preview (GPT-2, 2019)
✓ Bridge to Module 6.3: minbpe + Turkish extension
Next Lesson#
6.3 — Write BPE from Scratch in 200 Lines: Training + Encoding + Decoding
A from-scratch implementation in the style of Karpathy's "minbpe". Training on a Turkish corpus, vocab visualization, encoding/decoding verification. The inner workings of modern LLM tokenizers, hands-on.
Frequently Asked Questions
**Emergent + hand-crafted hybrid**. **Hand-crafted**: pre-tokenization regex (GPT-2/4 differ), vocab size choice, special token injection, training corpus selection — these are manual decisions. **Emergent**: actual subword boundaries — automatic from corpus frequency. Sennrich algorithm math is simple (greedy merge); but each design decision (pre-tokenization, corpus, vocab size) dramatically affects downstream LLM quality. Modern frontier tokenizers (GPT-4o o200k) are the result of years of iterative tuning.