BPE Algorithm: Sennrich 2016 Line by Line — Pseudocode, Complexity, Edge Cases
BPE mathematical anatomy. Sennrich 2016 paper line by line: pre-tokenization, byte-pair merge counting, greedy merge selection, vocabulary construction, encoding logic, complexity analysis (O(N·V)), edge cases (Unicode, whitespace, special tokens). Full understanding before implementation in Module 6.3.
Şükrü Yusuf KAYA
55 min read
Intermediate 🔬 The tokenization engine of modern LLMs
The BPE algorithm sits inside the tokenizer of GPT-2, GPT-3, GPT-4, Llama 1-4, Mistral, DeepSeek-V3 — all of them. The Sennrich 2016 NLP paper laid it out. This lesson treats the algorithm as mathematical anatomy: pseudo-code line by line, complexity proof, edge-case enumeration. After 55 minutes: you use BPE as a 'transparent box' rather than a 'black box', and can make informed decisions about your own tokenizer.
Lesson Map#
- The history of BPE: Gage 1994 → Sennrich 2016
- The algorithm at a high level: merge-by-frequency
- Pre-tokenization: word boundaries
- Training pseudo-code, line by line
- Merge rules: how they are stored
- Encoding pseudo-code: training in reverse
- Complexity analysis: O(N·V) training, O(L²) encoding
- Edge cases: Unicode, whitespace, special tokens
- Byte-level BPE (the GPT-2 variant), a preview
- Bridge to Module 6.3
1. The History of BPE#
Philip Gage 1994 — Original BPE#
Compression algorithm:
"Replace most frequent pair of bytes with a new byte (token). Repeat."
```
Input:  "aaabdaaabac"
Pair frequencies: aa=4, ab=2, bd=1, da=1, ac=1, ba=1
Most frequent: aa → replace with X
Result: "XabdXabac"
Continue...
```
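Gage's replacement loop can be sketched in a few lines of Python (a simplified illustration of the idea, not Gage's original C code):

```python
from collections import Counter

def most_frequent_pair(data: str):
    """Count all adjacent (overlapping) pairs and return the most frequent one."""
    pairs = Counter(data[i:i+2] for i in range(len(data) - 1))
    return pairs.most_common(1)[0] if pairs else None

data = "aaabdaaabac"
pair, freq = most_frequent_pair(data)
print(pair, freq)               # 'aa' 4
print(data.replace(pair, "X"))  # 'XabdXabac'
```

One round of "find the most frequent pair, replace it with a fresh symbol" — Gage's algorithm simply repeats this until no pair occurs more than once.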
Goal: data compression. Published openly in 1994; the technique is in the public domain.
Sennrich 2016 — NLP adaptation#
Rico Sennrich, Barry Haddow, Alexandra Birch (Edinburgh) — "Neural Machine Translation of Rare Words with Subword Units", ACL 2016.
Adaptation:
- Operate on words (not bytes)
- Learn sub-word units
- Solve the out-of-vocabulary problem
Motivation: NMT models couldn't handle rare/morphologically rich words. BPE solves this.
GPT-2 byte-level adaptation (2019)#
OpenAI: apply BPE at the byte level. Start the vocabulary from the 256 bytes and never produce sub-byte units. Full UTF-8 coverage.
Modern modifications#
- GPT-4 cl100k: improved digit handling
- GPT-4o o200k: better multilingual coverage
- Llama 3 tiktoken-style: similar to OpenAI
The core BPE algorithm in all of them is Sennrich 2016; the modifications are in the training corpus and pre-tokenization details.
2. The BPE Algorithm at a High Level#
An iterative algorithm: find the most frequent adjacent pair, merge it, repeat.
Pseudo-code (high-level)#
```
1. Initialize vocab = {all characters/bytes}
2. Pre-tokenize text into words
3. For each word, split into characters:
   "hello" → ["h", "e", "l", "l", "o"]
4. Repeat until vocab_size hit:
   a. Count all adjacent pairs across corpus
   b. Find most frequent pair (e.g., "lo" appears 1000 times)
   c. Add this pair as new token to vocab
   d. Replace all occurrences in corpus:
      "hello" → ["h", "e", "l", "lo"]
5. Save vocab + merge rules
```
Result#
- Vocabulary: characters + the learned merged tokens
- Merge rules: ordered pair → new-token mappings
Encoding (tokenizing new text)#
- Pre-tokenize text → words
- Split each word into characters
- Apply merge rules in training order
- Final tokens
The inverse of training; fully deterministic.
3. Pre-tokenization — Word Boundaries#
BPE operates at the word level (original Sennrich). So we first split the text into words.
Why pre-tokenization?#
Two reasons:
1. Prevent cross-word merges
Without pre-tokenization, BPE could merge "i k" in the sentence "kedi köpek" — pieces belonging to two different words. The merge is meaningless.
2. Compute speedup
Per-word processing → a smaller working set → faster.
Pre-tokenization stratejileri#
GPT-2 regex
```python
import regex as re

pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
```
This pattern:
- Splits off English contractions ('s, 't, etc.)
- Groups letters (Unicode-aware)
- Groups numbers
- Groups punctuation
- Handles whitespace
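A simplified, ASCII-only approximation of this behavior works with the standard `re` module (the real GPT-2 pattern needs the third-party `regex` package for `\p{L}`/`\p{N}`; this sketch substitutes ASCII classes):

```python
import re

# ASCII approximation of the GPT-2 pre-tokenization pattern:
# contractions, optionally-space-prefixed letters/digits/punctuation, whitespace
pat = re.compile(r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

print(pat.findall("Hello world, it's 2024!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Note how the leading space sticks to the following word (" world") — this is exactly the whitespace behavior discussed in the edge-case section below.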
GPT-4 cl100k regex (improved)
```python
pat = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""")
```
Major change: \p{N}{1,3} — numbers are limited to 3 digits at a time, an arithmetic improvement (we saw it in Module 4.2).
For Turkish#
Turkish Unicode letters are already captured by \p{L}. But "İstanbul'da" (apostrophe + Turkish suffix) is tricky: the GPT-4 regex treats the apostrophe as an English contraction — wrong for Turkish.
Alternative: minimal pre-tokenization#
SentencePiece (Module 6.5) does not even pre-tokenize on whitespace — it is language-agnostic, and whitespace is just another character inside BPE.
Practical#
Pre-tokenization is a critical design decision. Llama 3, GPT-4, DeepSeek — each has its own pattern. A Turkish-specific tokenizer calls for a Turkish-aware pattern.
4. BPE Training Pseudo-code — Line by Line#
```
Algorithm: BPE Training
Input:  corpus (text), vocab_size
Output: vocab, merge_rules

# Step 1: Pre-tokenize
words = pre_tokenize(corpus)        # ["hello", "world", "hello", ...]

# Step 2: Initialize
vocab = unique_characters(corpus)   # {h, e, l, o, w, r, d, ...}
word_freq = count(words)            # {"hello": 5, "world": 3, ...}

# Step 3: Represent each word as a list of characters
word_repr = {}
for word, freq in word_freq.items():
    word_repr[word] = list(word)    # "hello" → ["h", "e", "l", "l", "o"]

# Step 4: Iterative merge
merge_rules = []
while len(vocab) < vocab_size:
    # 4a. Count all adjacent pairs
    pair_freq = {}
    for word, freq in word_freq.items():
        chars = word_repr[word]
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i+1])
            pair_freq[pair] = pair_freq.get(pair, 0) + freq

    # 4b. Find most frequent pair
    if not pair_freq:
        break  # nothing to merge
    best_pair = max(pair_freq, key=pair_freq.get)

    # 4c. Add to vocab
    new_token = best_pair[0] + best_pair[1]  # e.g., "h" + "e" → "he"
    vocab.add(new_token)
    merge_rules.append(best_pair)

    # 4d. Update word_repr (replace pair with merged token)
    for word in word_repr:
        chars = word_repr[word]
        new_chars = []
        i = 0
        while i < len(chars):
            if i < len(chars) - 1 and (chars[i], chars[i+1]) == best_pair:
                new_chars.append(new_token)
                i += 2
            else:
                new_chars.append(chars[i])
                i += 1
        word_repr[word] = new_chars

return vocab, merge_rules
```
Step analysis#
Step 1 (pre_tokenize): O(N) — corpus size
Step 2 (initialize): O(N) — scan for unique characters
Step 3 (word_repr): O(W) — number of unique words
Step 4 (loop):
- Each iteration: 4a + 4b + 4c + 4d
- 4a: O(N) — scan all of word_repr
- 4b: O(P) — number of unique pairs, at most O(N)
- 4d: O(N) — replace across the corpus
Total iterations = V_target - V_init ≈ V_target
Total: O(V_target × N).
Llama 3 vocab 128K, 10 GB corpus: tokenizer training takes ~hours. Highly parallelizable (HuggingFace tokenizers, in Rust).
```python
# BPE training walkthrough — tiny corpus
from collections import Counter

corpus = "low low low low low lower lower newer newer newer newest widest"
words = corpus.split()
word_freq = Counter(words)
print("Word frequencies:", word_freq)

# Initialize: each word as a list of characters
word_repr = {word: list(word) for word in word_freq}
print("Initial word repr:", {k: v for k, v in list(word_repr.items())[:3]})

# Iterative merge
merge_rules = []
vocab = set(c for word in word_freq for c in word)
print(f"Initial vocab size: {len(vocab)}")

for step in range(10):
    # 1. Count pairs
    pair_freq = Counter()
    for word, freq in word_freq.items():
        chars = word_repr[word]
        for i in range(len(chars) - 1):
            pair_freq[(chars[i], chars[i+1])] += freq
    if not pair_freq:
        break

    # 2. Find best pair
    best_pair = max(pair_freq, key=pair_freq.get)
    new_token = "".join(best_pair)
    print(f"Step {step}: best_pair={best_pair} freq={pair_freq[best_pair]} → new_token='{new_token}'")
    vocab.add(new_token)
    merge_rules.append(best_pair)

    # 3. Update word_repr
    for word in word_repr:
        chars = word_repr[word]
        new_chars = []
        i = 0
        while i < len(chars):
            if i < len(chars) - 1 and (chars[i], chars[i+1]) == best_pair:
                new_chars.append(new_token)
                i += 2
            else:
                new_chars.append(chars[i])
                i += 1
        word_repr[word] = new_chars

print("Final vocab:", sorted(vocab))
print("Merge rules:", merge_rules)
```
BPE training algorithm — step by step on a tiny corpus.
5. Merge Rules — How They Are Stored#
BPE's state is two things:
- Vocabulary: token → ID mapping (e.g. {"hello": 5, "the": 10})
- Merge rules: an ordered list of pairs (e.g. [("h","e"), ("he","l"), ("hel","lo")])
File format (HuggingFace style)#
```
{
  "vocab": {
    "<unk>": 0, "<s>": 1, "</s>": 2,
    "a": 3, "b": 4, "ab": 5, "the": 6, ...
  },
  "merges": ["a b", "t h", "th e", ...]
}
```
The merges are ordered — during encoding they are applied in this order.
Sennrich format (original)#
Two separate files:
- vocab.txt: one token per line
- merges.txt: one pair per line, in order
Modern (tiktoken)#
OpenAI tiktoken uses a single file:
<binary blob>
Optimized for fast loading.
Encoding consistency#
The same text → the same tokenization, deterministically. Because:
- Pre-tokenization is deterministic
- Merge rules are applied in a fixed order
- Tie-breaking is deterministic (first match wins)
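The storage format round-trips easily. A minimal sketch of HuggingFace-style serialization — merges stored as "left right" strings whose array position encodes training order (toy vocab for illustration, not a real tokenizer file):

```python
import json

# Toy state: a tiny vocab and one learned merge
vocab = {"<unk>": 0, "a": 1, "b": 2, "ab": 3}
merge_rules = [("a", "b")]

# Serialize: JSON arrays preserve order, so merge priority survives the round trip
blob = json.dumps({"vocab": vocab,
                   "merges": [f"{a} {b}" for a, b in merge_rules]},
                  ensure_ascii=False)

# Deserialize: rebuild the ordered pair list
loaded = json.loads(blob)
merges = [tuple(m.split(" ")) for m in loaded["merges"]]
print(merges)  # [('a', 'b')]
```

The "left right" string encoding works because BPE tokens never contain a bare space separator in this toy form; real formats handle space-containing tokens with their own escaping conventions.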
6. BPE Encoding Pseudo-code#
The reverse of training: text → token IDs.
```
Algorithm: BPE Encoding
Input:  text, vocab, merge_rules
Output: token_ids

# Step 1: Pre-tokenize
words = pre_tokenize(text)

# Step 2: For each word
result = []
for word in words:
    # 2a. Split into characters
    tokens = list(word)

    # 2b. Apply merge rules in order
    for rule in merge_rules:
        i = 0
        new_tokens = []
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == rule:
                new_tokens.append(tokens[i] + tokens[i+1])
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

    # 2c. Convert to IDs
    ids = [vocab[t] for t in tokens]
    result.extend(ids)

return result
```
Walkthrough#
Text: "hello"
Pre-tokenize: ["hello"]
Initial split: ['h', 'e', 'l', 'l', 'o']
Apply rule 1: ("e", "l") → "el"
After: ['h', 'el', 'l', 'o']
Apply rule 2: ("l", "o") → "lo"
After: ['h', 'el', 'lo']
Apply rule 3: ("el", "lo") → "ello"
After: ['h', 'ello']
Apply rule 4: ("h", "ello") → "hello"
After: ['hello']
Token IDs: vocab["hello"] → e.g., 7392
Optimization: priority queue#
The naive version above is O(L × R), where L is the word length and R is the number of merge rules. Modern implementations:
Use a priority queue of the pairs in the word. Find the highest-priority pair (earliest in merge_rules). Merge it. Update the queue. Repeat.
This brings encoding down to roughly O(L log L) per word; tiktoken and HuggingFace tokenizers reach it with Rust implementations.
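The rank idea can be sketched directly: assign each merge rule a rank (its position in merge_rules) and, at each step, merge whichever pair currently in the word has the lowest rank — GPT-2 style. For clarity this sketch rescans the token list rather than maintaining a real heap:

```python
def bpe_encode_word(word, merge_rules):
    """Encode one word by repeatedly merging the highest-priority
    (lowest-rank) pair present, instead of replaying every rule."""
    ranks = {pair: i for i, pair in enumerate(merge_rules)}
    tokens = list(word)
    while len(tokens) > 1:
        # Find the pair with the best (lowest) rank currently in the word
        pairs = [(tokens[i], tokens[i+1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no mergeable pair left
        merged, new_tokens, i = "".join(best), [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == best:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens

# Toy merge rules matching the "hello" walkthrough above
rules = [("e", "l"), ("l", "o"), ("el", "lo"), ("h", "ello")]
print(bpe_encode_word("hello", rules))  # ['hello']
```

The output matches the rule-by-rule walkthrough; a production implementation replaces the linear `min` scan with a heap to get the O(L log L) bound.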
Turkish encoding example#
"yürüyorum" (with a Turkish-trained BPE tokenizer):
- Initial: ['y', 'ü', 'r', 'ü', 'y', 'o', 'r', 'u', 'm']
- Apply rules → ['yü', 'rü', 'yor', 'um'] → ['yürü', 'yorum']
2-4 tokens, depending on how many merges fire. A standard multilingual tokenizer: 6-8 tokens. That is the compression from Turkish tuning.
7. Complexity Analysis#
Training complexity#
- N = corpus size (bytes or characters)
- V = target vocabulary size
- W = unique word count
Per iteration:
- Pair counting: O(N) — scan corpus
- Find max: O(P) where P is unique pairs (≤ N)
- Update: O(N) — scan + replace
Total iterations: V (vocab grows from initial to target).
Total training: O(V × N).
Practical:
- N = 10 GB Turkish corpus = 10^10 bytes
- V = 64K
- 10^10 × 64K = 6.4 × 10^14 operations
Single CPU: weeks. Rust + parallel (HuggingFace tokenizers): hours.
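A back-of-envelope check of these numbers (assuming a single core sustains roughly 10^9 pair operations per second — a rough, assumed figure):

```python
N = 10**10           # 10 GB corpus, in bytes
V = 64_000           # target vocab size
ops = N * V          # naive O(V·N) training cost
ops_per_sec = 10**9  # assumed single-core throughput

days = ops / ops_per_sec / 86_400
print(f"{ops:.1e} ops ≈ {days:.0f} days on one core")  # on the order of a week
```

Hence the single-CPU estimate; Rust parallelism plus the practical shortcuts real trainers use (incremental pair counts instead of full rescans) bring it down to hours.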
Encoding complexity#
Per word of length L:
- Apply rules: O(L × R) naive, O(L log L) optimized
For production inference, encoding each input takes on the order of a few milliseconds.
Memory complexity#
Training memory:
- Corpus representation: O(N)
- Word_freq: O(W)
- Pair counts: O(P) — up to O(N) but usually much less
Production tokenizer (after training): vocab + merge rules. Llama 3 tokenizer file ~5 MB.
Production patterns#
Modern tokenizer libraries (tiktoken, HF tokenizers):
- Pre-compiled C/Rust binary
- Memory-mapped file (no full load)
- Multi-threaded encoding
- BPE algorithm with priority queue
GPT-4 cl100k production encoding: ~1M tokens/sec single thread.
8. Edge Cases#
The BPE algorithm looks "simple", but there are practical edge cases:
1. Unicode normalization#
"é" can be a single codepoint (U+00E9) or "e" plus a combining accent (U+0301). The two look identical but tokenize differently. Solution: NFC normalization before tokenization.
```python
import unicodedata

text = unicodedata.normalize("NFC", text)
```
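To see why this matters — the composed and decomposed forms of "é" are distinct strings until normalized:

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single codepoint
decomposed = "e\u0301"  # 'e' + combining acute accent

print(composed == decomposed)                                # False
print(len(composed), len(decomposed))                        # 1 2
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Without NFC, the same visible word would split into different character sequences and therefore different tokens.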
2. Whitespace handling#
" the" and "the" are different tokens. GPT-2 pattern: " ?\p{L}+" — the leading space is optional, so during encoding the leading space becomes part of the word.
3. Special tokens#
Special tokens such as <s>, </s>, <pad>, <unk>, <|user|> are injected into the vocabulary by hand — BPE never learns them:
```python
vocab = {
    "<s>": 0, "</s>": 1, "<pad>": 2,
    # BPE-learned tokens
    "a": 3, "the": 4, ...
}
```
4. Numbers#
In GPT-3's tokenizer, "123456789" can be a single token. Fatal on math benchmarks — the model cannot operate on the individual digits.
GPT-4 cl100k: at most 3 digits at a time → "1234567" pre-tokenizes as ["123", "456", "7"]. Better arithmetic.
GPT-4o o200k: possibly per-digit — debated. Details in Module 4.2.
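The 3-digit chunking can be reproduced with the ASCII `\d` class (the production pattern uses Unicode `\p{N}` via the `regex` package):

```python
import re

# cl100k-style digit chunking: at most 3 digits per pre-token,
# consumed greedily left to right
print(re.findall(r"\d{1,3}", "1234567"))    # ['123', '456', '7']
print(re.findall(r"\d{1,3}", "123456789"))  # ['123', '456', '789']
```

BPE then runs inside each chunk, so no learned token can ever span more than 3 digits.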
5. Multilingual tokenization#
An English-heavy corpus → English well-tokenized, Turkish inefficient.
Modern strategy: a balanced corpus + a per-language minimum vocabulary.
6. Code tokenization#
In code there is indentation (whitespace), operators, identifiers — how does BPE handle them?
```python
def hello():
    print("Hello")
```
BPE may produce: ["def", " hello", "(", ")", ":", "\n", "    ", "print", "(", "\"", "Hello", "\"", ")"]. Indentation is represented at the token level.
7. Emoji + symbols#
"🌍" is 4 bytes in UTF-8. Byte-level BPE handles it. Common emojis may be single tokens in the vocab.
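Checking the byte count directly:

```python
emoji = "🌍"  # U+1F30D, outside the Basic Multilingual Plane
raw = emoji.encode("utf-8")
print(len(emoji), len(raw))  # 1 codepoint, 4 bytes
print(list(raw))             # four byte values — byte-level BPE starts from these
```

A byte-level tokenizer first sees those four bytes as four separate units; only a learned merge chain can collapse a frequent emoji back into one token.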
8. Whitespace-only sequences#
Triple newlines, tab indentation — BPE learns these as patterns of their own.
Practical#
Modern tokenizers (tiktoken, HF) handle these edge cases correctly. When training a custom tokenizer, test each of them.
9. Byte-level BPE Preview#
GPT-2's (2019) innovation: apply BPE at the byte level.
Classic BPE (Sennrich)#
- Initial vocab: unique characters (Unicode codepoints)
- No sub-character units
- Multilingual: requires a vocab covering every language
Byte-level BPE (GPT-2)#
- Initial vocab: 256 bytes (the UTF-8 alphabet)
- No sub-byte units
- Every input is encodable (any Unicode)
Avantajlar#
- No OOV ever
- Smaller initial vocab (256 vs 100K+)
- Robust to typos / new chars / unseen scripts
- Compact byte-level representation
Disadvantages#
- More tokens per Turkish/Chinese character (UTF-8 uses 2-4 bytes per character)
- Longer initial sequences
Modern Llama, GPT, DeepSeek — all use byte-level BPE.
Practical difference#
"hello" is pure ASCII: character-level and byte-level both start from 5 units. "Şükrü" contains multi-byte UTF-8 characters (Ş and ü are 2 bytes each), so byte-level starts from 8 bytes instead of 5 characters. Module 6.6 covers this in detail.
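The difference is easy to measure — character count vs UTF-8 byte count:

```python
for word in ["hello", "Şükrü"]:
    chars = len(word)                   # initial units for classic (char-level) BPE
    bytes_ = len(word.encode("utf-8"))  # initial units for byte-level BPE
    print(f"{word!r}: {chars} chars, {bytes_} bytes")
# 'hello': 5 chars, 5 bytes
# 'Şükrü': 5 chars, 8 bytes
```

For ASCII the two levels coincide; for Turkish (and even more so for Chinese) byte-level BPE must learn extra merges just to reassemble single characters.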
10. Bridge to Module 6.3#
This lesson explained BPE mathematically. Module 6.3:
- Implement it from scratch (200 lines of Python)
- Training + encoding + decoding, all of it
- Experiments on a Turkish corpus
- Performance: pure Python vs optimized Rust comparison
- Visuals: vocab + merge rules export
Preparation#
Before starting Module 6.3:
- Have I grasped the training pseudo-code above?
- Do I understand how the encoding pseudo-code works?
- Have I learned the edge cases?
- Keep Karpathy's "minbpe" repo open (for reference)
Karpathy minbpe#
In 2024 Karpathy published the minbpe GitHub repo: a ~200-line BPE implementation, written to be pedagogical. Sennrich's algorithm plus GPT-style byte-level. In Module 6.3 we extend this pattern into a Turkish adaptation.
11. Mini Exercises#
1. Manual BPE trace: on the corpus "ab ab cd cd cd", what is the vocab after 3 merges?
2. Complexity: V=128K, N=10GB. Estimate the training time on CPU vs GPU vs Rust.
3. Encoding trace: "the cat" — merge rules: [("t","h"), ("th","e"), (" ","c"), ("c","a")]. What is the token sequence?
4. Edge case: how does the GPT-4 (cl100k) regex pattern handle the tokenization of "İstanbul'da"?
5. Multilingual corpus: BPE training on an 80% English + 20% Turkish corpus. Does the result give fair Turkish coverage?
What Did We Learn in This Lesson?#
✓ BPE history: Gage 1994 compression → Sennrich 2016 NLP → GPT-2 byte-level
✓ High-level algorithm: iterative merge-by-frequency
✓ Pre-tokenization strategies (GPT-2 regex, GPT-4 cl100k improvements)
✓ Training pseudo-code line by line + O(V·N) complexity
✓ Encoding pseudo-code + typical O(L²)
✓ Merge-rule storage patterns
✓ 8 edge cases: Unicode normalization, whitespace, special tokens, numbers, multilingual, code, emoji
✓ Byte-level BPE preview (GPT-2, 2019)
✓ Bridge to Module 6.3: minbpe + Turkish extension
Next Lesson#
6.3 — Write BPE from Scratch in 200 Lines: Training + Encoding + Decoding
A from-scratch implementation in the style of Karpathy's "minbpe". Training on a Turkish corpus, vocab visualization, encoding/decoding verification. The inner workings of modern LLM tokenizers, hands-on.
Frequently Asked Questions
**Emergent + hand-crafted hybrid**. **Hand-crafted**: pre-tokenization regex (GPT-2/4 differ), vocab size choice, special token injection, training corpus selection — these are manual decisions. **Emergent**: actual subword boundaries — automatic from corpus frequency. Sennrich algorithm math is simple (greedy merge); but each design decision (pre-tokenization, corpus, vocab size) dramatically affects downstream LLM quality. Modern frontier tokenizers (GPT-4o o200k) are the result of years of iterative tuning.