
The BPE Algorithm: Sennrich 2016 Line by Line (Pseudocode, Complexity, Edge Cases)

The mathematical anatomy of BPE. The Sennrich 2016 paper, line by line: pre-tokenization, byte-pair merge counting, greedy merge selection, vocabulary construction, encoding logic, complexity analysis (O(N·V)), and edge cases (Unicode, whitespace, special tokens). Full understanding before implementing in Module 6.3.

Şükrü Yusuf KAYA
55-minute read
Intermediate
🔬 The tokenization engine of modern LLMs
The BPE algorithm sits in the tokenizer of GPT-2, GPT-3, GPT-4, Llama 1-4, Mistral, and DeepSeek-V3. The Sennrich 2016 NLP paper introduced it to the field. This lesson dissects the algorithm mathematically: pseudocode line by line, complexity analysis, and an enumeration of edge cases. After 55 minutes you should be able to use BPE as a transparent box rather than a black box, and make informed decisions about your own tokenizer.

Lesson Map#

  1. The history of BPE: Gage 1994 → Sennrich 2016
  2. The algorithm at a high level: merge-by-frequency
  3. Pre-tokenization: word boundaries
  4. Training pseudocode, line by line
  5. Merge rules: how they're stored
  6. Encoding pseudocode: training in reverse
  7. Complexity analysis: O(N·V) training, O(L²) encoding
  8. Edge cases: Unicode, whitespace, special tokens
  9. Byte-level BPE (the GPT-2 variant): a preview
  10. Bridge to Module 6.3

1. The History of BPE#

Philip Gage 1994 — Original BPE#

Compression algorithm:
"Replace most frequent pair of bytes with a new byte (token). Repeat."
Input: "aaabdaaabac" Frequencies: aa=4, ab=2, bd=1, da=1, ac=1, ba=1 Most frequent: aa → replace with X Result: "XabdXabac" Continue...
Goal: data compression. Gage published the algorithm openly in The C Users Journal (1994).
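Gage's step is easy to sketch in a few lines of Python. This is a toy sketch of the trace above, using non-overlapping left-to-right replacement:
python
from collections import Counter

def gage_step(data: str, new_symbol: str) -> str:
    # Count adjacent character pairs and replace the most frequent one
    pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
    best, freq = pairs.most_common(1)[0]
    print(f"most frequent pair: {best!r} (x{freq})")
    return data.replace(best, new_symbol)

print(gage_step("aaabdaaabac", "X"))  # XabdXabac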

Sennrich 2016 — NLP adaptation#

Rico Sennrich, Barry Haddow, Alexandra Birch (Edinburgh) — "Neural Machine Translation of Rare Words with Subword Units", ACL 2016.
Adaptation:
  1. Operate on words (not raw bytes)
  2. Learn subword units
  3. Solve the out-of-vocabulary problem
Motivation: NMT models couldn't handle rare/morphologically rich words. BPE solves this.

GPT-2 byte-level adaptation (2019)#

OpenAI applied BPE at the byte level: start the vocabulary from the 256 possible byte values and never produce sub-byte units. This gives complete UTF-8 coverage.

Modern modifications#

  • GPT-4 cl100k: improved digit handling
  • GPT-4o o200k: better multilingual coverage
  • Llama 3 tiktoken-style: similar to OpenAI
All of them run the core Sennrich 2016 BPE algorithm; the modifications lie in the training corpus and the pre-tokenization details.

2. The BPE Algorithm at a High Level#

An iterative algorithm: find the most frequent byte pair, merge it, repeat.

Pseudo-code (high-level)#

1. Initialize vocab = {all characters/bytes}
2. Pre-tokenize text into words
3. For each word, split into characters: "hello" → ["h", "e", "l", "l", "o"]
4. Repeat until vocab_size hit:
   a. Count all adjacent byte-pairs across corpus
   b. Find most frequent pair (e.g., "lo" appears 1000 times)
   c. Add this pair as new token to vocab
   d. Replace all occurrences in corpus: "hello" → ["h", "e", "l", "lo"]
5. Save vocab + merge rules

Result#

  • Vocabulary: characters + learned merged tokens
  • Merge rules: an ordered list of pair → new-token mappings

Encoding (tokenizing new text)#

  1. Pre-tokenize text → words
  2. Split each word into characters
  3. Apply merge rules in training order
  4. Final tokens
Training in reverse, and fully deterministic.

3. Pre-tokenization: Word Boundaries#

BPE operates at the word level (in Sennrich's original version), so we first split the text into words.

Why pre-tokenize?#

Two reasons:

1. Preventing cross-word merges

Without pre-tokenization, BPE could merge "i k" in the sentence "kedi köpek" ("cat dog"), fusing pieces of two different words. Such a merge is meaningless.

2. Compute speedup

Per-word processing → smaller working set → faster.

Pre-tokenization strategies#

GPT-2 regex

import regex as re

pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
This pattern:
  • splits off English contractions ('s, 't, etc.)
  • groups letters (Unicode-aware)
  • groups numbers
  • groups punctuation
  • handles whitespace

GPT-4 cl100k regex (improved)

pat = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""")
Major change: \p{N}{1,3} limits number matches to at most 3 digits at a time, an arithmetic improvement (seen in Module 4.2).

For Turkish

Turkish letters are already captured by the Unicode class \p{L}. But "İstanbul'da" (apostrophe plus a Turkish suffix) is tricky: the GPT-4 regex treats the apostrophe as an English contraction, which is wrong for Turkish.
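A quick check of this behavior with the GPT-2 pattern from above (the GPT-4 pattern handles 'd the same way). A sketch: the sample sentence is ours, and the third-party regex module is required for the \p{...} classes:
python
import regex as re  # pip install regex

pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

print(pat.findall("İstanbul'da hava 23 derece"))
# expected: ['İstanbul', "'d", 'a', ' hava', ' 23', ' derece']
# "'d" matches as an English contraction, wrongly splitting the suffix "da"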

Alternative: minimal pre-tokenization#

SentencePiece (Module 6.5) doesn't even pre-tokenize on whitespace; it is language-agnostic, and whitespace is just another character inside BPE.

Practical#

Pre-tokenization is a critical design decision. Llama 3, GPT-4, DeepSeek: each has its own pattern. A Turkish-specific tokenizer calls for a Turkish-aware pattern.

4. BPE Training Pseudocode, Line by Line#

Algorithm: BPE Training
Input: corpus (text), vocab_size
Output: vocab, merge_rules

# Step 1: Pre-tokenize
words = pre_tokenize(corpus)        # ["hello", "world", "hello", ...]

# Step 2: Initialize
vocab = unique_characters(corpus)   # {h, e, l, o, w, r, d, ...}
word_freq = count(words)            # {"hello": 5, "world": 3, ...}

# Step 3: Represent each word as a list of characters
word_repr = {}
for word, freq in word_freq:
    word_repr[word] = list(word)    # "hello" → ["h", "e", "l", "l", "o"]

# Step 4: Iterative merge
merge_rules = []
while len(vocab) < vocab_size:
    # 4a. Count all adjacent pairs
    pair_freq = {}
    for word, freq in word_freq.items():
        chars = word_repr[word]
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i+1])
            pair_freq[pair] = pair_freq.get(pair, 0) + freq

    # 4b. Find the most frequent pair
    if not pair_freq:
        break  # nothing left to merge
    best_pair = max(pair_freq, key=pair_freq.get)

    # 4c. Add to vocab
    new_token = best_pair[0] + best_pair[1]  # e.g., "h" + "e" → "he"
    vocab.add(new_token)
    merge_rules.append(best_pair)

    # 4d. Update word_repr (replace the pair with the merged token)
    for word in word_repr:
        chars = word_repr[word]
        new_chars = []
        i = 0
        while i < len(chars):
            if i < len(chars) - 1 and (chars[i], chars[i+1]) == best_pair:
                new_chars.append(new_token)
                i += 2
            else:
                new_chars.append(chars[i])
                i += 1
        word_repr[word] = new_chars

return vocab, merge_rules

Step analysis#

Step 1 (pre_tokenize): O(N), N = corpus size
Step 2 (initialize): O(N) to collect unique characters
Step 3 (word_repr): O(W), W = unique word count
Step 4 (loop):
  • Each iteration: 4a + 4b + 4c + 4d
  • 4a: O(N), scanning all of word_repr
  • 4b: O(P), P = number of unique pairs, at most O(N)
  • 4d: O(N) corpus replace
Total iterations = V_target - V_init ≈ V_target
Overall: O(V_target × N).
Llama 3's vocab is 128K; on a 10 GB corpus, tokenizer training takes on the order of hours. It parallelizes well (HuggingFace tokenizers is written in Rust).
python
# BPE training walkthrough on a tiny corpus
from collections import Counter

corpus = "low low low low low lower lower newer newer newer newest widest"
words = corpus.split()
word_freq = Counter(words)
print("Word frequencies:", word_freq)

# Initialize: represent each word as a list of characters
word_repr = {word: list(word) for word in word_freq}
print("Initial word repr:", {k: v for k, v in list(word_repr.items())[:3]})

# Iterative merge
merge_rules = []
vocab = set(c for word in word_freq for c in word)
print(f"Initial vocab size: {len(vocab)}")

for step in range(10):
    # 1. Count pairs
    pair_freq = Counter()
    for word, freq in word_freq.items():
        chars = word_repr[word]
        for i in range(len(chars) - 1):
            pair_freq[(chars[i], chars[i+1])] += freq

    if not pair_freq:
        break

    # 2. Find the best pair
    best_pair = max(pair_freq, key=pair_freq.get)
    new_token = "".join(best_pair)
    print(f"Step {step}: best_pair={best_pair} freq={pair_freq[best_pair]} → new_token='{new_token}'")
    vocab.add(new_token)
    merge_rules.append(best_pair)

    # 3. Update word_repr
    for word in word_repr:
        chars = word_repr[word]
        new_chars = []
        i = 0
        while i < len(chars):
            if i < len(chars) - 1 and (chars[i], chars[i+1]) == best_pair:
                new_chars.append(new_token)
                i += 2
            else:
                new_chars.append(chars[i])
                i += 1
        word_repr[word] = new_chars

print("Final vocab:", sorted(vocab))
print("Merge rules:", merge_rules)
The BPE training algorithm, step by step on a tiny corpus.

5. Merge Rules: How They're Stored#

BPE's state is two things:
  1. Vocabulary: token → ID mapping (e.g. {"hello": 5, "the": 10})
  2. Merge rules: an ordered list of pairs (e.g. [("h","e"), ("he","l"), ("hel","lo")])

File format (HuggingFace style)#

{ "vocab": { "<unk>": 0, "<s>": 1, "</s>": 2, "a": 3, "b": 4, "ab": 5, "the": 6, ... }, "merges": [ "a b", "t h", "th e", ... ] }
The merges are ordered; during encoding they are applied in exactly this order.

Sennrich format (original)#

Two separate files:
  • vocab.txt: one token per line
  • merges.txt: one pair per line, in merge order

Modern (tiktoken)#

OpenAI's tiktoken uses a single file:
<binary blob>
Optimized for fast loading.

Encoding consistency#

The same text always produces the same tokenization; encoding is deterministic. Because:
  1. Pre-tokenization is deterministic
  2. Merge rules are applied in a fixed order
  3. Tie-breaking is deterministic (first match wins)
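Tying the section together, a small sketch of loading the HuggingFace-style format shown above into Python; the tokenizer.json filename and layout follow the example JSON, so treat them as illustrative:
python
import json

# Hypothetical tokenizer.json in the layout shown above
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

vocab = data["vocab"]  # token -> ID
# Each merge line "a b" becomes a pair; its position is its priority (rank)
merge_ranks = {tuple(m.split()): rank for rank, m in enumerate(data["merges"])}
print(len(vocab), "tokens,", len(merge_ranks), "merge rules")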

6. BPE Encoding Pseudocode#

Training in reverse: text → token IDs.
Algorithm: BPE Encoding
Input: text, vocab, merge_rules
Output: token_ids

# Step 1: Pre-tokenize
words = pre_tokenize(text)

# Step 2: For each word
result = []
for word in words:
    # 2a. Split into characters
    tokens = list(word)

    # 2b. Apply merge rules in order
    for rule in merge_rules:
        i = 0
        new_tokens = []
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i+1]) == rule:
                new_tokens.append(tokens[i] + tokens[i+1])
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

    # 2c. Convert to IDs
    ids = [vocab[t] for t in tokens]
    result.extend(ids)

return result

Walkthrough#

Text: "hello" Pre-tokenize: ["hello"] Initial split: ['h', 'e', 'l', 'l', 'o']
Apply rule 1: ("e", "l") → "el" After: ['h', 'el', 'l', 'o']
Apply rule 2: ("l", "o") → "lo" After: ['h', 'el', 'lo']
Apply rule 3: ("el", "lo") → "ello" After: ['h', 'ello']
Apply rule 4: ("h", "ello") → "hello" After: ['hello']
Token IDs:
vocab["hello"]
→ e.g., 7392

Optimization: priority queue#

The naive version above is O(L × R), where L is the word length and R the number of merge rules. Modern implementations instead keep the pairs present in the word in a priority queue keyed by merge priority: find the highest-priority pair (the one earliest in merge_rules), merge it, update the queue, repeat. A simple rescan-per-merge variant is O(L²); with the priority queue, encoding drops to roughly O(L log L) per word. tiktoken and HuggingFace tokenizers implement this in Rust.
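A minimal sketch of this rank-based greedy encoding (GPT-2 style), assuming a merge_ranks dict that maps each pair to its priority (lower rank = earlier merge), as in the loading sketch above:
python
def bpe_encode_word(word, merge_ranks):
    tokens = list(word)
    while len(tokens) > 1:
        # Find the adjacent pair with the lowest rank (earliest-learned merge)
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merge rule remains
        # Merge every occurrence of the best pair
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Ranks corresponding to the walkthrough above
ranks = {("e", "l"): 0, ("l", "o"): 1, ("el", "lo"): 2, ("h", "ello"): 3}
print(bpe_encode_word("hello", ranks))  # ['hello']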

Turkish encoding example#

"yürüyorum" (BPE-Türkçe with Türkçe-trained tokenizer):
  • Initial: ['y', 'ü', 'r', 'ü', 'y', 'o', 'r', 'u', 'm']
  • Apply rules → ['yü', 'rü', 'yor', 'um'] → ['yürü', 'yorum']
4-5 token. Standard multilingual: 6-8 token. Türkçe-tuned ile compression.

7. Complexity Analysis#

Training complexity#

  • N = corpus size (bytes or characters)
  • V = target vocabulary size
  • W = unique word count
Per iteration:
  • Pair counting: O(N) — scan corpus
  • Find max: O(P) where P is unique pairs (≤ N)
  • Update: O(N) — scan + replace
Total iterations: V (vocab grows from initial to target).
Total training: O(V × N).
Practical:
  • N = 10 GB Turkish corpus = 10^10 bytes
  • V = 64K
  • 10^10 × 64K = 6.4 × 10^14 operations
Single CPU: weeks. Rust + parallel (HuggingFace tokenizers): hours.
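A back-of-envelope check of that estimate; the ops/sec figure is a rough assumption:
python
N = 10**10           # 10 GB corpus, in bytes
V = 64_000           # target vocab size
ops = N * V          # O(V × N) upper bound on counting + updates
ops_per_sec = 10**9  # rough single-core throughput assumption
print(f"{ops:.1e} ops ≈ {ops / ops_per_sec / 86400:.0f} days on one core")
# ≈ 7 days: "weeks" territory without parallelism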

Encoding complexity#

Per word of length L:
  • Apply rules: O(L × R) naive, O(L log L) optimized
For corpus encoding (production inference): each input ~few ms.

Memory complexity#

Training memory:
  • Corpus representation: O(N)
  • Word_freq: O(W)
  • Pair counts: O(P) — up to O(N) but usually much less
Production tokenizer (after training): vocab + merge rules. Llama 3 tokenizer file ~5 MB.

Production patterns#

Modern tokenizer libraries (tiktoken, HF tokenizers):
  • Pre-compiled C/Rust binary
  • Memory-mapped file (no full load)
  • Multi-threaded encoding
  • BPE algorithm with priority queue
GPT-4 cl100k production encoding: ~1M tokens/sec single thread.

8. Edge Cases#

BPE algorithm "simple" görünüyor ama pratik edge case'ler var:

1. Unicode normalization#

"é"
(e + combining acute) vs
"é"
(single codepoint). Aynı görünür ama farklı bytes.
Solution: NFC normalization before tokenization.
import unicodedata text = unicodedata.normalize("NFC", text)
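A self-contained check, spelling the two strings out with escapes so the difference is visible:
python
import unicodedata

s1 = "e\u0301"  # 'e' + U+0301 combining acute
s2 = "\u00e9"   # precomposed 'é' (U+00E9)
print(s1 == s2)                                # False: different codepoints
print(unicodedata.normalize("NFC", s1) == s2)  # True after NFC normalization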

2. Whitespace handling#

" the"
(leading space) vs
"the"
(no space) — modern BPE'de ayrı token (Modül 4.2'de gördük).
GPT-2 pattern:
" ?\p{L}+"
— leading space optional. Encoding'de leading space kelimenin parçası.
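A quick check with tiktoken (pip install tiktoken), whose cl100k_base encoding is GPT-4's:
python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("the"), enc.encode(" the"))  # two different single-token IDs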

3. Special tokens#

<s>, </s>, <pad>, <unk>, <|user|>: these must not go through BPE training; they are injected into the vocab manually.

vocab = {
    "<s>": 0, "</s>": 1, "<pad>": 2,
    # BPE-learned tokens
    "a": 3, "the": 4, ...
}
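In tiktoken, for example, special tokens bypass the BPE merges entirely and must be explicitly allowed at encode time:
python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)  # [50256]: a single reserved ID, not a BPE merge of the characters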

4. Numbers#

In GPT-3's tokenizer, "123456789" could surface as a single token. On math benchmarks this is fatal: the model cannot operate on individual digits.
GPT-4 cl100k: at most 3 digits at a time, so the regex above splits "1234567" left-to-right into ["123", "456", "7"]. Better arithmetic.
GPT-4o o200k: possibly per-digit grouping; still debated. Module 4.2 has the details.
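Checking with tiktoken; decode_single_token_bytes shows each token's bytes (treat the expected chunking as illustrative):
python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("123456789")
print([enc.decode_single_token_bytes(t) for t in ids])
# expected: chunks of at most 3 digits, e.g. [b'123', b'456', b'789']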

5. Multilingual tokenization#

An English-heavy corpus → English is tokenized well, Turkish inefficiently.
Modern strategy: a balanced corpus plus a per-language minimum vocabulary.

6. Code tokenization#

Code brings indentation (whitespace), operators, identifiers: how does BPE handle them?
def hello():
    print("Hello")
BPE may produce: ["def", " hello", "(", ")", ":", "\n", "    ", "print", "(", "\"", "Hello", "\"", ")"]. Indentation is represented at the token level.

7. Emoji + symbol#

"🌍" UTF-8 4 bytes. Byte-level BPE handle eder. Vocab'da yaygın emoji'ler tek token olabilir.

8. Whitespace-only sequences#

Triple newlines, tab indentation: BPE learns these as patterns too.

Practical#

Modern tokenizers (tiktoken, HF) handle these edge cases correctly. When training a custom tokenizer, test them explicitly.

9. Byte-level BPE: a Preview#

GPT-2's innovation (2019): apply BPE at the byte level.

Classic BPE (Sennrich)#

  • Initial vocab: unique characters (Unicode codepoints)
  • No sub-character merges
  • Multilingual: the vocab must cover every language's characters

Byte-level BPE (GPT-2)#

  • Initial vocab: 256 bytes (the UTF-8 alphabet)
  • No sub-byte merges
  • Every input is encodable (any Unicode)

Advantages#

  1. No OOV, ever
  2. Smaller initial vocab (256 vs 100K+)
  3. Robust to typos, new characters, unseen scripts
  4. Compact byte-level representation

Disadvantages#

  1. More tokens per Turkish/Chinese character (UTF-8 uses 2-4 bytes per char)
  2. Longer initial sequences
Modern Llama, GPT, DeepSeek: all use byte-level BPE.

The practical difference#

"hello"
ASCII = 5 byte ≈ 5 char. Same.
"Şükrü"
UTF-8 = 8 byte vs 5 char. Byte-level daha çok token initially. Ama merging ile 2-3 token oluyor.
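The byte counts are easy to verify:
python
for s in ("hello", "Şükrü", "🌍"):
    # Byte-level BPE starts from UTF-8 bytes, not characters
    print(f"{s!r}: {len(s)} chars, {len(s.encode('utf-8'))} UTF-8 bytes")
# 'hello': 5 chars, 5 bytes · 'Şükrü': 5 chars, 8 bytes · '🌍': 1 char, 4 bytes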
Module 6.6 goes into detail.

10. Bridge to Module 6.3#

This lesson covered BPE mathematically. Module 6.3:
  • Implement it from scratch (200 lines of Python)
  • Training + encoding + decoding, all of it
  • Experiments on a Turkish corpus
  • Performance: pure Python vs optimized Rust
  • Visuals: vocab + merge-rule exports

Preparation#

Before starting Module 6.3:
  • Have I internalized the training pseudocode above?
  • Do I understand how the encoding pseudocode works?
  • Have I learned the edge cases?
  • Keep Karpathy's "minbpe" repo open (for reference)

Karpathy minbpe#

Karpathy published the minbpe GitHub repo in 2024: a ~200-line pedagogical BPE implementation covering the Sennrich algorithm plus GPT-style byte-level BPE. In Module 6.3 we extend this pattern with a Turkish adaptation.

11. Mini Exercises#

  1. Manual BPE trace: what is the vocab after 3 merges on the corpus "ab ab cd cd cd"?
  2. Complexity: V=128K, N=10GB. Estimate training time on a single CPU vs GPU vs parallel Rust.
  3. Encoding trace: "the cat" with merge rules [("t","h"), ("th","e"), (" ","c"), ("c","a")]. What is the token sequence?
  4. Edge case: how does the GPT-4 (cl100k) regex pattern handle "İstanbul'da"?
  5. Multilingual corpus: BPE trained on 80% English + 20% Turkish. Does the result give fair Turkish coverage?

What Did We Learn in This Lesson?#

BPE history: Gage 1994 compression → Sennrich 2016 NLP → GPT-2 byte-level ✓
High-level algorithm: iterative merge-by-frequency ✓
Pre-tokenization strategies (GPT-2 regex, GPT-4 cl100k improvements) ✓
Training pseudocode line by line + O(V·N) complexity ✓
Encoding pseudocode + O(L²) typical ✓
Merge-rule storage patterns ✓
8 edge cases: Unicode normalization, whitespace, special tokens, numbers, multilingual text, code, emoji, whitespace-only runs ✓
Byte-level BPE preview (GPT-2, 2019) ✓
Bridge to Module 6.3: minbpe + Turkish extension ✓

Next Lesson#

6.3: Write BPE from Scratch in 200 Lines (Training + Encoding + Decoding). A Karpathy "minbpe"-style from-scratch implementation: training on a Turkish corpus, vocab visualization, encoding/decoding verification. The inner workings of modern LLM tokenizers, hands-on.

Frequently Asked Questions

Is BPE hand-designed or learned from data?
**An emergent + hand-crafted hybrid.** **Hand-crafted**: the pre-tokenization regex (GPT-2 and GPT-4 differ), the vocab size choice, special-token injection, training corpus selection: all manual decisions. **Emergent**: the actual subword boundaries, which fall out automatically from corpus frequencies. The Sennrich algorithm is mathematically simple (greedy merging), but every design decision (pre-tokenization, corpus, vocab size) has a dramatic effect on downstream LLM quality. Modern frontier tokenizers (GPT-4o o200k) are the product of years of iterative tuning.

