
Write BPE from Scratch in 200 Lines: Training + Encoding + Decoding + Turkish Corpus

A Karpathy minbpe-style from-scratch implementation: pure Python BPE training (the Sennrich algorithm), encoding/decoding, regex pre-tokenization, a byte-level extension, training on a Turkish corpus, and a comparison with Trendyol-LLM. A practical understanding of modern LLM tokenizers.

Şükrü Yusuf KAYA
60-minute read
Advanced
💻 The tokenization version of the 'write it from scratch' philosophy
In Module 6.2 we looked at BPE mathematically. Now we actually write it. Karpathy's minbpe philosophy: everything in ~200 lines of pure Python. At the end we train on a Turkish corpus and compare token efficiency against Trendyol-LLM and OpenAI. 60 minutes from now, you will see BPE not as a black box but as an engine you wrote yourself.

Lesson Map#

  1. Implementation strategy: Karpathy minbpe + a Turkish extension
  2. BasicBPE class — vocab and merge-rule state
  3. Training function — the Sennrich algorithm
  4. Encode function — text → token IDs
  5. Decode function — token IDs → text
  6. Adding regex pre-tokenization
  7. Adding a byte-level extension
  8. Training on a Turkish corpus — a 1-10 MB sample
  9. Trendyol-LLM vs OpenAI vs custom comparison
  10. Save/load + production patterns
  11. Limitations + next steps

1. Implementation Strategy#

The Karpathy minbpe philosophy#

  • Pure Python (no Rust needed)
  • Pedagogical clarity > production speed
  • ~200 lines total
  • Three classes: BasicTokenizer, RegexTokenizer, GPT4Tokenizer

Our plan#

The same structure, but with a Turkish extension:
  1. TurkceBasicBPE: plain byte-level BPE, no pre-tokenization
  2. TurkceRegexBPE: GPT-style pre-tokenization + a Turkish-aware pattern
  3. TurkceByteLevelBPE: byte-level extension + Turkish corpus
Then we train on a Turkish corpus and compare.

Why pure Python?#

Production HuggingFace tokenizers are written in Rust — minutes instead of hours. For pedagogical learning, though, Python is fine. Module 6.8 covers the production Rust patterns.

2. BasicBPE Class — Vocab + Merge Rules#

class TurkceBasicBPE:
    def __init__(self):
        self.merges = {}  # (int, int) -> int
        self.vocab = {}   # int -> bytes
        self._init_base_vocab()

    def _init_base_vocab(self):
        # 256 bytes as the initial vocab
        for i in range(256):
            self.vocab[i] = bytes([i])

    def __repr__(self):
        return f"TurkceBasicBPE(vocab_size={len(self.vocab)}, merges={len(self.merges)})"

State#

  • vocab: int -> bytes mapping (token ID -> byte sequence)
  • merges: (int, int) -> int (pair -> new token ID)

Initial vocab#

The 256 byte values (0-255). This is the foundation of byte-level BPE: initially, every UTF-8 byte is its own token.
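A quick sanity check on what the initial token sequence looks like for a Turkish word (plain Python, nothing from the tokenizer yet): multi-byte UTF-8 characters such as 'ü' and 'ç' start out as two byte tokens each.

word = "Türkçe"
raw = word.encode("utf-8")
print(list(raw))            # [84, 195, 188, 114, 107, 195, 167, 101]
print(len(word), len(raw))  # 6 characters -> 8 initial byte tokens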

3. Training Function — Sennrich Algorithm#

def train(self, text, vocab_size, verbose=False):
    assert vocab_size >= 256
    num_merges = vocab_size - 256

    # Step 1: text -> byte sequence
    text_bytes = text.encode("utf-8")
    ids = list(text_bytes)  # initially: a sequence of ints 0-255

    # Step 2: iterative merges
    merges = {}
    for i in range(num_merges):
        # 2a. Count adjacent pairs
        stats = {}
        for j in range(len(ids) - 1):
            pair = (ids[j], ids[j + 1])
            stats[pair] = stats.get(pair, 0) + 1
        if not stats:
            break

        # 2b. Find the best (most frequent) pair
        best_pair = max(stats, key=stats.get)
        new_id = 256 + i

        # 2c. Apply the merge
        ids = self._merge(ids, best_pair, new_id)

        # 2d. Save the merge rule
        merges[best_pair] = new_id

        # 2e. Update the vocab
        self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]

        if verbose and i % 100 == 0:
            print(f"Step {i}: merged {best_pair} -> {new_id}, "
                  f"freq={stats[best_pair]}, ids_len={len(ids)}")

    self.merges = merges

def _merge(self, ids, pair, new_id):
    """Replace all occurrences of pair with new_id."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(new_id)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

Analysis#

  • text.encode("utf-8") → byte sequence; each byte is an int in 0-255
  • the _merge helper → replaces every occurrence of the pair with the new ID across the corpus
  • vocab[new_id] → byte concatenation (used for inspection and by decode; not needed during encoding)

Pure Python complexity#

O(V × N): a 1 MB corpus with V = 1000 merges is ~10^9 operations. Single-threaded Python: minutes. Rust: seconds. Fine for pedagogy.
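A rough, order-of-magnitude sanity check (synthetic data; exact timings depend on the machine): one pair-counting pass over roughly a million ids already costs a noticeable fraction of a second in CPython, and training repeats a counting pass plus a merge pass for every one of the ~1000 merges.

import time

ids = list(range(256)) * 4096  # ~1M ints, standing in for a 1 MB corpus

t0 = time.perf_counter()
stats = {}
for pair in zip(ids, ids[1:]):
    stats[pair] = stats.get(pair, 0) + 1
print(f"One counting pass: {time.perf_counter() - t0:.2f} s -> ~1000 merges ≈ minutes")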

4. Encode Function — Text → Token IDs#

def encode(self, text):
    """Convert text to token IDs."""
    text_bytes = text.encode("utf-8")
    ids = list(text_bytes)  # initial byte sequence

    while len(ids) >= 2:
        # Among the current pairs, find the one with the lowest merge index (earliest learned)
        stats = self._get_stats(ids)
        pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
        if pair not in self.merges:
            break  # no more merges applicable
        idx = self.merges[pair]
        ids = self._merge(ids, pair, idx)
    return ids

def _get_stats(self, ids):
    """Count adjacent pairs in ids."""
    stats = {}
    for pair in zip(ids, ids[1:]):
        stats[pair] = stats.get(pair, 0) + 1
    return stats

Strategy#

Naive: apply the merge rules in training order → O(V × L). Slow.
Optimized: at every step, among the currently applicable pairs, pick the earliest-learned merge → priority-queue-like behavior. Faster.
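A toy illustration of why the ordering rule matters (hand-injected, hypothetical merges; byte values 'a'=97, 'b'=98, 'c'=99): the rule (256, 99) → 257 can only fire after (97, 98) → 256 has been applied.

tok = TurkceBasicBPE()
tok.merges = {(97, 98): 256, (256, 99): 257}  # hypothetical state, injected by hand
tok.vocab[256] = b"ab"
tok.vocab[257] = b"abc"

print(tok.encode("abc"))  # [257]: (97, 98) -> 256 fires first, then (256, 99) -> 257
print(tok.decode([257]))  # 'abc'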

Complexity#

Roughly O(L²) for this implementation: each merge pass rescans the sequence. Production tokenizers (tiktoken) get closer to O(L log L) with smarter data structures. Fine for pedagogical Python.

5. Decode Function — Token IDs → Text#

def decode(self, ids):
    """Convert token IDs back to text."""
    text_bytes = b"".join(self.vocab[idx] for idx in ids)
    text = text_bytes.decode("utf-8", errors="replace")
    return text

Simple, but be careful#

  • self.vocab[idx] → the byte sequence for that token
  • b"".join → concatenation
  • decode("utf-8") → text

errors="replace"#

Invalid UTF-8 byte sequences can occur (corrupt data in the training corpus, or a cut in the middle of a multi-byte character).
errors="replace" substitutes the replacement character (U+FFFD, �) instead of crashing.
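A one-liner that shows the behavior: a lone UTF-8 lead byte is invalid on its own, and errors="replace" maps it to U+FFFD instead of raising UnicodeDecodeError.

print(bytes([195]).decode("utf-8", errors="replace"))  # '�' (U+FFFD), no exception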

Roundtrip test#

text = "Merhaba dünya!" ids = tokenizer.encode(text) decoded = tokenizer.decode(ids) assert decoded == text, f"Roundtrip failed: {text!r} → {decoded!r}"
Iyi training'de roundtrip bit-exact. Some edge cases:
  • Surrogate pairs
  • Combining characters
  • Whitespace normalization
# Full BasicBPE class — ~50 lines of Python
class TurkceBasicBPE:
    def __init__(self):
        self.merges = {}
        self.vocab = {i: bytes([i]) for i in range(256)}

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        num_merges = vocab_size - 256
        ids = list(text.encode("utf-8"))

        for i in range(num_merges):
            stats = self._get_stats(ids)
            if not stats:
                break
            best_pair = max(stats, key=stats.get)
            new_id = 256 + i
            ids = self._merge(ids, best_pair, new_id)
            self.merges[best_pair] = new_id
            self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]
            if verbose and i % 100 == 0:
                print(f"Step {i}: {best_pair} -> {new_id}, len={len(ids)}")

    def encode(self, text):
        ids = list(text.encode("utf-8"))
        while len(ids) >= 2:
            stats = self._get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break
            ids = self._merge(ids, pair, self.merges[pair])
        return ids

    def decode(self, ids):
        return b"".join(self.vocab[idx] for idx in ids).decode("utf-8", errors="replace")

    @staticmethod
    def _get_stats(ids):
        stats = {}
        for pair in zip(ids, ids[1:]):
            stats[pair] = stats.get(pair, 0) + 1
        return stats

    @staticmethod
    def _merge(ids, pair, new_id):
        new_ids = []
        i = 0
        while i < len(ids):
            if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
                new_ids.append(new_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        return new_ids


# Test
text = "Merhaba dünya! Türkçe BPE tokenizer'ı sıfırdan yazıyoruz."
text_repeated = (text + " ") * 50  # 50x repeat for training data

tok = TurkceBasicBPE()
tok.train(text_repeated, vocab_size=300, verbose=True)

encoded = tok.encode("Merhaba dünya")
decoded = tok.decode(encoded)
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"Compression: {len('Merhaba dünya'.encode('utf-8'))} bytes → {len(encoded)} tokens")
The full BasicBPE class — ~50 lines, tested on Turkish text.

6. Adding Regex Pre-tokenization#

GPT-2-style pre-tokenization: capture word boundaries before merging.
import regex as re

GPT2_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
)


class TurkceRegexBPE(TurkceBasicBPE):
    def __init__(self):
        super().__init__()
        self.pattern = GPT2_PATTERN

    def train(self, text, vocab_size, verbose=False):
        # Pre-tokenize first
        chunks = self.pattern.findall(text)
        # Each chunk -> byte sequence
        chunk_ids = [list(chunk.encode("utf-8")) for chunk in chunks]

        num_merges = vocab_size - 256

        # Iterative merges (within chunks, never across chunk boundaries)
        for i in range(num_merges):
            stats = {}
            for ids in chunk_ids:
                for pair in zip(ids, ids[1:]):
                    stats[pair] = stats.get(pair, 0) + 1
            if not stats:
                break
            best_pair = max(stats, key=stats.get)
            new_id = 256 + i
            # Apply the merge to every chunk
            chunk_ids = [self._merge(ids, best_pair, new_id) for ids in chunk_ids]
            self.merges[best_pair] = new_id
            self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]
            if verbose and i % 100 == 0:
                total_tokens = sum(len(ids) for ids in chunk_ids)
                print(f"Step {i}: {best_pair} -> {new_id}, total tokens={total_tokens}")

    def encode(self, text):
        chunks = self.pattern.findall(text)
        all_ids = []
        for chunk in chunks:
            ids = list(chunk.encode("utf-8"))
            while len(ids) >= 2:
                stats = self._get_stats(ids)
                pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
                if pair not in self.merges:
                    break
                ids = self._merge(ids, pair, self.merges[pair])
            all_ids.extend(ids)
        return all_ids

Pattern modification for Turkish#

The GPT-2 regex hard-codes English apostrophe handling ('s, 'll, 'd, ...) — wrong for Turkish. A Turkish-aware pattern:
TR_PATTERN = re.compile(
    r"\p{L}+(?:'\p{L}+)?| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+"
)
In Turkish, the apostrophe is usually part of the word (e.g., "İstanbul'da") — this pattern captures it as a single chunk.
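A quick comparison of the two patterns on a Turkish sentence (expected chunks, based on the patterns defined above):

sample = "İstanbul'da yaşıyorum."
print(GPT2_PATTERN.findall(sample))
# ['İstanbul', "'d", 'a', ' yaşıyorum', '.']  <- the English 'd rule splits the suffix
print(TR_PATTERN.findall(sample))
# ["İstanbul'da", ' ', 'yaşıyorum', '.']      <- the apostrophe stays inside the word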

7. Byte-Level Extension#

The implementation above is already byte-level: text.encode("utf-8") converts everything to bytes, so no new mechanism is needed.
What does matter is the vocabulary-size handling, including special tokens:
class TurkceByteLevelBPE(TurkceRegexBPE):
    def __init__(self):
        super().__init__()
        # Token ID assignments:
        #   0-255: bytes
        #   256-V: learned merges
        #   V+:    special tokens (assigned after training)
        self.special_tokens = {}

    def register_special_tokens(self, tokens):
        """Register special tokens (IDs assigned after the BPE vocab)."""
        for tok in tokens:
            new_id = len(self.vocab) + len(self.special_tokens)
            self.special_tokens[tok] = new_id

    def encode_with_special(self, text, allowed_special=None):
        """Encode text, recognizing special tokens."""
        if allowed_special is None:
            allowed_special = set(self.special_tokens.keys())
        # Find special tokens in the text, split around them
        # ... (complex implementation)

Special tokens#

<s>, </s>, <pad>, <|user|>, <|assistant|> — these are IDs that BPE training never sees; they are injected manually after training (a minimal encode_with_special sketch follows below).
Llama 3 reserves roughly 256 special-token IDs at the top of its ~128K vocab; the rest is the raw bytes plus the BPE-learned merges.
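The elided encode_with_special above can be completed with a simple split-around-specials approach. This is a minimal sketch, not tiktoken's actual algorithm; it assumes the special-token strings appear verbatim in the text and never overlap each other.

import regex as re

def encode_with_special(self, text, allowed_special=None):
    """Minimal sketch: split the text around registered special tokens, BPE-encode the rest."""
    if allowed_special is None:
        allowed_special = set(self.special_tokens)
    if not allowed_special:
        return self.encode(text)
    # Longest tokens first; the capturing group keeps the special tokens in the split result
    pattern = "(" + "|".join(re.escape(t) for t in sorted(allowed_special, key=len, reverse=True)) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in self.special_tokens:
            ids.append(self.special_tokens[part])
        elif part:  # skip empty strings produced by the split
            ids.extend(self.encode(part))
    return ids

TurkceByteLevelBPE.encode_with_special = encode_with_special  # patch the sketch onto the class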

8. Training on a Turkish Corpus#

Turkish Wikipedia, news articles, or a literature corpus.
# A ~10 MB Turkish corpus
with open("turkce_corpus.txt", "r", encoding="utf-8") as f:
    turkce_text = f.read()

print(f"Corpus size: {len(turkce_text.encode('utf-8')) / 1024**2:.2f} MB")

tok = TurkceRegexBPE()
tok.train(turkce_text, vocab_size=1000, verbose=True)

Vocab inspection#

# Show the learned merges
print("First 50 learned tokens:")
for i in range(256, 306):
    print(f"  {i}: {tok.vocab[i].decode('utf-8', errors='replace')!r}")
Expected output:
256: ' '
257: 'e '
258: 'er'
259: 'le'
260: ' bi'
261: 'an'
262: 'ar'
263: 'in'
264: 'ler'   ← Turkish plural suffix!
265: 'lar'   ← Turkish plural suffix!
...
Turkish morphological patterns emerge on their own.

Eval: encoding a Turkish paragraph#

test_text = """
Türkiye Cumhuriyeti, Avrupa ve Asya kıtaları arasında köprü konumundadır.
Anadolu yarımadasında 783,562 km² alana sahiptir.
"""

ids = tok.encode(test_text)
print(f"Text bytes: {len(test_text.encode('utf-8'))}")
print(f"Tokens: {len(ids)}")
print(f"Bytes per token: {len(test_text.encode('utf-8')) / len(ids):.2f}")

# Custom (1K vocab): ~2.0 bytes/token (decent)
# GPT-4 cl100k:      ~3-4 bytes/token (multilingual)
# Turkish-tuned:     ~4-5 bytes/token

9. Comparison: Custom vs Trendyol-LLM vs OpenAI#

# Custom tokenizer (our work)
custom_tok = TurkceRegexBPE()
custom_tok.train(turkce_text, vocab_size=10000)

# Trendyol-LLM tokenizer
from transformers import AutoTokenizer
trendyol_tok = AutoTokenizer.from_pretrained("Trendyol/Trendyol-LLM-7b-base-v1.0")

# OpenAI tiktoken (GPT-4o)
import tiktoken
gpt4o_tok = tiktoken.get_encoding("o200k_base")

# Test text
test = """
Yapay zeka, makine öğrenmesi alanında büyük ilerlemeler kaydetti.
Türkiye'de bu teknolojinin gelişimi son yıllarda hızlandı.
2026 yılı itibarıyla, frontier modeller Türkçe'yi neredeyse mükemmel anlıyor.
"""

custom_tokens = custom_tok.encode(test)
trendyol_tokens = trendyol_tok.encode(test, add_special_tokens=False)
gpt4o_tokens = gpt4o_tok.encode(test)

print(f"Text bytes: {len(test.encode('utf-8'))}")
print(f"Custom (10K vocab): {len(custom_tokens)} tokens")
print(f"Trendyol-LLM (32K vocab): {len(trendyol_tokens)} tokens")
print(f"GPT-4o o200k (200K vocab): {len(gpt4o_tokens)} tokens")

Expected result#

Text bytes: 350
Custom (10K vocab):        110 tokens  (~3.2 bytes/token)
Trendyol-LLM (32K vocab):   75 tokens  (~4.7 bytes/token)
GPT-4o o200k (200K vocab):  95 tokens  (~3.7 bytes/token)

Analysis#

  • Trendyol-LLM is optimized for Turkish → the most compact (4.7 bytes/token)
  • GPT-4o's o200k is multilingual → in the middle (3.7 bytes/token): better than cl100k, but not as good as Trendyol
  • The custom tokenizer has a small vocab (10K) and a small training corpus → the least compact

The advantage of a Turkish-tuned tokenizer#

The same paragraph:
  • Trendyol-LLM: 75 tokens → $0.001 (at gpt-5-mini pricing)
  • GPT-4o: 95 tokens → $0.0013 (~27% more expensive)
  • Custom 10K: 110 tokens → $0.0014 (~47% more expensive)
For a Turkish company, a Turkish-tuned tokenizer is a direct cost saving.

10. Save / Load + Production Patterns#

import json
import base64

def save(self, path):
    data = {
        "merges": {f"{k[0]},{k[1]}": v for k, v in self.merges.items()},
        "vocab": {str(k): base64.b64encode(v).decode("ascii") for k, v in self.vocab.items()},
        "pattern": self.pattern.pattern if hasattr(self, "pattern") else None,
    }
    with open(path, "w") as f:
        json.dump(data, f)

def load(self, path):
    with open(path) as f:
        data = json.load(f)
    self.merges = {tuple(map(int, k.split(","))): v for k, v in data["merges"].items()}
    self.vocab = {int(k): base64.b64decode(v) for k, v in data["vocab"].items()}
    if data["pattern"]:
        self.pattern = re.compile(data["pattern"])

File size#

A 10K vocab with ~10K merges → roughly 500 KB of JSON. Production formats (e.g., HuggingFace's) are more compact.

Production deployment#

In production:
  1. Pre-load once: load the tokenizer at server boot
  2. Multi-threaded encoding: HuggingFace tokenizers are already thread-safe in Rust
  3. Cache results: for common prompts (see the sketch after this list)
  4. Validate: a roundtrip test in production (encode → decode → compare)
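A minimal sketch of items 3 and 4, assuming tok is the trained TurkceRegexBPE from the earlier sections (the function names here are ours, not from any library):

from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_encode(text: str):
    # lru_cache requires hashable arguments; returning a tuple keeps the cached value immutable
    return tuple(tok.encode(text))

def validate_roundtrip(text: str) -> bool:
    # encode -> decode -> compare; log or alert on a mismatch in a real service
    return tok.decode(list(cached_encode(text))) == text

assert validate_roundtrip("Merhaba dünya!")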

When to use custom Python BPE?#

  • Pedagogical / learning (this lesson)
  • Research prototyping
  • Small-scale projects (< 1K req/s)
When to use HuggingFace tokenizers / tiktoken (production):
  • Production, high QPS
  • Compatibility with existing models
  • Multi-thread efficiency

11. Limitations + Next Steps#

Limits of our implementation#

  1. Slow: pure Python, single-threaded → impractical for large corpora
  2. No multiprocessing for the stats counting
  3. Memory: the entire corpus is held in RAM
  4. No incremental training: new data requires a full retrain
  5. No subword regularization (Kudo 2018)

Production-grade alternatives#

  • HuggingFace tokenizers (Rust): fast, multi-threaded, well-tested (see the sketch below)
  • tiktoken (Rust): OpenAI's production tokenizer
  • SentencePiece (C++): Google's multilingual tokenizer
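For contrast, a rough sketch of training a comparable byte-level BPE with the HuggingFace tokenizers library; the file name turkce_corpus.txt and the 10K vocab size are assumptions carried over from the examples above.

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE trained in Rust: the same corpus finishes in seconds rather than minutes
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=10000,
    special_tokens=["<s>", "</s>", "<pad>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(["turkce_corpus.txt"], trainer)
tokenizer.save("turkce_bpe.json")

ids = tokenizer.encode("Merhaba dünya!").ids
print(ids, tokenizer.decode(ids))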

The rest of Module 6#

  • 6.4: WordPiece (BERT) — the likelihood-based variant
  • 6.5: SentencePiece — language-agnostic
  • 6.6: Byte-level BPE in detail (the GPT family)
  • 6.7: tiktoken in production
  • 6.8: HuggingFace tokenizers — the Rust-backed library
  • 6.9: Turkish tokenization in depth — production tips
  • 6.10: Custom domain tokenizers — code, biomedical, legal

Capstone C13#

At the end of Module 6, the TurkTokenizer capstone: a tokenizer optimized for Turkish (built in Modules 6.9-6.10, with the result published to the HuggingFace Hub).

12. Mini Exercises#

  1. Vocab inspection: train a 1000-vocab tokenizer on a Turkish corpus. What are the first 20 learned tokens?
  2. Compression ratio: 'Yapay zeka teknolojisi geliştirilmektedir.' — how many tokens with a 5K-vocab tokenizer?
  3. Effect of pre-tokenization: BasicBPE vs RegexBPE on the same corpus with the same vocab size → what is the compression difference?
  4. Turkish morphology: how does the tokenizer split 'evlerimizden'? Expectation vs reality.
  5. Production decision: 100 req/s, Turkish customer support. Custom BPE or a production library?

What Did We Learn in This Lesson?#

✓ The Karpathy minbpe philosophy + the Turkish extension
✓ The BasicBPE class implemented from scratch
✓ The training function: the Sennrich algorithm in pure Python
✓ Encode/decode with roundtrip verification
✓ Regex pre-tokenization: GPT-2 style + Turkish-aware
✓ The byte-level extension and special tokens
✓ Training on a Turkish corpus: a practical example
✓ Comparison: custom vs Trendyol-LLM vs GPT-4o
✓ Save/load + production patterns
✓ Limitations + the bridge to Module 6.4+

Next Lesson#

6.4 — WordPiece: BERT's Choice and a Likelihood-Based Subword Alternative to BPE. Google's WordPiece algorithm: merging by likelihood, and its use across the BERT family. Why it differs slightly from BPE, and when to prefer it.

Frequently Asked Questions

Is this implementation production-ready?
**No — pedagogical only.** For production: HuggingFace tokenizers (Rust), tiktoken (Rust), SentencePiece (C++). Pure single-threaded Python BPE encodes at roughly ~1 MB/s; production needs 100+ MB/s. The Module 6.3 implementation is about **understanding**, not production use. Module 6.8 covers production tokenizers in detail.
