
Write BPE from Scratch in 200 Lines: Training + Encoding + Decoding + Turkish Corpus

A Karpathy minbpe-style from-scratch implementation: pure Python BPE training (the Sennrich algorithm), encoding/decoding, regex pre-tokenization, a byte-level extension, training on a Turkish corpus, and a comparison with Trendyol-LLM. A practical understanding of modern LLM tokenizers.

Şükrü Yusuf KAYA
60-minute read
Advanced
💻 The tokenization version of the 'write it from scratch' philosophy
In Module 6.2 we looked at BPE mathematically. Now we actually write it. Karpathy's minbpe philosophy: everything in ~200 lines of pure Python. At the end we train on a Turkish corpus and compare token efficiency against Trendyol-LLM and OpenAI. 60 minutes from now, you will see BPE not as a black box but as an engine you wrote yourself.

Lesson Map#

  1. Implementation strategy: Karpathy minbpe + a Turkish extension
  2. BasicBPE class — vocab and merge-rule state
  3. Training function — the Sennrich algorithm
  4. Encode function — text → token IDs
  5. Decode function — token IDs → text
  6. Adding regex pre-tokenization
  7. Adding a byte-level extension
  8. Training on a Turkish corpus — a 1-10 MB sample
  9. Trendyol-LLM vs OpenAI vs custom comparison
  10. Save/load + production patterns
  11. Limitations + next steps

1. Implementation Strategy#

The Karpathy minbpe philosophy#

  • Pure Python (no Rust needed)
  • Pedagogical clarity > production speed
  • ~200 lines total
  • Three classes: BasicTokenizer, RegexTokenizer, GPT4Tokenizer

Our plan#

The same structure, but with a Turkish extension:
  1. TurkceBasicBPE: plain byte-level BPE, no pre-tokenization
  2. TurkceRegexBPE: GPT-style pre-tokenization + a Turkish-aware pattern
  3. TurkceByteLevelBPE: byte-level extension + Turkish corpus
Then we train on a Turkish corpus and compare.

Why pure Python?#

Production HuggingFace tokenizers are written in Rust — minutes instead of hours. For pedagogical learning, though, Python is fine. Module 6.8 covers the production Rust patterns.

2. BasicBPE Class — Vocab + Merge Rules#

class TurkceBasicBPE:
    def __init__(self):
        self.merges = {}  # (int, int) -> int
        self.vocab = {}   # int -> bytes
        self._init_base_vocab()

    def _init_base_vocab(self):
        # 256 bytes as the initial vocab
        for i in range(256):
            self.vocab[i] = bytes([i])

    def __repr__(self):
        return f"TurkceBasicBPE(vocab_size={len(self.vocab)}, merges={len(self.merges)})"

State#

  • vocab: int -> bytes mapping (token ID -> byte sequence)
  • merges: (int, int) -> int (pair -> new token ID)

Initial vocab#

The 256 byte values (0-255). This is the foundation of byte-level BPE: initially, every UTF-8 byte is its own token.
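A quick sanity check on what the initial token sequence looks like for a Turkish word (plain Python, nothing from the tokenizer yet): multi-byte UTF-8 characters such as 'ü' and 'ç' start out as two byte tokens each.

word = "Türkçe"
raw = word.encode("utf-8")
print(list(raw))            # [84, 195, 188, 114, 107, 195, 167, 101]
print(len(word), len(raw))  # 6 characters -> 8 initial byte tokens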

3. Training Function — Sennrich Algorithm#

def train(self, text, vocab_size, verbose=False):
    assert vocab_size >= 256
    num_merges = vocab_size - 256

    # Step 1: text -> byte sequence
    text_bytes = text.encode("utf-8")
    ids = list(text_bytes)  # initially: a sequence of ints 0-255

    # Step 2: iterative merges
    merges = {}
    for i in range(num_merges):
        # 2a. Count adjacent pairs
        stats = {}
        for j in range(len(ids) - 1):
            pair = (ids[j], ids[j + 1])
            stats[pair] = stats.get(pair, 0) + 1
        if not stats:
            break

        # 2b. Find the best (most frequent) pair
        best_pair = max(stats, key=stats.get)
        new_id = 256 + i

        # 2c. Apply the merge
        ids = self._merge(ids, best_pair, new_id)

        # 2d. Save the merge rule
        merges[best_pair] = new_id

        # 2e. Update the vocab
        self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]

        if verbose and i % 100 == 0:
            print(f"Step {i}: merged {best_pair} -> {new_id}, "
                  f"freq={stats[best_pair]}, ids_len={len(ids)}")

    self.merges = merges

def _merge(self, ids, pair, new_id):
    """Replace all occurrences of pair with new_id."""
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            new_ids.append(new_id)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

Analysis#

  • text.encode("utf-8") → byte sequence; each byte is an int in 0-255
  • the _merge helper → replaces every occurrence of the pair with the new ID across the corpus
  • vocab[new_id] → byte concatenation (used for inspection and by decode; not needed during encoding)

Pure Python complexity#

O(V × N): a 1 MB corpus with V = 1000 merges is ~10^9 operations. Single-threaded Python: minutes. Rust: seconds. Fine for pedagogy.
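A rough, order-of-magnitude sanity check (synthetic data; exact timings depend on the machine): one pair-counting pass over roughly a million ids already costs a noticeable fraction of a second in CPython, and training repeats a counting pass plus a merge pass for every one of the ~1000 merges.

import time

ids = list(range(256)) * 4096  # ~1M ints, standing in for a 1 MB corpus

t0 = time.perf_counter()
stats = {}
for pair in zip(ids, ids[1:]):
    stats[pair] = stats.get(pair, 0) + 1
print(f"One counting pass: {time.perf_counter() - t0:.2f} s -> ~1000 merges ≈ minutes")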

4. Encode Function — Text → Token IDs#

def encode(self, text):
    """Convert text to token IDs."""
    text_bytes = text.encode("utf-8")
    ids = list(text_bytes)  # initial byte sequence

    while len(ids) >= 2:
        # Among the current pairs, find the one with the lowest merge index (earliest learned)
        stats = self._get_stats(ids)
        pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
        if pair not in self.merges:
            break  # no more merges applicable
        idx = self.merges[pair]
        ids = self._merge(ids, pair, idx)
    return ids

def _get_stats(self, ids):
    """Count adjacent pairs in ids."""
    stats = {}
    for pair in zip(ids, ids[1:]):
        stats[pair] = stats.get(pair, 0) + 1
    return stats

Strategy#

Naive: apply the merge rules in training order → O(V × L). Slow.
Optimized: at every step, among the currently applicable pairs, pick the earliest-learned merge → priority-queue-like behavior. Faster.
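A toy illustration of why the ordering rule matters (hand-injected, hypothetical merges; byte values 'a'=97, 'b'=98, 'c'=99): the rule (256, 99) → 257 can only fire after (97, 98) → 256 has been applied.

tok = TurkceBasicBPE()
tok.merges = {(97, 98): 256, (256, 99): 257}  # hypothetical state, injected by hand
tok.vocab[256] = b"ab"
tok.vocab[257] = b"abc"

print(tok.encode("abc"))  # [257]: (97, 98) -> 256 fires first, then (256, 99) -> 257
print(tok.decode([257]))  # 'abc'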

Complexity#

Roughly O(L²) for this implementation: each merge pass rescans the sequence. Production tokenizers (tiktoken) get closer to O(L log L) with smarter data structures. Fine for pedagogical Python.

5. Decode Function — Token IDs → Text#

def decode(self, ids):
    """Convert token IDs back to text."""
    text_bytes = b"".join(self.vocab[idx] for idx in ids)
    text = text_bytes.decode("utf-8", errors="replace")
    return text

Simple, but be careful#

  • self.vocab[idx] → the byte sequence for that token
  • b"".join → concatenation
  • decode("utf-8") → text

errors="replace"#

Invalid UTF-8 byte sequences can occur (corrupt data in the training corpus, or a cut in the middle of a multi-byte character).
errors="replace" substitutes the replacement character (U+FFFD, �) instead of crashing.
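A one-liner that shows the behavior: a lone UTF-8 lead byte is invalid on its own, and errors="replace" maps it to U+FFFD instead of raising UnicodeDecodeError.

print(bytes([195]).decode("utf-8", errors="replace"))  # '�' (U+FFFD), no exception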

Roundtrip test#

text = "Merhaba dünya!" ids = tokenizer.encode(text) decoded = tokenizer.decode(ids) assert decoded == text, f"Roundtrip failed: {text!r} → {decoded!r}"
Iyi training'de roundtrip bit-exact. Some edge cases:
  • Surrogate pairs
  • Combining characters
  • Whitespace normalization
# Full BasicBPE class — ~50 lines of Python
class TurkceBasicBPE:
    def __init__(self):
        self.merges = {}
        self.vocab = {i: bytes([i]) for i in range(256)}

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        num_merges = vocab_size - 256
        ids = list(text.encode("utf-8"))

        for i in range(num_merges):
            stats = self._get_stats(ids)
            if not stats:
                break
            best_pair = max(stats, key=stats.get)
            new_id = 256 + i
            ids = self._merge(ids, best_pair, new_id)
            self.merges[best_pair] = new_id
            self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]
            if verbose and i % 100 == 0:
                print(f"Step {i}: {best_pair} -> {new_id}, len={len(ids)}")

    def encode(self, text):
        ids = list(text.encode("utf-8"))
        while len(ids) >= 2:
            stats = self._get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break
            ids = self._merge(ids, pair, self.merges[pair])
        return ids

    def decode(self, ids):
        return b"".join(self.vocab[idx] for idx in ids).decode("utf-8", errors="replace")

    @staticmethod
    def _get_stats(ids):
        stats = {}
        for pair in zip(ids, ids[1:]):
            stats[pair] = stats.get(pair, 0) + 1
        return stats

    @staticmethod
    def _merge(ids, pair, new_id):
        new_ids = []
        i = 0
        while i < len(ids):
            if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
                new_ids.append(new_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        return new_ids


# Test
text = "Merhaba dünya! Türkçe BPE tokenizer'ı sıfırdan yazıyoruz."
text_repeated = (text + " ") * 50  # 50x repeat for training data

tok = TurkceBasicBPE()
tok.train(text_repeated, vocab_size=300, verbose=True)

encoded = tok.encode("Merhaba dünya")
decoded = tok.decode(encoded)
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
print(f"Compression: {len('Merhaba dünya'.encode('utf-8'))} bytes → {len(encoded)} tokens")
The full BasicBPE class — ~50 lines, tested on Turkish text.

6. Adding Regex Pre-tokenization#

GPT-2-style pre-tokenization: capture word boundaries before merging.
import regex as re

GPT2_PATTERN = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
)


class TurkceRegexBPE(TurkceBasicBPE):
    def __init__(self):
        super().__init__()
        self.pattern = GPT2_PATTERN

    def train(self, text, vocab_size, verbose=False):
        # Pre-tokenize first
        chunks = self.pattern.findall(text)
        # Each chunk -> byte sequence
        chunk_ids = [list(chunk.encode("utf-8")) for chunk in chunks]

        num_merges = vocab_size - 256

        # Iterative merges (within chunks, never across chunk boundaries)
        for i in range(num_merges):
            stats = {}
            for ids in chunk_ids:
                for pair in zip(ids, ids[1:]):
                    stats[pair] = stats.get(pair, 0) + 1
            if not stats:
                break
            best_pair = max(stats, key=stats.get)
            new_id = 256 + i
            # Apply the merge to every chunk
            chunk_ids = [self._merge(ids, best_pair, new_id) for ids in chunk_ids]
            self.merges[best_pair] = new_id
            self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]
            if verbose and i % 100 == 0:
                total_tokens = sum(len(ids) for ids in chunk_ids)
                print(f"Step {i}: {best_pair} -> {new_id}, total tokens={total_tokens}")

    def encode(self, text):
        chunks = self.pattern.findall(text)
        all_ids = []
        for chunk in chunks:
            ids = list(chunk.encode("utf-8"))
            while len(ids) >= 2:
                stats = self._get_stats(ids)
                pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
                if pair not in self.merges:
                    break
                ids = self._merge(ids, pair, self.merges[pair])
            all_ids.extend(ids)
        return all_ids

Pattern modification for Turkish#

The GPT-2 regex hard-codes English apostrophe handling ('s, 'll, 'd, ...) — wrong for Turkish. A Turkish-aware pattern:
TR_PATTERN = re.compile(
    r"\p{L}+(?:'\p{L}+)?| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+"
)
In Turkish, the apostrophe is usually part of the word (e.g., "İstanbul'da") — this pattern captures it as a single chunk.
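A quick comparison of the two patterns on a Turkish sentence (expected chunks, based on the patterns defined above):

sample = "İstanbul'da yaşıyorum."
print(GPT2_PATTERN.findall(sample))
# ['İstanbul', "'d", 'a', ' yaşıyorum', '.']  <- the English 'd rule splits the suffix
print(TR_PATTERN.findall(sample))
# ["İstanbul'da", ' ', 'yaşıyorum', '.']      <- the apostrophe stays inside the word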

7. Byte-Level Extension#

The implementation above is already byte-level: text.encode("utf-8") converts everything to bytes, so no new mechanism is needed.
What does matter is the vocabulary-size handling, including special tokens:
class TurkceByteLevelBPE(TurkceRegexBPE):
    def __init__(self):
        super().__init__()
        # Token ID assignments:
        #   0-255: bytes
        #   256-V: learned merges
        #   V+:    special tokens (assigned after training)
        self.special_tokens = {}

    def register_special_tokens(self, tokens):
        """Register special tokens (IDs assigned after the BPE vocab)."""
        for tok in tokens:
            new_id = len(self.vocab) + len(self.special_tokens)
            self.special_tokens[tok] = new_id

    def encode_with_special(self, text, allowed_special=None):
        """Encode text, recognizing special tokens."""
        if allowed_special is None:
            allowed_special = set(self.special_tokens.keys())
        # Find special tokens in the text, split around them
        # ... (complex implementation)

Special tokens#

<s>, </s>, <pad>, <|user|>, <|assistant|> — these are IDs that BPE training never sees; they are injected manually after training (a minimal encode_with_special sketch follows below).
Llama 3 reserves roughly 256 special-token IDs at the top of its ~128K vocab; the rest is the raw bytes plus the BPE-learned merges.
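The elided encode_with_special above can be completed with a simple split-around-specials approach. This is a minimal sketch, not tiktoken's actual algorithm; it assumes the special-token strings appear verbatim in the text and never overlap each other.

import regex as re

def encode_with_special(self, text, allowed_special=None):
    """Minimal sketch: split the text around registered special tokens, BPE-encode the rest."""
    if allowed_special is None:
        allowed_special = set(self.special_tokens)
    if not allowed_special:
        return self.encode(text)
    # Longest tokens first; the capturing group keeps the special tokens in the split result
    pattern = "(" + "|".join(re.escape(t) for t in sorted(allowed_special, key=len, reverse=True)) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in self.special_tokens:
            ids.append(self.special_tokens[part])
        elif part:  # skip empty strings produced by the split
            ids.extend(self.encode(part))
    return ids

TurkceByteLevelBPE.encode_with_special = encode_with_special  # patch the sketch onto the class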

8. Training on a Turkish Corpus#

Turkish Wikipedia, news articles, or a literature corpus.
# A ~10 MB Turkish corpus
with open("turkce_corpus.txt", "r", encoding="utf-8") as f:
    turkce_text = f.read()

print(f"Corpus size: {len(turkce_text.encode('utf-8')) / 1024**2:.2f} MB")

tok = TurkceRegexBPE()
tok.train(turkce_text, vocab_size=1000, verbose=True)

Vocab inspection#

# Show the learned merges
print("First 50 learned tokens:")
for i in range(256, 306):
    print(f"  {i}: {tok.vocab[i].decode('utf-8', errors='replace')!r}")
Expected output:
256: ' '
257: 'e '
258: 'er'
259: 'le'
260: ' bi'
261: 'an'
262: 'ar'
263: 'in'
264: 'ler'   ← Turkish plural suffix!
265: 'lar'   ← Turkish plural suffix!
...
Turkish morphological patterns emerge on their own.

Eval: encoding a Turkish paragraph#

test_text = """
Türkiye Cumhuriyeti, Avrupa ve Asya kıtaları arasında köprü konumundadır.
Anadolu yarımadasında 783,562 km² alana sahiptir.
"""

ids = tok.encode(test_text)
print(f"Text bytes: {len(test_text.encode('utf-8'))}")
print(f"Tokens: {len(ids)}")
print(f"Bytes per token: {len(test_text.encode('utf-8')) / len(ids):.2f}")

# Custom (1K vocab): ~2.0 bytes/token (decent)
# GPT-4 cl100k:      ~3-4 bytes/token (multilingual)
# Turkish-tuned:     ~4-5 bytes/token

9. Comparison: Custom vs Trendyol-LLM vs OpenAI#

# Custom tokenizer (our work)
custom_tok = TurkceRegexBPE()
custom_tok.train(turkce_text, vocab_size=10000)

# Trendyol-LLM tokenizer
from transformers import AutoTokenizer
trendyol_tok = AutoTokenizer.from_pretrained("Trendyol/Trendyol-LLM-7b-base-v1.0")

# OpenAI tiktoken (GPT-4o)
import tiktoken
gpt4o_tok = tiktoken.get_encoding("o200k_base")

# Test text
test = """
Yapay zeka, makine öğrenmesi alanında büyük ilerlemeler kaydetti.
Türkiye'de bu teknolojinin gelişimi son yıllarda hızlandı.
2026 yılı itibarıyla, frontier modeller Türkçe'yi neredeyse mükemmel anlıyor.
"""

custom_tokens = custom_tok.encode(test)
trendyol_tokens = trendyol_tok.encode(test, add_special_tokens=False)
gpt4o_tokens = gpt4o_tok.encode(test)

print(f"Text bytes: {len(test.encode('utf-8'))}")
print(f"Custom (10K vocab): {len(custom_tokens)} tokens")
print(f"Trendyol-LLM (32K vocab): {len(trendyol_tokens)} tokens")
print(f"GPT-4o o200k (200K vocab): {len(gpt4o_tokens)} tokens")

Expected result#

Text bytes: 350
Custom (10K vocab):        110 tokens  (~3.2 bytes/token)
Trendyol-LLM (32K vocab):   75 tokens  (~4.7 bytes/token)
GPT-4o o200k (200K vocab):  95 tokens  (~3.7 bytes/token)

Analysis#

  • Trendyol-LLM is optimized for Turkish → the most compact (4.7 bytes/token)
  • GPT-4o's o200k is multilingual → in the middle (3.7 bytes/token): better than cl100k, but not as good as Trendyol
  • The custom tokenizer has a small vocab (10K) and a small training corpus → the least compact

The advantage of a Turkish-tuned tokenizer#

The same paragraph:
  • Trendyol-LLM: 75 tokens → $0.001 (at gpt-5-mini pricing)
  • GPT-4o: 95 tokens → $0.0013 (~27% more expensive)
  • Custom 10K: 110 tokens → $0.0014 (~47% more expensive)
For a Turkish company, a Turkish-tuned tokenizer is a direct cost saving.

10. Save / Load + Production Patterns#

import json
import base64

def save(self, path):
    data = {
        "merges": {f"{k[0]},{k[1]}": v for k, v in self.merges.items()},
        "vocab": {str(k): base64.b64encode(v).decode("ascii") for k, v in self.vocab.items()},
        "pattern": self.pattern.pattern if hasattr(self, "pattern") else None,
    }
    with open(path, "w") as f:
        json.dump(data, f)

def load(self, path):
    with open(path) as f:
        data = json.load(f)
    self.merges = {tuple(map(int, k.split(","))): v for k, v in data["merges"].items()}
    self.vocab = {int(k): base64.b64decode(v) for k, v in data["vocab"].items()}
    if data["pattern"]:
        self.pattern = re.compile(data["pattern"])

File size#

A 10K vocab with ~10K merges → roughly 500 KB of JSON. Production formats (e.g., HuggingFace's) are more compact.

Production deployment#

In production:
  1. Pre-load once: load the tokenizer at server boot
  2. Multi-threaded encoding: HuggingFace tokenizers are already thread-safe in Rust
  3. Cache results: for common prompts (see the sketch after this list)
  4. Validate: a roundtrip test in production (encode → decode → compare)
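A minimal sketch of items 3 and 4, assuming tok is the trained TurkceRegexBPE from the earlier sections (the function names here are ours, not from any library):

from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_encode(text: str):
    # lru_cache requires hashable arguments; returning a tuple keeps the cached value immutable
    return tuple(tok.encode(text))

def validate_roundtrip(text: str) -> bool:
    # encode -> decode -> compare; log or alert on a mismatch in a real service
    return tok.decode(list(cached_encode(text))) == text

assert validate_roundtrip("Merhaba dünya!")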

When to use custom Python BPE?#

  • Pedagogical / learning (this lesson)
  • Research prototyping
  • Small-scale projects (< 1K req/s)
When to use HuggingFace tokenizers / tiktoken (production):
  • Production, high QPS
  • Compatibility with existing models
  • Multi-thread efficiency

11. Limitations + Next Steps#

Limits of our implementation#

  1. Slow: pure Python, single-threaded → impractical for large corpora
  2. No multiprocessing for the stats counting
  3. Memory: the entire corpus is held in RAM
  4. No incremental training: new data requires a full retrain
  5. No subword regularization (Kudo 2018)

Production-grade alternatives#

  • HuggingFace tokenizers (Rust): fast, multi-threaded, well-tested (see the sketch below)
  • tiktoken (Rust): OpenAI's production tokenizer
  • SentencePiece (C++): Google's multilingual tokenizer
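For contrast, a rough sketch of training a comparable byte-level BPE with the HuggingFace tokenizers library; the file name turkce_corpus.txt and the 10K vocab size are assumptions carried over from the examples above.

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE trained in Rust: the same corpus finishes in seconds rather than minutes
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=10000,
    special_tokens=["<s>", "</s>", "<pad>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train(["turkce_corpus.txt"], trainer)
tokenizer.save("turkce_bpe.json")

ids = tokenizer.encode("Merhaba dünya!").ids
print(ids, tokenizer.decode(ids))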

The rest of Module 6#

  • 6.4: WordPiece (BERT) — the likelihood-based variant
  • 6.5: SentencePiece — language-agnostic
  • 6.6: Byte-level BPE in detail (the GPT family)
  • 6.7: tiktoken in production
  • 6.8: HuggingFace tokenizers — the Rust-backed library
  • 6.9: Turkish tokenization in depth — production tips
  • 6.10: Custom domain tokenizers — code, biomedical, legal

Capstone C13#

At the end of Module 6, the TurkTokenizer capstone: a tokenizer optimized for Turkish (built in Modules 6.9-6.10, with the result published to the HuggingFace Hub).

12. Mini Exercises#

  1. Vocab inspection: train a 1000-vocab tokenizer on a Turkish corpus. What are the first 20 learned tokens?
  2. Compression ratio: 'Yapay zeka teknolojisi geliştirilmektedir.' — how many tokens with a 5K-vocab tokenizer?
  3. Effect of pre-tokenization: BasicBPE vs RegexBPE on the same corpus with the same vocab size → what is the compression difference?
  4. Turkish morphology: how does the tokenizer split 'evlerimizden'? Expectation vs reality.
  5. Production decision: 100 req/s, Turkish customer support. Custom BPE or a production library?

What Did We Learn in This Lesson?#

✓ The Karpathy minbpe philosophy + the Turkish extension
✓ The BasicBPE class implemented from scratch
✓ The training function: the Sennrich algorithm in pure Python
✓ Encode/decode with roundtrip verification
✓ Regex pre-tokenization: GPT-2 style + Turkish-aware
✓ The byte-level extension and special tokens
✓ Training on a Turkish corpus: a practical example
✓ Comparison: custom vs Trendyol-LLM vs GPT-4o
✓ Save/load + production patterns
✓ Limitations + the bridge to Module 6.4+

Next Lesson#

6.4 — WordPiece: BERT's Choice and a Likelihood-Based Subword Alternative to BPE. Google's WordPiece algorithm: merging by likelihood, and its use across the BERT family. Why it differs slightly from BPE, and when to prefer it.

Frequently Asked Questions

Is this implementation production-ready?
**No — pedagogical only.** For production: HuggingFace tokenizers (Rust), tiktoken (Rust), SentencePiece (C++). Pure single-threaded Python BPE encodes at roughly ~1 MB/s; production needs 100+ MB/s. The Module 6.3 implementation is about **understanding**, not production use. Module 6.8 covers production tokenizers in detail.
