How different is OpenAI's new tokenizer (o200k) from the old cl100k?

The main differences: (1) **Vocab size**: roughly 100K → 200K entries. (2) **Multilingual coverage**: noticeably better for Chinese, Korean, Turkish, and Hindi (~20-40% fewer tokens). (3) **Number tokenization**: as in cl100k, digit runs are split into chunks of at most three digits, which keeps number splits regular (important for math). (4) **Code tokenization**: better handling of whitespace and indentation. (5) **Glitch token reduction**: known glitch tokens were cleaned up. o200k is used by GPT-4o and later models. Practical effect: ~25% token savings for Turkish.

Is Anthropic Claude's tokenizer different from OpenAI's?

Yes — Anthropic uses its own tokenizer (custom BPE). Specifics not public but: (1) **Multilingual** focused. (2) **Code-friendly**. (3) **Token count** ~10-20% different from OpenAI (usually Anthropic slightly more efficient on code). Both APIs have **their own token counting** endpoints. Practical: when comparing costs, measure each one's tokenization, don't count words.

Are there defenses against glitch token attacks?

Several approaches: (1) **Tokenizer audit**: detect untrained tokens in vocab (embedding norm anomaly). (2) **Input sanitization**: glitch token detection rules, auto-cleaning. (3) **Embedding analysis**: in production check embeddings of incoming tokens — flag outliers. (4) **Adversarial training**: during fine-tuning include glitch tokens so model behaves normally. Module 55 (Red Teaming) details this topic.
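The tokenizer audit in point (1) can be sketched as an outlier check on embedding norms. This is a minimal illustration on a hypothetical toy embedding table; the function name `flag_norm_outliers`, the robust z-score, and the threshold of 3.0 are assumptions made for the example, not a prescribed method.

```python
import math
import statistics

def flag_norm_outliers(embeddings, z_threshold=3.0):
    """Return tokens whose embedding norm is far from the median norm.

    embeddings: dict mapping token string -> list of floats.
    Uses a robust z-score (median and MAD) so one outlier cannot hide itself.
    """
    norms = {tok: math.sqrt(sum(x * x for x in vec)) for tok, vec in embeddings.items()}
    median = statistics.median(norms.values())
    mad = statistics.median(abs(n - median) for n in norms.values())
    scale = 1.4826 * mad or 1e-12  # avoid division by zero when all norms match
    return sorted(tok for tok, n in norms.items() if abs(n - median) / scale > z_threshold)

# Hypothetical 2-dimensional embeddings; "glitch" barely moved from a tiny init.
toy_embeddings = {
    "the": [1.0, 0.2],
    "house": [0.9, 0.4],
    "ev": [1.1, 0.1],
    "ler": [1.0, 0.3],
    "glitch": [0.01, 0.02],
}
suspects = flag_norm_outliers(toy_embeddings)
print(suspects)
```

In a real audit the table would come from the model's input embedding matrix, and flagged tokens would still need a behavioral check before being treated as glitch tokens.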

Is tokenization performance critical in long context models?

Very critical. In models with long context windows (Claude at 200K tokens, Gemini 1.5 Pro and GPT-5 at up to 1M): (1) **Token efficiency** directly determines **effective context length**. At 2.5x more tokens per text, a nominal 1M-token window holds as much Turkish as a 400K window holds English. (2) **Attention complexity**: O(T²) means compute grows quadratically with sequence length T, so fewer tokens is much faster. (3) **KV cache**: memory grows linearly with the token count (roughly tokens × layers × hidden size, times two for keys and values), so the same Turkish text needs about 2.5x more KV cache memory. A Turkish-tuned tokenizer is essential for long context.
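A rough worked example of the KV cache point, assuming standard attention where keys and values are both cached. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) and the 2.5x token ratio are hypothetical numbers chosen for illustration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: keys + values for every layer and token."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

english_tokens = 100_000
turkish_tokens = int(english_tokens * 2.5)  # same text, assumed 2.5x more tokens

en_bytes = kv_cache_bytes(english_tokens, layers=32, kv_heads=8, head_dim=128)
tr_bytes = kv_cache_bytes(turkish_tokens, layers=32, kv_heads=8, head_dim=128)

print(f"English: {en_bytes / 2**30:.1f} GiB")
print(f"Turkish: {tr_bytes / 2**30:.1f} GiB")
print(f"Memory ratio: {tr_bytes / en_bytes:.2f}x")
# Attention compute scales with T squared, so the compute ratio is larger still.
print(f"Attention compute ratio: {(turkish_tokens / english_tokens) ** 2:.2f}x")
```

The cache grows linearly with the token count while attention compute grows quadratically, which is why tokenizer efficiency matters more as contexts get longer.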

Multilingual model vs monolingual Turkish model — which is advantageous?

Trade-off. (1) **Multilingual** (Llama 3, Qwen): broad knowledge, code, multilingual capability, but not Turkish-efficient. (2) **Turkish-tuned** (Trendyol-LLM, BERTurk variants): Turkish-efficient but weak in English code/scientific knowledge. **Practical**: **hybrid approach** in production — multilingual base + Turkish-tuned tokenizer + Turkish instruction fine-tune. Best-of-both-worlds. Modules 17, 21, 58 detail.

Tokenization as Part of the Mental Model: Token Economics, Turkish Pitfalls, and Glitch Tokens

Q: How long does it take to develop a Turkish-optimized tokenizer?

One week full-time engineering. Steps: (1) **Corpus collection** (2-3 days): Turkish Wikipedia, academic, news, literature — ~50GB. (2) **BPE training** (1-2 days): HuggingFace tokenizers with vocab=32K-64K. (3) **Evaluation** (1-2 days): tokens/character ratio on test set, downstream task. (4) **Iterations** (remaining): tune hyperparameters. Quality can reach native BERTurk level. Module 6 and Capstone C13 (TurkTokenizer) detail this.
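The evaluation step (3) can be sketched as a tokens-per-character measurement. The helper below is a minimal illustration that accepts any `encode` callable; the two toy encoders and the sample sentences are hypothetical stand-ins for a trained tokenizer and a held-out test set.

```python
def tokens_per_char(encode, texts):
    """Average number of tokens per character over a list of texts (lower is better)."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_tokens / total_chars

# Two toy encoders that bracket the realistic range.
def char_encode(text):
    return list(text)      # worst case: one token per character

def whitespace_encode(text):
    return text.split()    # optimistic case: one token per word

sample = ["evlerimden geliyorum", "İstanbul çok güzel"]
print(f"char-level : {tokens_per_char(char_encode, sample):.3f}")
print(f"word-level : {tokens_per_char(whitespace_encode, sample):.3f}")
```

A trained BPE tokenizer should land between these two values; comparing candidates on the same test set gives a quick first signal before running downstream tasks.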

How token boundaries shape predictions, token economics in morphologically rich languages like Turkish, 'glitch tokens' like SolidGoldMagikarp, leading whitespace problem, token-level detail of prompt engineering. Practical foundation for Module 6 (Tokenization Microsurgery).

Şükrü Yusuf KAYA

50 min read

5/13/2026

Intermediate


🔤 The 'one token = one word' myth falls apart

Tokenization is the 'invisible layer' of working with LLMs. Most engineers know that 'token count matters' but miss the details. Fifty minutes from now you will know: why 'merhaba' is 1 token in a Turkish model but 3 tokens in an English one; what a leading space changes in GPT prompts; why strange tokens like SolidGoldMagikarp exist; and how to optimize your prompt with token-level analysis.

Lesson Map

Why do we need tokenization? The input/output language of an LLM
The essence of the BPE algorithm (a quick preview of Module 6)
Token boundary effect: how it shapes predictions
Turkish token economics: the ugly truth
The leading whitespace trap: " hello" vs "hello"
Special tokens: BOS, EOS, PAD, and role tokens
Glitch tokens: the SolidGoldMagikarp story
Token-level analysis of a prompt
tiktoken and tokenizers in practice
Token-aware prompt engineering

1. Why Do We Need Tokenization?

LLMs operate on discrete sequences, but language is open-ended: text can be any length, in any language, with any characters. The solution is a discrete vocabulary.

Three levels

Level	Vocab	Advantage	Disadvantage
Character	~100	Any string can be represented	Very long sequences
Word	~50K-100K	Meaningful units	OOV (out-of-vocabulary) words, morphology
Subword (BPE/WordPiece)	~30K-200K	Sweet spot	Result depends on the algorithm chosen

Modern LLMs use subword tokenization: BPE (GPT, Llama), WordPiece (BERT), SentencePiece (T5, multilingual models).

How an LLM sees the world

An LLM does not see language as words, letters, or concepts. It sees token IDs, which are plain integers. For example (the IDs here are illustrative):

"Merhaba dünya!" → tokenize → [42, 7891, 287, 0] (4 tokens)
                  → vocab lookup → 4 embedding vectors
                  → forward pass
                  → predict next token ID

Everything goes through IDs. Tokenization is the LLM's 'translation layer'.

2. The Essence of the BPE Algorithm

Module 6 covers this in detail; here is a quick preview.

BPE steps

Start: every character (or byte) is in the vocab
Repeat:
- Find the most frequent adjacent pair of tokens in the corpus
- Add it to the vocab as a new token
- Replace every occurrence of that pair in the corpus with the new token
Stop: when the vocab reaches the target size (e.g. 50K)

Example

Corpus: "low low low lower newer"

Start: {l, o, w, e, r, n, ' '}
Most frequent pair: l + o → add "lo" to the vocab, so "low" becomes lo + w
Next: lo + w → "low", then e + r → "er"
Continuing this way, meaningful units such as "low", "lower", and "newer" emerge
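The steps above can be sketched as a toy BPE trainer. This is a minimal illustration, not a production implementation: it pre-splits the corpus on whitespace (so merges never cross word boundaries), works on characters rather than bytes, and breaks ties by taking the first pair encountered. The function name `train_bpe` is chosen for this example.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merges from a whitespace-split corpus."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite every word, replacing the best pair with the merged symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

merges = train_bpe("low low low lower newer", num_merges=4)
for left, right in merges:
    print(f"{left} + {right} -> {left + right}")
```

On this corpus the learned merges are l+o, lo+w, e+r, and low+er, so "low" and "lower" end up as single symbols after four merges.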

Key properties

Subword: words are split into their building blocks
Frequency-driven: common words become a single token, rare ones several tokens
No OOV: the tokenizer can always fall back to the character (or byte) level
Language-specific: the result depends on the corpus distribution

Practical effect

The fact that the same word is tokenized differently by an English and a Turkish tokenizer is entirely a consequence of the corpus distribution.

3. Token Boundary Effect: How It Shapes Predictions

Model predictions are very sensitive to token boundaries. The same string tokenized differently can produce different output.

Classic example (from the GPT-2/3 era)

Numbers:

"123" → [123] (single token)
"1234" → [12, 34] (2 tokens)
"12345" → [12, 345] (2 tokens)
"123456" → [123, 456] (2 tokens)

How a number is tokenized affects the model's arithmetic performance. This is part of the reason GPT-3 stumbled on simple arithmetic. Later tokenizers made number splitting regular: Llama 1/2 give every digit its own token, while GPT-4's cl100k_base and Llama 3 split digit runs into chunks of at most three digits.
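A small sketch of what regular digit splitting looks like at the pre-tokenization stage. The helper `chunk_digits` is hypothetical; it imitates the idea of splitting a digit run left to right into fixed-size chunks, with `max_digits=1` for one token per digit and `max_digits=3` for three-digit chunks. It does not call any real tokenizer.

```python
import re

def chunk_digits(number_text, max_digits):
    """Split a digit string left to right into chunks of at most max_digits."""
    return re.findall(rf"\d{{1,{max_digits}}}", number_text)

print(chunk_digits("1234567", max_digits=1))  # one chunk per digit
print(chunk_digits("1234567", max_digits=3))  # chunks of up to three digits
```

Either scheme is deterministic, unlike frequency-driven merges where "1234" and "12345" can split at unrelated positions.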

Turkish example

"İstanbul" on its own → ["İstanbul"], a single token (common word)
"İstanbulluyum" → ["İstanbul", "lu", "yum"] (3 tokens)
"İstanbulluymuşsunuz" → ["İstanbul", "lu", "ymuş", "sunuz"] (4 tokens)

Tokenization that follows morpheme boundaries is ideal, but BPE does not guarantee it; the splits depend on the corpus distribution.

Spelling problems

Asked "How many 'r's in 'strawberry'?", GPT-4 gave the wrong answer for a long time. Why?

The model receives strawberry as tokens, for example [stra, w, berry] or [straw, berry]. It never sees the individual characters, only token IDs, so it cannot simply count letters. Getting the right answer takes either step-by-step reasoning or tool use (Python).
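For comparison, this is how trivial the task is once it is delegated to a tool that does see characters. A minimal sketch of the kind of snippet a model might run through a Python tool:

```python
word = "strawberry"
letter = "r"

count = word.count(letter)
positions = [i for i, ch in enumerate(word) if ch == letter]

print(f"'{letter}' appears {count} times in '{word}' at indexes {positions}")
```

The code operates on characters, so token boundaries play no role in the result.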

python

# Classic tokenization experiments
import tiktoken

enc_gpt2 = tiktoken.get_encoding("gpt2")
enc_gpt4 = tiktoken.get_encoding("cl100k_base")        # GPT-3.5/4
enc_gpt5 = tiktoken.get_encoding("o200k_base")          # GPT-4o, GPT-5

def show_tokens(text, encoders):
    """Print how each encoder splits the given text."""
    print(f"\n--- '{text}' ---")
    for name, enc in encoders:
        tokens = enc.encode(text)
        # Decode token by token; a character split across tokens prints as U+FFFD
        decoded = [enc.decode([t]) for t in tokens]
        print(f"{name:15} ({len(tokens)} tokens): {decoded}")

encoders = [("GPT-2", enc_gpt2), ("GPT-4 cl100k", enc_gpt4), ("GPT-5 o200k", enc_gpt5)]

# English vs Turkish comparison
show_tokens("Hello world", encoders)
show_tokens("Merhaba dünya", encoders)

# Evolution of number tokenization
show_tokens("12345", encoders)
show_tokens("1234567890", encoders)

# Turkish morphology test
show_tokens("evlerimden", encoders)
show_tokens("Türkiye'nin başkenti", encoders)

# The character counting problem
show_tokens("strawberry", encoders)

Tokenization comparison: the GPT-2, GPT-4, and GPT-5 tokenizers.

4. Turkish Token Economics: The Ugly Truth

Turkish is morphologically rich (agglutinative): one root plus many suffixes forms a single word. BPE rarely sees enough occurrences of each inflected form to merge it into one token, because every individual combination of suffixes is rare.

Numerical effect (GPT-4 cl100k_base)

English	Tokens	Turkish	Tokens
"house"	1	"ev"	1
"my house"	2	"evim"	3
"in my house"	3	"evimde"	4
"from my house"	3	"evimden"	4
"from my houses"	4	"evlerimden"	5

Turkish uses fewer characters but more tokens. Typical ratio: 2-3x more tokens for the same content.

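Cost effect

OpenAI API pricing is per token. Take GPT-5 input at $1.25 per 1M tokens and the same paragraph in both languages:

English: 100 tokens → $0.000125
Turkish: 250 tokens → $0.0003125

At 2.5-3x the tokens, Turkish input costs 150-200% more. For one short paragraph the gap is tiny (about $1.88 across 10K requests a year), but it scales linearly: with prompts of around 10K English tokens, 10K requests a year already cost roughly $190 more in Turkish.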
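The arithmetic for the 100 vs 250 token paragraph can be worked through in a few lines. The helper `input_cost_usd` is a hypothetical name for this sketch; the $1.25 per 1M input tokens price and the token counts are the figures used in this section, and output tokens are ignored.

```python
def input_cost_usd(tokens_per_request, requests, price_per_million=1.25):
    """Input-side cost in USD for a number of identical requests."""
    return tokens_per_request * requests * price_per_million / 1_000_000

requests_per_year = 10_000
en_cost = input_cost_usd(100, requests_per_year)   # English paragraph
tr_cost = input_cost_usd(250, requests_per_year)   # same paragraph in Turkish

print(f"English: ${en_cost:.2f}/year")
print(f"Turkish: ${tr_cost:.2f}/year")
print(f"Turkish costs {tr_cost / en_cost:.1f}x as much, {tr_cost - en_cost:.3f} USD more")
```

The relative gap stays at 2.5x whatever the volume; the absolute gap grows linearly with prompt length and request count, so it only becomes a budget item for long prompts or high traffic.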

Context length effect

Take a nominal context window of 1M tokens, the figure used here for GPT-5. At 2-3x more tokens per text, that window holds only about a third to a half as much Turkish content as English. This is critical for long-context work.

Solutions

Turkish-optimized tokenizer (BERTurk, Turkish BPE): ~30-50% savings
Multilingual tokenizer (Mistral, Qwen, Llama 3): somewhat better
Custom domain tokenizer (trained on a domain-specific corpus): large savings

Module 6 covers tokenizer training in detail.

5. The Leading Whitespace Trap

Modern BPE tokenizers produce tokens that include the preceding space:

" Ankara" → [" Ankara"] (1 token, leading space included)
"Ankara" → ["Ank", "ara"] (2 tokens, no leading space)

Why? In running English text most words follow a space ('I want the answer'), so the tokenizer sees " the" far more often than "the" and merges the space into the token.

Effect on the prompt

Prompt: "The capital is " (ends with a space)
What the model must produce: "Ankara" (no space). The space has already been consumed, so the natural token " Ankara" is no longer available and the model has to fall back to other, rarer tokens.

The reverse:

Prompt: "The capital is" (no trailing space)
What the model produces: " Ankara" (with the space), the natural continuation

Mistake: adding a trailing space

Many beginners end the prompt with a space. This pushes the model off its training distribution: higher loss and less stable generation.

Correct pattern

prompt = "The capital is"           # ✓ no trailing space
prompt = "The capital is Ank"        # ✓ the model continues with "ara"
prompt = "The capital is " + ...     # ✗ trailing space, avoid
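The effect can be reproduced with a toy tokenizer. The sketch below uses a hypothetical seven-entry vocabulary and greedy longest-match splitting, which is a simplification and not how BPE actually applies merges; it only illustrates why a trailing space changes the tokens the model has to produce.

```python
# Hypothetical toy vocabulary; real vocabularies are learned from data.
VOCAB = [" Ankara", "Ank", "ara", " capital", " is", "The", " "]

def greedy_tokenize(text, vocab=VOCAB):
    """Split text by always taking the longest vocabulary entry that matches."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        # Fall back to a single character when nothing in the vocab matches.
        match = next((tok for tok in by_length if text.startswith(tok, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens

# The prompt is tokenized first; the model then generates the continuation.
good = greedy_tokenize("The capital is") + greedy_tokenize(" Ankara")
bad = greedy_tokenize("The capital is ") + greedy_tokenize("Ankara")

print(len(good), good)
print(len(bad), bad)
```

Without the trailing space the continuation is the single token " Ankara"; with it, the prompt ends in a lone space token and the city name has to be assembled from "Ank" and "ara".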

How chat templates solve this

Modern chat templates sidestep the problem with special tokens such as <|user|>...<|assistant|>, which mark exactly where the model's turn begins. If you use a raw completion API, though, stay careful.

6. Special Tokens: BOS, EOS, PAD, and More

Tokenizers contain special tokens alongside the ordinary word tokens.

Common special tokens

Token	Meaning
<|begin_of_text|> or <s> (BOS)	Beginning of sequence
<|end_of_text|> or </s> (EOS)	End of sequence
<|pad|> (PAD)	Padding
<|unk|> (UNK)	Unknown token
<|user|>, …	Role markers
<|im_start|>, …	Chat turn delimiters
<|tool_call|>, …	Tool-call delimiters

Chat template example (Llama 3)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi! How can I help you?<|eot_id|>

This format makes the chat structure explicit in the token stream. The model learned this pattern during fine-tuning.
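To make the structure concrete, here is a hand-rolled sketch that assembles the same layout from a message list. The function `render_llama3_chat` is hypothetical, and the blank line between each header and its content is an assumption based on Meta's published prompt format. In real code the template should come from the tokenizer itself rather than from string concatenation like this.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Assemble a Llama-3-style prompt string from role/content dicts."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to speak.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = render_llama3_chat(messages)
print(prompt)
```

Note that the prompt ends right after the assistant header, with no trailing space, which is how the template avoids the leading-whitespace problem from the previous section.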

In practice

With HuggingFace:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # filled in with the Llama 3 chat format

Module 20 (SFT) covers chat templates in detail.

7. Glitch Tokens: The SolidGoldMagikarp Story

In February 2023 a researcher (Jessica Rumbelow) shared a strange discovery on Twitter: when GPT-3 sees the token SolidGoldMagikarp it behaves oddly, producing random words and getting confused.

What happened

SolidGoldMagikarp is a Reddit username. Reddit was heavily represented in the corpus used to build the BPE vocabulary, and this username appeared so often that BPE turned it into a single token.

Later, training data filtering (toxic content removal and so on) dropped that Reddit content. Result: the token is in the vocab but was never seen in the training data → its embedding stayed random → the model reacts to the token chaotically.

Other glitch tokens

petertodd (Bitcoin developer)
davidjl (Reddit username)
Streamerbot (Twitch bot)
StreamerBot (case-sensitive variant)
Many other Reddit usernames

Practical effect

Prompt injection: an attacker can place a glitch token in the prompt to confuse the model
Hallucination source: a rare input can occasionally produce a glitch token
Tokenization audit: modern safety-conscious LLM teams audit their tokenizers

In modern models

GPT-4 cl100k_base: most glitch tokens cleaned up
GPT-5 o200k_base: tokenizer revised
Llama 3, Qwen, Mistral: careful tokenizer training

For you

Run a token-level test: sample 1000 random tokens from the vocab, feed them to the model one at a time, and observe the output. Anomalies are a sign of a glitch token.
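The sampling test can be sketched as a repeat-back check: ask the model to echo each sampled token and flag the ones it cannot reproduce. Everything model-related here is an assumption: `ask_model` stands for whatever function calls your LLM, and `fake_model` is a stub that only exists so the sketch runs without an API.

```python
import random

def find_glitch_candidates(vocab, ask_model, sample_size=1000, seed=0):
    """Return sampled tokens that the model fails to repeat verbatim."""
    rng = random.Random(seed)
    sample = rng.sample(vocab, min(sample_size, len(vocab)))
    suspects = []
    for token in sample:
        reply = ask_model(f'Repeat this string exactly: "{token}"')
        if token.strip() not in reply:
            suspects.append(token)
    return sorted(suspects)

def fake_model(prompt):
    """Stub LLM: echoes the quoted string, except for one simulated glitch token."""
    if "SolidGoldMagikarp" in prompt:
        return "distribute"
    return prompt.split('"')[1]

toy_vocab = ["hello", " world", " SolidGoldMagikarp", "ev"]
candidates = find_glitch_candidates(toy_vocab, fake_model)
print(candidates)
```

With a real model, failures to repeat are only candidates: rare byte fragments and control characters also fail this test and need manual review.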

8. Token-Aware Prompt Engineering

Professional prompt engineering requires thinking at the token level.

1. Tokenize your prompt

It is hard to estimate a prompt's token count by eye. Always measure:

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
prompt = "Your long detailed prompt..."
print(f"Tokens: {len(enc.encode(prompt))}")

2. Plan a token budget

GPT-5's context is taken as 1M tokens here, but keep in mind:

The limit covers input + output together
That means system prompt + user prompt + history + RAG context + output

Watch out in complex agents: 50K tokens of system prompt + 30K of conversation + 100K of RAG = 180K input tokens per call → high cost.
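A budget check like this can be automated before each call. The helper `check_budget` and the 4K output reservation are hypothetical; the component sizes, the 1M window, and the $1.25 per 1M input price are the figures used in this lesson.

```python
def check_budget(parts, max_output_tokens, context_window):
    """Sum the input components and report how much of the window is left."""
    total_input = sum(parts.values())
    remaining = context_window - total_input - max_output_tokens
    return {"input": total_input, "remaining": remaining, "fits": remaining >= 0}

parts = {"system": 50_000, "conversation": 30_000, "rag": 100_000}
report = check_budget(parts, max_output_tokens=4_000, context_window=1_000_000)
input_cost = report["input"] * 1.25 / 1_000_000  # USD per call, input side only

print(report)
print(f"Input cost per call: ${input_cost:.3f}")
```

The request fits comfortably in the window, yet every call pays for 180K input tokens, which is why the next subsection looks at compression and caching.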

3. Compression techniques

Prompt compression: LLMLingua, AutoCompressor, 50-70% savings
Caching: cache the prompt prefix (Anthropic, OpenAI), up to 90% cost savings on cached tokens
Summary instead of full context: summarize earlier messages with an LLM

4. Whitespace and formatting

No trailing spaces
Markdown formatting consumes tokens but improves clarity → trade-off
Inline code tokenization: backticks cost extra tokens

5. Multi-language strategy

If the product is Turkish + English:

System prompt in English (saves tokens)
User content in its native language
For a Turkish input, consider expanding it into an English equivalent in some cases

python

# Real-world: prompt token budget calculator
import tiktoken

# o200k_base matches the GPT-5 prices used below (cl100k_base is the GPT-4 encoding)
enc = tiktoken.get_encoding("o200k_base")

def analyze(label, text, output_tokens=200, model_price_in=1.25, model_price_out=10):
    """Print token count and cost for one request (prices in USD per 1M tokens)."""
    in_tokens = len(enc.encode(text))
    cost = (in_tokens * model_price_in + output_tokens * model_price_out) / 1e6
    print(f"{label:30} | in: {in_tokens:5} tok | out: {output_tokens} tok | ${cost:.6f}")

# Scenario: the same information in different languages
en = "Please write a 100-word summary of the article about climate change and its impact on Turkish agriculture."
tr = "İklim değişikliğinin Türk tarımına etkisi hakkındaki makalenin 100 kelimelik özetini yaz."

analyze("English prompt", en)
analyze("Turkish prompt", tr)

# RAG context simulation
context = "Sentence about climate." * 50
en_with_rag = en + "\n\nContext: " + context
analyze("EN + RAG", en_with_rag)

# Annual cost calculation
print("\nFor 10K requests per year:")
print(f"  EN: {len(enc.encode(en)) * 10000 * 1.25 / 1e6 + 200 * 10000 * 10 / 1e6:.2f} USD")
print(f"  TR: {len(enc.encode(tr)) * 10000 * 1.25 / 1e6 + 200 * 10000 * 10 / 1e6:.2f} USD")

Token budget calculation: a real-world cost analysis.

9. Mini Exercises

Token count estimation: "Bir LLM mühendisinin günde 100 prompt yazdığını düşün." ("Imagine an LLM engineer writing 100 prompts a day.") Estimate the token count of the Turkish sentence and of its English translation.
Glitch token detection: How would you test whether a token in an LLM tokenizer is a 'glitch'? Write an algorithm.
Trailing space effect: Give a model "The Eiffel Tower is in " (with a trailing space) and "The Eiffel Tower is in" (without). Which one has the lower perplexity?
Turkish tokenization optimization: An e-commerce company will run Turkish support on GPT-5. Annual cost is $50K. Propose 3 strategies to save tokens.
Number tokenization: How many tokens is "1234567" in GPT-2 and in GPT-5? What effect does the difference have on math benchmarks?

What Did We Learn in This Lesson?

✓ Tokenization = the translation layer between the LLM and the world
✓ The essence of the BPE algorithm: frequency-driven, subword
✓ Token boundary effect: numbers, words, spelling
✓ Turkish token economics: 2-3x cost, and the solutions
✓ The leading whitespace trap: no trailing space
✓ Special tokens: BOS, EOS, chat templates
✓ Glitch tokens: the SolidGoldMagikarp story
✓ Token-aware prompt engineering: professional practices
✓ Real-world cost calculation: the extra annual cost of Turkish

Next Lesson

4.3: The Art of Sampling: Greedy, Beam, Top-K, Top-P, Temperature, Min-P in Depth
We saw the basics of sampling in Lesson 1.5. Next come the production-level details: what each parameter does, repetition penalty, when beam search is needed, sampling for structured output, and sampling in modern reasoning models.
