Can I do the capstone project from scratch myself?

Yes, training script is open. (1) Corpus collection 4 hours. (2) Cleaning 2 hours. (3) Training 30 min (16-core CPU). (4) Eval 1 hour. (5) Model card 2 hours. (6) HF Hub publish 1 hour. Total ~10 hours, one day. No GPU needed (CPU training).

Apache 2.0 vs MIT — why Apache 2.0 for TurkTokenizer-tr?

Apache 2.0: (1) Patent grant clause — protection from patent trolls. (2) More explicit about contribution terms. (3) HuggingFace ecosystem standard. (4) Better defensibility with Wikipedia CC-BY-SA training data alignment. MIT would also work but Apache 2.0 preferred for modern large projects.

Is comparison with Trendyol-LLM tokenizer fair?

Apples-to-apples-ish: both 32K BPE Turkish-tuned. Differences: Trendyol corpus commerce-heavy (product descriptions), TurkTokenizer-tr general TR. One may be better than the other depending on domain. Fair comparison: same test corpus (Wikipedia, news), same evaluation framework. Result: comparable, TurkTokenizer-tr open-source alternative.

Will it get academic citations?

Open-source tokenizer projects can get citations within 1-2 years (e.g., Trendyol-LLM paper, ALEFI Turkish corpus). Boost strategies: (1) Reproducible benchmarks. (2) Detailed paper draft (arXiv submit). (3) Conference workshop (Turkish NLP Workshop). (4) Cross-citation: reference in own blogs.

What kind of community feedback comes after v1.0 release?

Typical: (1) 'Fertility high on Test X' (domain coverage feedback). (2) 'Embedding initialization issue with this vocab' (custom model integration issue). (3) Question why suffix '-larımız' isn't single token (vocab budget trade-off). (4) Feature request: code-specific variant. (5) Bug: edge case 'pe§i§re' characters.

Should someone completing the curriculum do this capstone themselves?

YES, absolutely. The curriculum is the 'pedagogical path' but real competence is **producing your own artifact**. Capstone alone: portfolio item, GitHub stars, HF Hub presence, academic citing, job interview talking point. Completing curriculum without capstone is half the work.

What's the next step for a student completing Module 6?

Modules 7-13 (Part II — Transformer Architecture Skeleton). 8 modules: Embeddings, Attention Math, Position Encoding, Transformer Block, Modern Architectures (Gated Attention), Mixture of Experts, Alternatives (RetNet, Mamba), Pre-training Dynamics. Then Part III (Training & Scaling), Part IV (Fine-tuning), Part V (Deployment). Full curriculum 25-30 modules.

Capstone TurkTokenizer-tr: Train, Evaluate, and Publish a Production-Grade Turkish Tokenizer to HuggingFace Hub

The work of Module 6: train TurkTokenizer-tr (32K vocab Turkish BPE) from scratch, evaluate with 6.9 framework, write model card, choose license, publish to HuggingFace Hub. Corpus curation (Wikipedia + OSCAR + news + literature + code), cleaning pipeline, chat template, production integration, maintenance roadmap. Synthesis of Modules 6.1-6.9, real-world artifact.

Şükrü Yusuf KAYA

90 min read

5/13/2026

Advanced


🏆 The curriculum's showpiece: TurkTokenizer-tr and its HuggingFace Hub release

A synthesis of Module 6's nine lessons: train your own Turkish tokenizer from scratch (32K-vocab BPE), evaluate it rigorously, document it, and publish it to the HuggingFace Hub. At the end you will have a real artifact in the worldwide open-source community. Others will read the model card you wrote, use the tokenizer under the license you chose, and debate the benchmark you published. After 90 minutes you will be able to carry out corpus collection, cleaning, training, evaluation, model card writing, the HF Hub workflow, version management, and maintenance planning end to end. Consider it your certificate as a professional open-source LLM engineer.

Capstone Flow (12 Stages)

Goal definition: why TurkTokenizer-tr, for whom, and which metrics define success
Corpus curation: a mix of Wikipedia + OSCAR + news + literature + code
Cleaning pipeline: HTML strip, dedup, language detection, quality filter
Training config: 32K BPE, byte_fallback, special tokens (chat-ready)
End-to-end training script: production-grade, about 25 min on 16 cores
Evaluation (6.9 framework): fertility, BPC, OOV, downstream proxy
Comparison: Llama-3, GPT-4o, Trendyol-LLM, BERT-Turkish
Model card: written to the Hugging Face standard
License choice: MIT vs Apache 2.0 vs CC-BY-SA, and when to pick which
HuggingFace Hub workflow: repo creation, upload, versioning
Production integration: transformers, vLLM, your own pipeline
Maintenance roadmap: versioning, community feedback, retraining schedule

1. Goal Definition

1.1 What TurkTokenizer-tr is

Name: TurkTokenizer-tr (community-friendly, descriptive)
Algorithm: BPE (byte-level)
Vocab size: 32,000
byte_fallback: True
Special tokens: chat-ready (ChatML-compatible)
Target language: Turkish (for TR-only fine-tunes)
License: Apache 2.0 (commercial-friendly)

1.2 Who it is for

People fine-tuning Turkish LLMs (a base-model swap alternative for those who do not want to modify Llama-3's vocab)
Turkish pre-training projects
Turkish NER/QA/sentiment fine-tunes (BERT-style)
Teams optimizing cost for Turkish workloads
Academic comparisons

1.3 Success criteria

Fertility ≤ 1.25 tokens/word (Wikipedia TR test set)
BPC ≤ 0.85 (with a proxy LM)
OOV rate ≤ 25%
Cross-domain fertility ≤ 1.5 (excluding legal and medical)
Indexed and downloadable on the HuggingFace Hub
Model card complete and reproducible
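
The fertility criterion is simply tokens divided by whitespace-delimited words. Below is a minimal sketch of that measurement. The `tokenize` callable and the `toy_tokenize` stand-in are hypothetical names introduced only so the example runs without a trained tokenizer; in practice you would pass a wrapper around your real tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(lines: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        total_words += len(words)
        total_tokens += len(tokenize(line))
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts every word into chunks of at most 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = ["balıkçı tekneleri sallanıyor"]
print(f"fertility = {fertility(sample, toy_tokenize):.2f}")
```

Whether punctuation tokens count toward the numerator is a convention you should fix once and state in the model card, since it shifts the number noticeably on short sentences.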

1.4 Timeline

Corpus collection: 4 hours
Cleaning: 2 hours
Training: 30 minutes
Evaluation: 1 hour
Model card + license: 2 hours
Hub publish: 1 hour
Total: about 10.5 hours, a one-day capstone

1.5 Existing similar projects

Trendyol-LLM tokenizer (closed, comparable)
BERT-base-Turkish-cased (WordPiece, BERT-style)
mukayese (Turkish NLP toolkit, fragmented)

TurkTokenizer-tr is meant to fill the gap as a fully open, fully documented, fully reproducible 32K Turkish BPE tokenizer.

2. Corpus Curation

2.1 Source selection (10 GB target)

Source	Size	License	Domain
Turkish Wikipedia (trwiki)	4 GB	CC-BY-SA	General
OSCAR-tr 2023	5 GB (subset)	CC0	Web crawl
Turkish news (BounWebCorpus)	1 GB	Academic	News
Turkish literature (Gutenberg + Wattpad, where permitted)	500 MB	Mixed	Literature
Turkish code comments (permissively licensed GitHub repos)	500 MB	Mixed	Code

The rows add up to about 11 GB of raw text; sampling to the domain mix in 2.4 brings this to the 10 GB target.
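2.2 Download script

from datasets import load_dataset

# Wikipedia: the dump date is an assumption; use whichever dump the Hub currently lists
wiki = load_dataset("wikimedia/wikipedia", "20231101.tr", split="train")
# Write plain text so the cleaning pipeline in section 3 can read it line by line
with open("corpus/wiki-tr.txt", "w", encoding="utf-8") as f:
    for article in wiki:
        f.write(article["text"] + "\n")

# OSCAR (gated dataset: accept its terms on the Hub and log in first)
oscar = load_dataset("oscar-corpus/OSCAR-2301", "tr", split="train", streaming=True)
TARGET_BYTES = 5 * 1024**3  # ~5 GB; count bytes, since document lengths vary widely
written = 0
with open("corpus/oscar-tr.txt", "w", encoding="utf-8") as f:
    for item in oscar:
        if written >= TARGET_BYTES:
            break
        f.write(item["text"] + "\n")
        written += len(item["text"].encode("utf-8")) + 1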

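2.3 License compliance

Wikipedia (CC-BY-SA): attribution required; derived works are share-alike
OSCAR (CC0): the CC0 grant covers OSCAR's packaging and metadata; the crawled text itself stays subject to its original rights, so treat it as low-restriction rather than restriction-free
Code (mixed): keep only permissive licenses (MIT, Apache, BSD); exclude GPL
News (Academic): research/educational use only; commercial use needs separate permission

We release TurkTokenizer-tr under Apache 2.0. Whether a tokenizer's vocab and merges count as a derived work of the CC-BY-SA training text is legally unsettled (see 9.4); our working position is that the vocab artifact falls outside the share-alike scope, and we attribute Wikipedia explicitly.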

2.4 Domain balance

Ideal mix (10 GB):

Wikipedia: 40%
OSCAR (general web): 35%
News: 10%
Literature: 5%
Code: 5%
Other: 5%
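
To turn the percentages into concrete sampling targets, the mix can be converted into per-source size budgets. This is a small sketch; `byte_budgets` and the dictionary keys are names chosen here for illustration, not part of any library.

```python
TARGET_GB = 10.0

# Shares from the ideal mix above (percent of the final corpus).
MIX_PERCENT = {
    "wikipedia": 40,
    "oscar": 35,
    "news": 10,
    "literature": 5,
    "code": 5,
    "other": 5,
}


def byte_budgets(target_gb: float, mix: dict) -> dict:
    """Translate percentage shares into per-source size budgets in GB."""
    total = sum(mix.values())
    if total != 100:
        raise ValueError(f"mix must sum to 100, got {total}")
    return {source: target_gb * share / 100 for source, share in mix.items()}


for source, gb in byte_budgets(TARGET_GB, MIX_PERCENT).items():
    print(f"{source:<12}{gb:>5.1f} GB")
```

A source that has more raw text than its budget is subsampled down; a source that has less either caps the total corpus size or is deliberately up-weighted, which is a judgment call worth recording in the model card.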

2.5 Turkish-specific considerations

Vowel harmony coverage: every suffix variant should be represented
Mixed Latin script: English loanwords, technical terms
Modern + Ottoman: an optional Ottoman corpus for historical text support
Dialects: prefer standard Istanbul Turkish; under-representing dialects is acceptable
Apostrophes: choose corpora that surface apostrophe-attached suffixes such as 'da and 'nın
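
One way to check the apostrophe point on a candidate corpus is to count which suffixes actually follow an apostrophe. The sketch below uses only the standard library; `apostrophe_suffix_counts` is a hypothetical helper name, and the regex is a deliberately simple assumption (it will also match things like English contractions).

```python
import re
from collections import Counter

# word + apostrophe + suffix, e.g. "İstanbul'da", "Ankara'nın". Accepts both ' and ’.
APOSTROPHE_SUFFIX = re.compile(r"\w+['’](\w+)")


def apostrophe_suffix_counts(lines):
    """Count the suffixes that appear right after an apostrophe."""
    counts = Counter()
    for line in lines:
        for suffix in APOSTROPHE_SUFFIX.findall(line):
            counts[suffix.lower()] += 1
    return counts


sample = [
    "İstanbul'da yaşıyor, Ankara'nın havası soğuk.",
    "İzmir'de deniz var, Bursa’da dağ.",
]
for suffix, n in apostrophe_suffix_counts(sample).most_common():
    print(f"'{suffix}: {n}")
```

If frequent suffixes such as 'da, 'de, 'nın or 'nin barely show up in the counts, the corpus is probably too light on proper-noun-heavy text (news, encyclopedic articles).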

3. Cleaning Pipeline

3.1 Steps

import re
import html
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is randomized by default; fix the seed for reproducible filtering

def clean_pipeline(text):
    # 1. HTML strip
    text = re.sub(r"<[^>]+>", "", text)
    text = html.unescape(text)

    # 2. Whitespace normalize
    text = re.sub(r"\s+", " ", text)
    text = text.strip()

    # 3. Min length filter
    if len(text) < 50:
        return None

    # 4. Language detect
    try:
        if detect(text) != "tr":
            return None
    except LangDetectException:  # raised when the text has no detectable features
        return None

    # 5. Quality filter (alphanumeric ratio)
    alnum = sum(1 for c in text if c.isalnum())
    if alnum / len(text) < 0.5:
        return None

    # 6. PII filter (email, phone; deliberately simple)
    text = re.sub(r"\S+@\S+", "<EMAIL>", text)
    text = re.sub(r"\+?\d[\d\s\-]{8,}\d", "<PHONE>", text)

    return text

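3.2 Deduplication

import hashlib

seen_hashes = set()  # fingerprints of every line kept so far

def dedupe(text):
    # MD5 is used as a fast fingerprint here, not as a security measure
    h = hashlib.md5(text.encode("utf-8")).hexdigest()
    if h in seen_hashes:
        return None  # exact duplicate of an earlier line
    seen_hashes.add(h)
    return text

This is exact deduplication. For fuzzy (near-duplicate) dedup, use MinHash + LSH (advanced).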
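
For readers curious what the MinHash half of that looks like, here is a standard-library sketch. It is an illustration under stated assumptions, not a production deduplicator: the signature length, the shingle size, and the use of salted BLAKE2b as the hash family are all choices made here, and the LSH banding step that makes lookups fast on large corpora is left out.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 64   # signature length (assumed; more hashes give a tighter estimate)
SHINGLE_SIZE = 5  # character n-gram size (assumed)


def shingles(text: str, k: int = SHINGLE_SIZE) -> Set[str]:
    """Set of overlapping character k-grams after whitespace and case normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(text: str, num_hashes: int = NUM_HASHES) -> List[int]:
    """For each salted hash function, keep the minimum hash value over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "İstanbul Boğazı'nda balıkçı tekneleri sallanıyor."
doc_b = "İstanbul Boğazı'nda balıkçı tekneleri sallanıyor!"
print(f"estimated similarity: {estimated_jaccard(minhash(doc_a), minhash(doc_b)):.2f}")
```

A typical policy is to drop one document of any pair whose estimated similarity exceeds a threshold you choose; comparing all pairs is quadratic, which is exactly the cost that LSH banding avoids.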

3.3 Pipeline orchestration

def process_corpus(input_files, output_file):
    with open(output_file, "w", encoding="utf-8") as out:
        for fname in input_files:
            with open(fname, encoding="utf-8") as f:
                for line in f:
                    cleaned = clean_pipeline(line)
                    if cleaned is None:
                        continue
                    deduped = dedupe(cleaned)
                    if deduped is None:
                        continue
                    out.write(deduped + "\n")

process_corpus(
    ["corpus/wiki-tr.txt", "corpus/oscar-tr.txt", "corpus/news-tr.txt"],
    "corpus/clean-tr.txt"
)

3.4 Quality metrics

import collections

def corpus_stats(file):
    line_lengths = []
    word_counts = collections.Counter()
    with open(file, encoding="utf-8") as f:
        for line in f:
            line_lengths.append(len(line))
            for word in line.split():
                word_counts[word] += 1
    
    print(f"Lines: {len(line_lengths):,}")
    print(f"Avg line length: {sum(line_lengths)/len(line_lengths):.1f}")
    print(f"Unique words: {len(word_counts):,}")
    print(f"Total words: {sum(word_counts.values()):,}")
    
    return word_counts

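3.5 Expected output

10 GB raw → ~8 GB clean (after dedup and quality filtering). A drop of about 20% is normal.

3.6 Modern alternative: HF datasets dedupe

from datasets import load_dataset

ds = load_dataset("text", data_files=["corpus/clean-tr.txt"], split="train")
unique_lines = ds.unique("text")  # exact dedup; returns a plain list of distinct lines
# Write plain text (not CSV) so the tokenizer trainer can read the file directly
with open("corpus/final-tr.txt", "w", encoding="utf-8") as f:
    f.writelines(line + "\n" for line in unique_lines)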

4. Training Config

4.1 Hyperparameters

VOCAB_SIZE = 32000
MIN_FREQUENCY = 2
MAX_TOKEN_LENGTH = 24

SPECIAL_TOKENS = [
    "<|endoftext|>",       # GPT-style
    "<|pad|>",
    "<|im_start|>",        # ChatML
    "<|im_end|>",
    "<|user|>",
    "<|assistant|>",
    "<|system|>",
    "<|tool|>",
    "<|python_tag|>",      # tool use
    "<|eot_id|>",          # Llama-3 compat
    "<EMAIL>",             # cleaning placeholder
    "<PHONE>",
    "<URL>",
    "<NUMBER>",
]

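4.2 Why these choices

32K: a good fit for Turkish-only use. Going to 50K buys a marginal gain (about 5% lower fertility) for roughly 1.6x the embedding parameters (50K / 32K).
byte_fallback=True: the SentencePiece-style (Llama-2) safeguard against UNK. With the byte-level pre-tokenizer used here, all 256 byte symbols are already in the initial alphabet, so UNK cannot occur and the flag is a harmless extra.
Chat-ready special tokens: directly usable in production fine-tunes.
Both ChatML and Llama-3 special tokens are included for cross-compatibility.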
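
The embedding side of the vocab-size trade-off is plain arithmetic: the embedding matrix has vocab_size × hidden_size parameters, and an untied output projection doubles that. The sketch below assumes a hidden size of 4096 purely for illustration; `embedding_params` is a helper name introduced here.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = False) -> int:
    """Parameters in the input embedding plus, unless tied, the output projection."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size


HIDDEN_SIZE = 4096  # assumed for illustration

for vocab in (32_000, 50_000):
    params = embedding_params(vocab, HIDDEN_SIZE)
    print(f"vocab={vocab:>6,}: {params / 1e6:.0f}M embedding parameters")

ratio = embedding_params(50_000, HIDDEN_SIZE) / embedding_params(32_000, HIDDEN_SIZE)
print(f"50K vs 32K: {ratio:.2f}x")
```

Because the cost scales linearly with vocabulary size, the ratio does not depend on the hidden size or on whether the embeddings are tied.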

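4.3 Normalizer

from tokenizers import Regex
from tokenizers.normalizers import Sequence, NFC, Replace, Strip

# Replace treats a plain str as a literal; wrap patterns in Regex to get regex matching
normalizer = Sequence([
    NFC(),
    Replace(Regex("[\u200B-\u200F\uFEFF\u2028\u2029]"), ""),   # zero-width + line sep
    Replace(Regex(r"\s+"), " "),
    Strip(),
])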

4.4 PreTokenizer

from tokenizers.pre_tokenizers import ByteLevel
pre_tokenizer = ByteLevel(add_prefix_space=True, use_regex=True)

4.5 Post-processor

from tokenizers.processors import ByteLevel as ByteLevelProcessor
post_processor = ByteLevelProcessor(trim_offsets=True)

4.6 Decoder

from tokenizers.decoders import ByteLevel as ByteLevelDecoder
decoder = ByteLevelDecoder()

4.7 Trainer

from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(
    vocab_size=VOCAB_SIZE,
    min_frequency=MIN_FREQUENCY,
    special_tokens=SPECIAL_TOKENS,
    initial_alphabet=ByteLevel.alphabet(),
    show_progress=True,
    max_token_length=MAX_TOKEN_LENGTH,
)

4.8 Chat template

A ChatML-format Jinja2 template:

# Each turn renders as <|im_start|>role\ncontent<|im_end|>\n (the usual ChatML layout)
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)

The template uses the ChatML format. If you need the Llama-3 format, fork a separate variant.
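
To sanity-check what a ChatML prompt should look like once rendered, the same layout can be built in plain Python without Jinja2. `render_chatml` is a hypothetical helper written for this illustration; in a real pipeline the tokenizer's `apply_chat_template` method renders the Jinja2 template instead.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Build a ChatML prompt: each turn is <|im_start|>role, newline, content, <|im_end|>, newline."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "Kısa cevap ver."},
    {"role": "user", "content": "Türkiye'nin başkenti neresi?"},
]
print(render_chatml(chat, add_generation_prompt=True))
```

Comparing this string with the output of the Jinja2 template on the same messages is a cheap regression test: a missing newline after `<|im_end|>` is easy to overlook and changes the token sequence the model sees.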

5. End-to-End Training Script (Production-Grade)

#!/usr/bin/env python3
# train_turktokenizer.py

import os
import time
from tokenizers import Regex, Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Sequence, NFC, Replace, Strip
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import ByteLevel as ByteLevelProcessor
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.trainers import BpeTrainer

# Config
VOCAB_SIZE = 32000
MIN_FREQUENCY = 2
MAX_TOKEN_LENGTH = 24
CORPUS_FILES = [
    "corpus/wiki-tr.txt",
    "corpus/oscar-tr-clean.txt",
    "corpus/news-tr.txt",
    "corpus/literature-tr.txt",
    "corpus/code-tr.txt",
]
OUTPUT_PATH = "turktokenizer-tr-32k.json"

SPECIAL_TOKENS = [
    "<|endoftext|>",
    "<|pad|>",
    "<|im_start|>",
    "<|im_end|>",
    "<|user|>",
    "<|assistant|>",
    "<|system|>",
    "<|tool|>",
    "<|python_tag|>",
    "<|eot_id|>",
    "<EMAIL>",
    "<PHONE>",
    "<URL>",
    "<NUMBER>",
]


def main():
    print("🚀 TurkTokenizer-tr eğitimi başlıyor...")
    start = time.time()

    # 1. Tokenizer init
    tokenizer = Tokenizer(BPE(unk_token=None, byte_fallback=True))

    # 2. Normalizer (Replace needs Regex; a plain str would be matched literally)
    tokenizer.normalizer = Sequence([
        NFC(),
        Replace(Regex("[\u200B-\u200F\uFEFF\u2028\u2029]"), ""),
        Replace(Regex(r"\s+"), " "),
        Strip(),
    ])
    
    # 3. Pre-tokenizer
    tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=True, use_regex=True)
    
    # 4. Post-processor
    tokenizer.post_processor = ByteLevelProcessor(trim_offsets=True)
    
    # 5. Decoder
    tokenizer.decoder = ByteLevelDecoder()
    
    # 6. Trainer
    trainer = BpeTrainer(
        vocab_size=VOCAB_SIZE,
        min_frequency=MIN_FREQUENCY,
        special_tokens=SPECIAL_TOKENS,
        initial_alphabet=ByteLevel.alphabet(),
        show_progress=True,
        max_token_length=MAX_TOKEN_LENGTH,
    )
    
    # 7. Train
    tokenizer.train(CORPUS_FILES, trainer)
    
    # 8. Save
    tokenizer.save(OUTPUT_PATH)
    print(f"✅ Saved to {OUTPUT_PATH}")
    
    # 9. Smoke test
    test_text = "İstanbul Boğazı'nda balıkçı tekneleri sallanıyor."
    result = tokenizer.encode(test_text)
    print(f"Test text: {test_text}")
    print(f"Tokens ({len(result.ids)}): {result.tokens}")
    
    elapsed = time.time() - start
    print(f"⏱️  Training time: {elapsed:.0f} seconds")


if __name__ == "__main__":
    main()

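5.1 Running it

export RAYON_NUM_THREADS=16
python train_turktokenizer.py

5.2 Expected output

🚀 TurkTokenizer-tr eğitimi başlıyor...
[1500/1500] 100% complete in 25:33
✅ Saved to turktokenizer-tr-32k.json
Test text: İstanbul Boğazı'nda balıkçı tekneleri sallanıyor.
Tokens (8): ['Ġİstanbul', 'ĠBoğazı', "'", 'nda', 'Ġbalıkçı', 'Ġtekneleri', 'Ġsallanıyor', '.']
⏱️  Training time: 1533 seconds

The token strings are shown in readable form; the raw result.tokens output uses the byte-level alphabet, so non-ASCII letters such as İ, ğ, ı and ç appear as two-character byte symbols. 8 tokens for 5 words; leaving out the apostrophe and the period, that is 6 subword tokens for 5 words, a fertility of 1.2. Excellent.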

5.3 transformers compatibility

from transformers import PreTrainedTokenizerFast

fast = PreTrainedTokenizerFast(
    tokenizer_file="turktokenizer-tr-32k.json",
    eos_token="<|endoftext|>",  # register special tokens so transformers knows their roles
    pad_token="<|pad|>",
)
fast.chat_template = CHAT_TEMPLATE  # from section 4.8
fast.save_pretrained("./turktokenizer-tr-32k-hf")
# Output: tokenizer.json + tokenizer_config.json + special_tokens_map.json

6. Evaluation: Applying the 6.9 Framework

6.1 Test suites

# load_text / load_flores: helper functions assumed to come from the 6.9 evaluation code
TEST_CORPORA = {
    "wiki_test": load_text("test/wiki-tr-test.txt"),       # 1M token
    "news_test": load_text("test/news-tr-test.txt"),       # 500K
    "legal_test": load_text("test/legal-tr-test.txt"),     # 200K
    "medical_test": load_text("test/medical-tr-test.txt"), # 100K
    "flores_tr": load_flores("tr"),                         # parallel sentences
    "flores_en": load_flores("en"),                         # cross-lingual
}

6.2 Comparison tokenizers

import tiktoken
from tokenizers import Tokenizer
from transformers import AutoTokenizer

TOKENIZERS = {
    "turktokenizer-tr": Tokenizer.from_file("turktokenizer-tr-32k.json"),
    "llama-3": AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B"),  # gated repo: accept the license on the Hub first
    "gpt-4o": tiktoken.get_encoding("o200k_base"),  # needs a wrapper: its encode() interface differs
    "bert-tr": AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased"),
}

6.3 Run

results = {}
for tname, tok in TOKENIZERS.items():
    for cname, corpus in TEST_CORPORA.items():
        key = f"{tname}_{cname}"
        results[key] = evaluate_tokenizer(tok, corpus)

import pandas as pd
df = pd.DataFrame(results).T
df.to_csv("eval/results.csv")
print(df)
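
The loop above relies on an `evaluate_tokenizer` function from the 6.9 framework that is not reproduced in this lesson. As a rough idea of what such a function might contain, here is a sketch covering fertility and characters per token only (BPC and OOV are left out). It assumes the corpus is an iterable of text lines and uses duck typing to cope with the different `encode()` return types; `count_tokens` and `CharPairTokenizer` are names invented for this example.

```python
def count_tokens(tok, text: str) -> int:
    """Token count for tokenizers whose encode() returns either a list of ids or an object with .ids."""
    try:
        # tokenizers and transformers accept this flag; it keeps BOS/CLS out of the count
        encoded = tok.encode(text, add_special_tokens=False)
    except TypeError:
        # e.g. tiktoken encodings, whose encode() has no such argument
        encoded = tok.encode(text)
    ids = encoded.ids if hasattr(encoded, "ids") else encoded
    return len(ids)


def evaluate_tokenizer(tok, corpus_lines) -> dict:
    """Fertility (tokens/word) and compression (chars/token) over an iterable of lines."""
    tokens = words = chars = 0
    for line in corpus_lines:
        line = line.strip()
        if not line:
            continue
        tokens += count_tokens(tok, line)
        words += len(line.split())
        chars += len(line)
    return {
        "tokens": tokens,
        "fertility": tokens / words if words else float("nan"),
        "chars_per_token": chars / tokens if tokens else float("nan"),
    }


class CharPairTokenizer:
    """Hypothetical stand-in: one token per two characters of each word."""

    def encode(self, text):
        return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


print(evaluate_tokenizer(CharPairTokenizer(), ["merhaba dünya", "", "tokenizer"]))
```

Returning a flat dict per (tokenizer, corpus) pair is what lets `pd.DataFrame(results).T` produce one row per pair with one column per metric.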

6.4 Expected metrics

Tokenizer	Wiki fertility	News	Legal	Medical	FLORES TR/EN
TurkTokenizer-tr	1.22	1.20	1.65	1.85	1.40
Llama-3	1.47	1.42	1.92	2.15	1.78
GPT-4o o200k	1.43	1.40	1.85	2.05	1.75
BERT-Turkish	1.55	1.50	2.10	2.40	N/A

TurkTokenizer-tr generally has the lowest fertility (the advantage of Turkish-specific tuning).

6.5 Cost analysis

For the equivalent of 1M API calls (500 tokens/call):

Llama-3: 500 × 1M = 500M tokens (baseline)
TurkTokenizer-tr: 500 × 1M × (1.22/1.47) ≈ 415M tokens (-17%)
Savings at a GPT-4o-equivalent rate of $2.5 per 1M tokens: 85M × $2.5/1M ≈ $213 per 1M calls. At 1M calls per month that is about $2,550 per year.
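
The arithmetic in this section can be wrapped in a small function so that other call volumes and prices are easy to try. `token_savings` is a name introduced here; the underlying assumption, taken from the text above, is that token count scales with the fertility ratio.

```python
def token_savings(calls: int, tokens_per_call: int, fertility_base: float,
                  fertility_new: float, usd_per_million: float) -> dict:
    """Scale the baseline token volume by the fertility ratio and price the difference."""
    base_tokens = calls * tokens_per_call
    new_tokens = base_tokens * fertility_new / fertility_base
    saved = base_tokens - new_tokens
    return {
        "base_tokens": base_tokens,
        "new_tokens": round(new_tokens),
        "saved_pct": 100 * saved / base_tokens,
        "saved_usd": saved / 1_000_000 * usd_per_million,
    }


# 1M calls at 500 tokens each, Llama-3 (1.47) vs TurkTokenizer-tr (1.22), $2.5 per 1M tokens
result = token_savings(1_000_000, 500, 1.47, 1.22, 2.5)
print(f"{result['new_tokens'] / 1e6:.0f}M tokens, "
      f"-{result['saved_pct']:.0f}%, ${result['saved_usd']:.0f} saved")
```

Keep in mind that the saving only materializes when you serve your own model with this tokenizer; a hosted API bills by its own tokenizer regardless.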

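6.6 Downstream task proxy (no full LM training)

Vocab utilization on Turkish corpus
Subword coverage of common Turkish suffixes (-lar, -ler, -nın, -nin, -da, -de etc.)
Boundary preservation for named entities (city names, person names)

from tokenizers.pre_tokenizers import ByteLevel
turk_suffixes = ["lar", "ler", "nın", "nin", "nun", "nün", "da", "de", "dan", "den", "a", "e", "ı", "i", "u", "ü"]
# Vocab keys use the byte-level alphabet ("ı" is stored as two byte symbols), so map each suffix first
to_byte_level = ByteLevel(add_prefix_space=False)
vocab = tokenizer.get_vocab()
coverage = sum(1 for s in turk_suffixes if to_byte_level.pre_tokenize_str(s)[0][0] in vocab)
print(f"Turkish suffix coverage: {coverage}/{len(turk_suffixes)}")
# 16/16 ideal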

7. Comparison: Detailed Benchmark

7.1 Fertility heatmap (visual)

Matrix: rows=tokenizer, columns=domain. Heatmap colour: green (low fertility, good) → red (high fertility).

In production: a seaborn heatmap on top of matplotlib.

7.2 Cost projection table

Annual cost comparison for three example use cases (at $2.5 per 1M tokens):

Use Case	Avg tokens/call	Llama-3 cost/year	TurkTokenizer cost/year	Savings
Chatbot, 1M calls/year	500	$1,250	$1,038	-$212
RAG, 100K calls/day	2000	$182,500	$151,475	-$31,025
Document summarization, 10K calls/day	5000	$45,625	$37,869	-$7,756

ROI: one day of corpus and training work against roughly $8K-$31K in annual savings for the larger workloads above.

7.3 Trendyol-LLM comparison

The Trendyol-LLM tokenizer is closed-source but accessible on the HF Hub. Comparison:

Trendyol-LLM: ~1.20 fertility on Wikipedia
TurkTokenizer-tr: ~1.22 fertility on Wikipedia
The two are comparable, which makes TurkTokenizer-tr an open-source alternative.

7.4 Qualitative analysis

Example sentence segmentations:

'İstanbul Boğazı'nda balıkçı tekneleri sallanıyor.'

Llama-3: ['İstanbul', 'Boğaz', 'ı', "'", 'nda', 'balık', 'çı', 'tekne', 'leri', 'sall', 'anıyor', '.'] (12 tokens)
TurkTokenizer-tr: ['İstanbul', 'Boğazı', "'", 'nda', 'balıkçı', 'tekneleri', 'sallanıyor', '.'] (8 tokens)

The segmentation follows morpheme boundaries more naturally in TurkTokenizer-tr.

7.5 Limitations

High fertility in the code domain (a code-specific tokenizer is needed)
Weak math/LaTeX coverage
Not multilingual (TR only)
Dialects under-represented

8. Model Card: HuggingFace Hub Standard

8.1 Template (Markdown)

---
language:
  - tr
license: apache-2.0
tags:
  - tokenizer
  - turkish
  - bpe
  - byte-level
  - llm
---

# TurkTokenizer-tr (32K BPE)

## Model Description
TurkTokenizer-tr is a 32,000-vocab byte-level BPE tokenizer trained specifically for Turkish language modeling. It achieves a fertility of 1.22 tokens/word on Turkish Wikipedia, about 17% fewer tokens than the default Llama-3 tokenizer (1.47).

## Intended Use
- Turkish LLM fine-tuning
- Turkish BERT-style model training
- Cost optimization for Turkish LLM inference

## Training Data
- Turkish Wikipedia (4GB)
- OSCAR Turkish 2023 subset (5GB)
- Turkish news corpus (1GB)
- Turkish literature + code comments (1GB)
- Total: ~11GB of Turkish text before cleaning and deduplication

## Training Procedure
- Algorithm: BPE (byte-level)
- Vocab size: 32,000
- byte_fallback: True
- Max token length: 24
- Min frequency: 2
- Training time: ~26 minutes on 16-core EPYC

## Evaluation
| Metric | Value |
|---|---|
| Fertility (Wiki TR) | 1.22 |
| Fertility (News TR) | 1.20 |
| BPC (proxy LM) | 0.83 |
| OOV rate | 24% |
| Cross-lingual (TR/EN) | 1.40 |

Detailed comparison: [eval/benchmarks.csv](./eval/benchmarks.csv)

## Special Tokens
- Base: <|endoftext|>, <|pad|>
- Chat: <|im_start|>, <|im_end|>, <|user|>, <|assistant|>, <|system|>, <|tool|>
- Llama-3 compat: <|eot_id|>, <|python_tag|>
- Cleaning: <EMAIL>, <PHONE>, <URL>, <NUMBER>

## Limitations
- Code domain fertility higher (~2.8)
- Math/LaTeX coverage limited
- Monolingual (Turkish only)
- Trained on standard Istanbul Turkish; dialects are under-represented

## License
Apache 2.0. Commercial use allowed. Training data includes CC-BY-SA text (Turkish Wikipedia), attributed in the README.

## Citation
```bibtex
@software{turktokenizer_tr_2026,
  author = {Şükrü Yusuf KAYA},
  title = {TurkTokenizer-tr: A 32K BPE Tokenizer for Turkish},
  year = {2026},
  url = {https://huggingface.co/sukruyusufkaya/turktokenizer-tr-32k}
}
```

## Acknowledgments

Developed as the Capstone Project for the 'Sıfırdan Üretime: LLM Mühendisliği' course at sukruyusufkaya.com/learn/llm-muhendisligi.


8.2 Required sections (HF guidelines)

Model Description
Intended Use
Limitations
Training Data
Training Procedure
Evaluation
License
Citation

8.3 Quality criteria

Reproducible (link to training script)
Quantitative metrics (not vague claims)
Honest limitations
Clear license + attribution

9. License Choice

9.1 Common open-source licenses

License	Permissive	Commercial use	Attribution	Share-alike
MIT	✅	✅	✅	❌
Apache 2.0	✅	✅	✅	❌
BSD-3	✅	✅	✅	❌
GPL v3	❌	✅	✅	✅
CC-BY-SA	medium	✅	✅	✅
CC0	✅	✅	❌	❌

9.2 Choice for TurkTokenizer-tr

We choose Apache 2.0:

Permissive: commercial use is fine
Patent grant clause: contributors grant a patent license, which lowers the risk of patent claims
Attribution required, but no copyleft (derived works are not restricted)
A modern open-source standard
Among the most common licenses in the HuggingFace ecosystem

MIT would also have worked; Apache 2.0 offers stronger patent protection.

9.3 Why not GPL

GPL is share-alike: any model distributed with this tokenizer could be pulled under the GPL. That is restrictive for the Turkish LLM community.

9.4 Training data license compatibility

For training data that includes CC-BY-SA Wikipedia text:

CC-BY-SA makes derived works share-alike
BUT is a tokenizer model artifact (vocab + merges) a derived work? This is a legally unsettled area.
Our interpretation: the tokenizer artifact is not a creative work on its own; it is an 'extraction' (word frequencies + segmentations).
Releasing under Apache 2.0 is defensible on that reading, and comparable projects trained on Wikipedia are published the same way.

Conservative approach: Apache 2.0 plus explicit attribution to Wikipedia.

9.5 Attribution text (goes in the README)

'TurkTokenizer-tr was trained on a corpus including Turkish Wikipedia (CC-BY-SA 4.0), OSCAR (CC0), and other openly-licensed sources. The tokenizer artifact is released under Apache 2.0 license.'

10. HuggingFace Hub Workflow

10.1 Account + token

pip install huggingface_hub
huggingface-cli login
# Token: hf_xxxxx (created in your HF account settings)

10.2 Creating the repo

from huggingface_hub import HfApi
api = HfApi()
api.create_repo(
    repo_id="sukruyusufkaya/turktokenizer-tr-32k",
    repo_type="model",
    private=False,  # public release
)

10.3 Upload

from huggingface_hub import upload_folder
upload_folder(
    folder_path="./turktokenizer-tr-32k-hf",
    repo_id="sukruyusufkaya/turktokenizer-tr-32k",
    commit_message="Initial release v1.0.0",
)

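Files to upload (upload_folder only sends what is inside the folder, so copy the README, LICENSE, eval/ and the training script into it first):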

tokenizer.json (~2 MB)
tokenizer_config.json
special_tokens_map.json
README.md (model card)
LICENSE (Apache 2.0 full text)
eval/ (benchmark results)
train_turktokenizer.py (training script)

10.4 Versioning

Based on Git tags:

api.create_tag(
    repo_id="sukruyusufkaya/turktokenizer-tr-32k",
    tag="v1.0.0",
    revision="main",
)

Users then load a pinned version like this:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(
    "sukruyusufkaya/turktokenizer-tr-32k",
    revision="v1.0.0",  # specific version
)

10.5 Spaces: interactive demo (optional)

Publish a Gradio app on HuggingFace Spaces:

import gradio as gr
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sukruyusufkaya/turktokenizer-tr-32k")

def tokenize(text):
    ids = tok.encode(text)
    tokens = [tok.decode([i]) for i in ids]
    return ids, tokens, len(ids)

iface = gr.Interface(
    fn=tokenize,
    inputs="text",
    outputs=["text", "text", "number"],
    title="TurkTokenizer-tr Playground",
)
iface.launch()

10.6 Discoverability

HF tags: turkish, tokenizer, bpe, llm
A well-written README improves search ranking
Blog post (sukruyusufkaya.com) linking to the HF Hub repo
Announce on Twitter/LinkedIn
Community: r/LocalLLaMA, HF forums, Turkish AI Slack/Discord groups

11. Production Integration

11.1 Loading with transformers

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("sukruyusufkaya/turktokenizer-tr-32k")
print(tok.tokenize("Merhaba dünya"))

11.2 Inference with vLLM (custom model + tokenizer)

If your own model was trained or fine-tuned with TurkTokenizer-tr:

from vllm import LLM
llm = LLM(
    model="sukruyusufkaya/turk-llama-3-8b-instruct",
    tokenizer="sukruyusufkaya/turktokenizer-tr-32k",  # optional: defaults to the tokenizer shipped with the model
)
output = llm.generate(["Türkiye'nin başkenti nedir?"])
11.3 Compatibility with Llama-3 fine-tuning

IMPORTANT: the Llama-3 base model was pre-trained with the Llama-3 tokenizer. If you swap in TurkTokenizer-tr, the token ids no longer match the learned embeddings, so the embedding layer (and output head) must be retrained from scratch.

Two strategies:

Pre-training from scratch: tokenizer and model together. Expensive.
Continued pre-training: Llama-3 base + Turkish data + the default Llama-3 tokenizer. More practical.

TurkTokenizer-tr makes the most sense for from-scratch pre-training or for BERT-style Turkish-only models.

11.4 Inference cost monitoring

import time
start = time.perf_counter()  # monotonic, high-resolution timer
tokens = tok.encode("long text here...")
latency = time.perf_counter() - start
print(f"Latency: {latency*1000:.2f}ms, tokens: {len(tokens)}")

In production: add Prometheus metrics (tokens_encoded_total, latency_ms_histogram).

11.5 Caching

from functools import lru_cache

@lru_cache(maxsize=10000)
def cached_encode(text):
    return tuple(tok.encode(text))

Encoding is deterministic for a fixed tokenizer version, so caching the result is safe.

11.6 sukruyusufkaya.com integration

Add a TurkTokenizer-tr playground to the site's 'tools' section
Real-time tokenization web component (HF widget embed)
Turkish cost calculator (cl100k vs TurkTokenizer comparison)

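12. Maintenance Roadmap

12.1 Versioning strategy

Semver:

MAJOR: breaking changes (the vocab changes, so trained embeddings become incompatible)
MINOR: backwards-compatible additions (a new special token)
PATCH: bug fixes (a normalization regex fix)

The first release is v1.0.0. After that:

v1.0.1: cleaning pipeline bug fix (existing vocab preserved)
v1.1.0: new domain corpus added and retrained at the same vocab size. Note that retraining changes merges and token ids even when the size stays at 32K, so this counts as MINOR only if existing ids are preserved; otherwise release it as MAJOR.
v2.0.0: vocab 32K → 50K, embeddings incompatible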
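
The versioning rule can be encoded so that a release script picks the next tag mechanically. This is a sketch of one possible encoding; the `Change` categories and `next_version` are names made up for this example, mirroring the three semver levels described above.

```python
from enum import Enum


class Change(Enum):
    VOCAB_CHANGED = "vocab_changed"  # ids or merges differ: trained embeddings no longer line up
    TOKEN_ADDED = "token_added"      # new special token appended, existing ids untouched
    BUGFIX = "bugfix"                # e.g. a normalization regex fix


def next_version(current: str, change: Change) -> str:
    """Return the next semver tag for a given kind of change."""
    major, minor, patch = (int(part) for part in current.lstrip("v").split("."))
    if change is Change.VOCAB_CHANGED:
        return f"v{major + 1}.0.0"
    if change is Change.TOKEN_ADDED:
        return f"v{major}.{minor + 1}.0"
    return f"v{major}.{minor}.{patch + 1}"


print(next_version("v1.0.0", Change.BUGFIX))
print(next_version("v1.0.1", Change.TOKEN_ADDED))
print(next_version("v1.1.0", Change.VOCAB_CHANGED))
```

A practical way to decide the category automatically is to diff the old and new vocab files: if any existing token maps to a different id, the change is breaking no matter how small it looks.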

12.2 Retraining schedule

Recommendation: an annual retraining cycle.

Q1: collect feedback (HF issues, community feedback)
Q2: update the corpus (new Wikipedia dump, OSCAR refresh)
Q3: retrain + evaluate
Q4: release v(N+1).0.0 if the vocab changed, v(N).M.0 if backwards-compatible

12.3 Community feedback channels

HuggingFace Hub repo discussions (Community tab)
GitHub repo (training scripts hosted there)
Blog post comments (sukruyusufkaya.com)
Turkish AI Discord/Slack groups

12.4 Deprecation policy

v1.x.x → v2.0.0 transition: v1 is maintained for at least 1 year. The documentation includes a migration guide to v2.

12.5 Monitoring

HF download stats (weekly check)
HF model card like count
Citing papers (Google Scholar alerts)
GitHub clones / stars

12.6 Roadmap (v1.x, v2.x)

v1.1 (Q3 2026):

Augmented corpus: medical, legal sub-domains
Improved fertility metric for compound words

v1.2 (Q1 2027):

Multilingual variant (TR + EN co-training)
Code-aware mode

v2.0 (Q3 2027):

50K vocab (larger context efficiency)
Subword regularization training option
Tools/agents integration tokens

Capstone Exercises

Exercise 1

Collect your own mini-corpus (100 MB) and train an 8K-vocab BPE with the training script above. Measure fertility. Compare against Llama-3.

Exercise 2

Using TurkTokenizer-tr v1.0, build a new HF Space (Gradio): an interactive token visualizer.

Exercise 3

License decision tree: the TurkTokenizer-tr training data is Wikipedia + permissively licensed GitHub code + scraped news. Which license should the release use? Argue your case.

Exercise 4

Add a custom domain (legal) to TurkTokenizer-tr v1.0 to produce v1.1. Keep the vocab size the same. Describe a step-by-step approach.

Exercise 5

Production A/B testing: TurkTokenizer-tr vs the default Llama-3 tokenizer on 1M user queries over 2 weeks. Which metrics will you collect? How do you compute statistical significance?

Exercise 6

Model card writing: draft a Markdown model card for your own capstone project (focus on the limitations, intended use, and evaluation sections).

Exercise 7

Cost projection for a Turkish Llama-3 variant fine-tuned with TurkTokenizer-tr: for a SaaS with 100K users/day, compare annual infrastructure cost against the OpenAI API.

Exercise 8

Maintenance schedule: plan an annual retraining cycle for TurkTokenizer-tr (actionable items for Q1-Q4). How will you manage the community feedback channels?

Exercise 9

A fork of the tokenizer: someone took TurkTokenizer-tr and published medical-tr v0.1. What must they do for license compliance?

Exercise 10

How would you measure TurkTokenizer-tr's academic impact? Propose a tracking strategy for paper publications, citing models, and downstream task improvements.

🎉 Module 6 Complete: Tokenization Micro-Surgery

The TurkTokenizer-tr capstone synthesizes the nine lessons of Module 6: corpus curation, cleaning pipeline, BPE training, the 6.9 evaluation framework, model card, license decision, HuggingFace Hub publishing, production integration, and maintenance roadmap. A real-world artifact: you have learned how to release your own tokenizer to the world. Consider it your certificate as a professional open-source LLM engineer. Module 6 inventory: 10 lessons, 660 minutes (~11 hours) of deep work on tokenization. Next up: Module 7, the Embedding Layer (semantic vector space, from word2vec to LLM embeddings).

Module 6 Inventory (Complete)

#	Lesson	Duration
6.1	Character, Word, Subword: Design Pressures	55 min
6.2	The BPE Algorithm: Sennrich 2016 Line by Line	55 min
6.3	Write BPE from Scratch in 200 Lines	60 min
6.4	WordPiece (BERT): Likelihood-Based Merges	55 min
6.5	SentencePiece + Unigram LM (Kudo 2018)	60 min
6.6	GPT-2/GPT-4 Byte-Level BPE + tiktoken Regex	60 min
6.7	Special Tokens + ChatML + Chat Templates	70 min
6.8	HuggingFace Tokenizers Rust + Production	80 min
6.9	Tokenizer Evaluation: Fertility, Compression, BPC	75 min
6.10	Capstone: TurkTokenizer-tr HuggingFace Hub Publish	90 min
Total	10 lessons	660 min (~11 hours)

Overall Curriculum Progress

Modules 0-6 are done: 7 modules, 53 lessons, ~50 hours fully produced. Remaining: Modules 7-25 (Parts II-V), ~110 lessons, ~75 hours. Overall target: 25-30 modules, 158+ lessons, ~125 hours of ultra-detailed LLM engineering curriculum in Turkish.

Next Module: The Embedding Layer

In Module 7: the step from token IDs to meaning vectors. From the classic Word2Vec (Mikolov 2013) to the modern LLM embedding layer. Cosine similarity, distance metrics, dimensionality (4096 in Llama-3-8B, 8192 in Llama-3-70B). Embedding tying (input/output sharing), embedding initialization, embedding learning dynamics. A word→vector project for Turkish: comparing word2vec-tr and fasttext-tr, plus a modern Turkish semantic search demo.

Frequently Asked Questions

Load with from_pretrained: `tok = AutoTokenizer.from_pretrained('sukruyusufkaya/turktokenizer-tr-32k')`. IMPORTANT: needs a fine-tuned model with this tokenizer — can't swap into Llama-3 default tokenizer (embedding incompat). First pre-train or fine-tune your own model with this tokenizer.
