TR Benchmarking Suite: TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR + Custom
Standard suite for evaluating FT models in TR: TR-MMLU (general knowledge, Boğaziçi), Mukayese (TR NLP tasks), TruthfulQA-TR (hallucination), BBQ-TR (bias). Automated with lm-eval-harness. CI integration, regression alarms.
Şükrü Yusuf KAYA
26 min read
Advanced
1. TR Benchmark Suite
| Benchmark | Type | Size | Source |
|---|---|---|---|
| TR-MMLU | Multiple-choice (general knowledge) | 14K Q | Boğaziçi (Aksoy et al.) |
| Mukayese | Multiple tasks (NER, POS, sentiment, NLI) | varies | TR NLP Group |
| TruthfulQA-TR | Hallucination probe | 800 Q | Cosmos AI Lab |
| BBQ-TR | Bias (gender, age, ethnicity) | 1.2K Q | adapted from BBQ |
| MT-Bench-TR | Open-ended chat (judge LLM) | 80 Q | adapted from MT-Bench |
| Mukayese-MMLU-TR | TR-specific MMLU | 1K Q | Boğaziçi |
| GSM8K-TR | Math reasoning TR | 500 Q | translated GSM8K |
| Custom domain | Your own use case | varies | yours |
The Cookbook's standard suite: TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR + MT-Bench-TR (judge: GPT-4o-mini).
```bash
# === TR suite eval with lm-eval-harness ===
pip install "lm-eval[trust_remote_code]"

lm_eval --model hf \
  --model_args "pretrained=llama-3.1-8b-tr-finetuned,dtype=bfloat16" \
  --tasks mmlu_tr,mukayese,truthfulqa_tr,bbq_tr \
  --device cuda \
  --batch_size auto \
  --output_path eval_results/

# The Cookbook's CI script
# .github/workflows/eval.yml
# - python -m lm_eval ... > eval.json
# - python compare_to_baseline.py eval.json baseline.json
# - if regression > 2%: fail the job and raise an alarm
```
lm-eval-harness TR suite + CI
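The comparison step in the CI outline reads each task's score as a plain `acc` value, while a harness report may nest its metrics differently. The following is a minimal sketch of a converter, not part of lm-eval-harness. It assumes the report keeps per-task metrics under a top-level `results` key and stores accuracy as `acc,none`; both are assumptions about the output layout, so check them against the file your harness version actually writes. `flatten_results` is a hypothetical helper name and the numbers are made up for illustration.

```python
import json


def flatten_results(raw: dict, tasks: list[str]) -> dict:
    """Reduce a harness-style report to {task: {"acc": float}}."""
    flat = {}
    for task in tasks:
        metrics = raw["results"][task]
        # Assumption: accuracy is keyed as "acc,none" (metric,filter);
        # fall back to a plain "acc" key if the report uses that instead.
        acc = metrics.get("acc,none", metrics.get("acc"))
        if acc is None:
            raise KeyError(f"no accuracy metric found for task {task!r}")
        flat[task] = {"acc": float(acc)}
    return flat


# Usage with an in-memory report (illustrative values, not real results)
sample_report = {
    "results": {
        "mmlu_tr": {"acc,none": 0.5, "acc_stderr,none": 0.01},
        "bbq_tr": {"acc": 0.75},
    }
}
flat = flatten_results(sample_report, ["mmlu_tr", "bbq_tr"])
print(json.dumps(flat, indent=2))
```

Writing the flattened dictionary to `eval.json` keeps the comparison script independent of the harness's report layout, so a harness upgrade only requires touching this one function.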
2. CI Regression Alarms
```python
# compare_to_baseline.py
import json
import sys

# Load the current run and the stored baseline
with open("eval.json") as f:
    current = json.load(f)
with open("baseline.json") as f:
    baseline = json.load(f)

threshold = 0.02  # a drop of up to 0.02 in accuracy (2 points) is tolerated

for task in ["mmlu_tr", "mukayese", "truthfulqa_tr", "bbq_tr"]:
    delta = current[task]["acc"] - baseline[task]["acc"]
    if delta < -threshold:
        print(f"❌ {task} REGRESSION: {baseline[task]['acc']:.3f} → {current[task]['acc']:.3f} (Δ={delta:.3f})")
        sys.exit(1)  # a non-zero exit code fails the CI job
    else:
        print(f"✅ {task}: {current[task]['acc']:.3f} (Δ={delta:+.3f})")
```
The Cookbook's production rules:
- Run this suite after every FT iteration
- If more than 2 of 3 tasks show a regression → release blocker
- Before a production deploy, improvement is required on all 4 benchmarks
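The rules above can be sketched as a single gate function. This is one possible reading, not the Cookbook's own implementation: `release_gate`, `regressed_tasks`, and the `max_regressed` parameter are hypothetical names, a task counts as regressed when its accuracy drops by more than the 0.02 threshold, and "improvement" is taken to mean strictly higher accuracy than the baseline. The scores are made up for illustration.

```python
def regressed_tasks(current: dict, baseline: dict, threshold: float = 0.02) -> dict:
    """Return {task: delta} for tasks whose accuracy dropped by more than threshold."""
    regressed = {}
    for task, base in baseline.items():
        delta = current[task]["acc"] - base["acc"]
        if delta < -threshold:
            regressed[task] = delta
    return regressed


def release_gate(current: dict, baseline: dict,
                 threshold: float = 0.02, max_regressed: int = 2) -> dict:
    """Apply the release-blocker and production-deploy rules to two score sets."""
    regressed = regressed_tasks(current, baseline, threshold)
    # Assumption: "improvement" means strictly better than baseline on every task.
    improved_everywhere = all(
        current[task]["acc"] > baseline[task]["acc"] for task in baseline
    )
    return {
        "regressed": regressed,
        "release_blocked": len(regressed) > max_regressed,
        "production_ready": improved_everywhere,
    }


# Illustrative scores only, not measured results
baseline = {"mmlu_tr": {"acc": 0.60}, "mukayese": {"acc": 0.70},
            "truthfulqa_tr": {"acc": 0.50}, "bbq_tr": {"acc": 0.80}}
current = {"mmlu_tr": {"acc": 0.50}, "mukayese": {"acc": 0.60},
           "truthfulqa_tr": {"acc": 0.40}, "bbq_tr": {"acc": 0.85}}

decision = release_gate(current, baseline)
print(sorted(decision["regressed"]), decision["release_blocked"], decision["production_ready"])
```

Unlike a script that exits on the first failing task, this version collects every regression before deciding, which makes the CI log show the full picture in one run.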
✅ Part IX complete
1) Install lm-eval-harness. 2) Run TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR. 3) Demonstrate your model's TR alignment with numbers. 4) You are ready for the Cookbook's Part XII (Reasoning Model FT) and Part XVI (Production Operations).