TR Benchmarking Suite: TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR + Custom

A standard suite for evaluating fine-tuned (FT) models in Turkish (TR): TR-MMLU (general knowledge, Boğaziçi), Mukayese (TR NLP tasks), TruthfulQA-TR (hallucination), and BBQ-TR (bias). Runs are automated with lm-eval-harness and wired into CI with regression alarms.

Şükrü Yusuf KAYA
26 min read
Advanced
1. TR Benchmark Suite

| Benchmark | Type | Size | Source |
|---|---|---|---|
| TR-MMLU | Multiple-choice (general knowledge) | 14K questions | Boğaziçi (Aksoy et al.) |
| Mukayese | Multiple tasks (NER, POS, sentiment, NLI) | varies | TR NLP Group |
| TruthfulQA-TR | Hallucination probe | 800 questions | Cosmos AI Lab |
| BBQ-TR | Bias (gender, age, ethnicity) | 1.2K questions | adapted from BBQ |
| MT-Bench-TR | Open-ended chat (judge LLM) | 80 questions | adapted from MT-Bench |
| Mukayese-MMLU-TR | Turkish-specific MMLU | 1K questions | Boğaziçi |
| GSM8K-TR | Math reasoning in Turkish | 500 questions | translated GSM8K |
| Custom domain | Your own use case | varies | your own data |

The Cookbook's standard suite: TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR + MT-Bench-TR (judge: GPT-4o-mini).
```bash
# === Evaluate the TR suite with lm-eval-harness ===
# Quote the requirement so shells such as zsh do not expand the brackets.
pip install "lm-eval[trust_remote_code]"

lm_eval --model hf \
  --model_args "pretrained=llama-3.1-8b-tr-finetuned,dtype=bfloat16" \
  --tasks mmlu_tr,mukayese,truthfulqa_tr,bbq_tr \
  --device cuda \
  --batch_size auto \
  --output_path eval_results/

# The Cookbook's CI script
# .github/workflows/eval.yml
# - python -m lm_eval ... > eval.json
# - python compare_to_baseline.py eval.json baseline.json
# - if regression > 2%: fail the job and raise an alarm
```
lm-eval-harness TR suite + CI
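
The comparison script in section 2 reads `current[task]["acc"]`, so `eval.json` has to be a flat mapping from task name to accuracy. A minimal sketch of producing that shape is given here, assuming the harness results file nests per-task metrics under a top-level `"results"` key and may suffix metric names with a filter (for example `"acc,none"`). That layout is an assumption to check against your installed version, and `flatten_results` and `write_eval_json` are hypothetical helper names, not part of lm-eval-harness.

```python
import json
from pathlib import Path


def flatten_results(raw: dict, metric: str = "acc") -> dict:
    """Reduce a harness-style results dict to {task: {metric: value}}.

    Assumed input layout: {"results": {task: {"acc,none": 0.61, ...}}}.
    """
    flat = {}
    for task, metrics in raw.get("results", {}).items():
        for key, value in metrics.items():
            # Accept both "acc" and filter-suffixed keys such as "acc,none";
            # "acc_stderr,none" is skipped because its base name differs.
            if key.split(",")[0] == metric and isinstance(value, (int, float)):
                flat[task] = {metric: float(value)}
                break
    return flat


def write_eval_json(results_file: str, out_file: str = "eval.json") -> dict:
    """Read a results file, flatten it, and write the flat mapping to out_file."""
    raw = json.loads(Path(results_file).read_text(encoding="utf-8"))
    flat = flatten_results(raw)
    Path(out_file).write_text(json.dumps(flat, indent=2), encoding="utf-8")
    return flat
```

Tasks that report no matching metric are simply left out of the result, so a missing key in the comparison step signals that the task did not produce an accuracy value.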

2. CI Regression Alarms

```python
# compare_to_baseline.py
import json
import sys

# Both files are expected to map task name -> {"acc": float}.
with open("eval.json") as f:
    current = json.load(f)
with open("baseline.json") as f:
    baseline = json.load(f)

threshold = 0.02  # tolerate a drop of up to 0.02 in accuracy (2 points)

for task in ["mmlu_tr", "mukayese", "truthfulqa_tr", "bbq_tr"]:
    delta = current[task]["acc"] - baseline[task]["acc"]
    if delta < -threshold:
        print(f"❌ {task} REGRESSION: {baseline[task]['acc']:.3f} → {current[task]['acc']:.3f} (Δ={delta:.3f})")
        sys.exit(1)  # non-zero exit code fails the CI job
    else:
        print(f"✅ {task}: {current[task]['acc']:.3f} (Δ={delta:+.3f})")
```
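
The script above stops at the first regressed task, so a CI log shows only one failure per run. As a possible variation, the sketch below collects every regression in one pass before deciding the exit code. It assumes the same flat `{task: {"acc": float}}` layout, and the function names `find_regressions` and `exit_code` are illustrative, not from the Cookbook.

```python
def find_regressions(current, baseline, tasks, threshold=0.02):
    """Return {task: delta} for every task whose accuracy dropped by more than threshold."""
    regressions = {}
    for task in tasks:
        delta = current[task]["acc"] - baseline[task]["acc"]
        if delta < -threshold:
            regressions[task] = delta
    return regressions


def exit_code(regressions):
    """Map the regression report to a process exit code for CI (0 = pass, 1 = fail)."""
    return 1 if regressions else 0
```

A caller would print each entry of the returned dict and then pass `exit_code(...)` to `sys.exit`, so the log lists all regressed tasks at once.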
The Cookbook's production rule:
  • Run this suite after every FT iteration
  • If more than 2 of 3 tasks regress → release blocker
  • Improvement on all 4 benchmarks is required before a production deploy
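
The rules above can be sketched as a small gate function. This is one reading of the rule, not the Cookbook's own implementation: it treats a task as regressed when accuracy drops by more than the 0.02 threshold from section 2, exposes the "more than 2" count as the `max_regressed` parameter, and interprets "improvement" as a strictly positive delta. `GateDecision` and `evaluate_gate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    regressed: list          # tasks that dropped by more than the threshold
    release_blocked: bool    # True when too many tasks regressed
    production_ready: bool   # True when every benchmark improved


def evaluate_gate(current, baseline, threshold=0.02, max_regressed=2):
    """Apply the release and production rules to {task: accuracy} mappings."""
    deltas = {task: current[task] - baseline[task] for task in baseline}
    regressed = sorted(t for t, d in deltas.items() if d < -threshold)
    return GateDecision(
        regressed=regressed,
        # Assumption: "more than 2" means strictly more than max_regressed tasks.
        release_blocked=len(regressed) > max_regressed,
        # Assumption: "improvement" means a strictly positive delta on every task.
        production_ready=all(d > 0 for d in deltas.values()),
    )
```

Keeping both thresholds as parameters makes it easy to tighten the gate, for example setting `max_regressed=0` to block a release on any regression.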
✅ Part IX complete
  1. Install lm-eval-harness.
  2. Run TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR.
  3. Demonstrate your model's TR alignment numerically.
  4. You are now ready for the Cookbook's Part XII (Reasoning Model FT) and Part XVI (Production Operations).
