TR Benchmarking Suite: TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR + Custom

A standard suite for evaluating fine-tuned (FT) models in Turkish (TR): TR-MMLU (general knowledge, Boğaziçi), Mukayese (TR NLP tasks), TruthfulQA-TR (hallucination), and BBQ-TR (bias). Runs are automated with lm-eval-harness and wired into CI with regression alarms.

Şükrü Yusuf KAYA

26 min read

6/26/2026

Advanced

1. TR Benchmark Suite

Benchmark	Type	Size	Source
TR-MMLU	Multiple-choice (general knowledge)	14K Q	Boğaziçi (Aksoy et al.)
Mukayese	Multiple tasks (NER, POS, sentiment, NLI)	varies	TR NLP Group
TruthfulQA-TR	Hallucination probe	800 Q	Cosmos AI Lab
BBQ-TR	Bias (gender, age, ethnicity)	1.2K Q	adapted from BBQ
MT-Bench-TR	Open-ended chat (judge LLM)	80 Q	adapted from MT-Bench
Mukayese-MMLU-TR	TR-specific MMLU	1K Q	Boğaziçi
GSM8K-TR	Math reasoning TR	500 Q	translated GSM8K
Custom domain	Your own use case	varies	yours

The Cookbook's standard suite: TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR + MT-Bench-TR (judge: GPT-4o-mini).

bash

# === TR suite eval with lm-eval-harness ===
# Quote the requirement so the shell does not expand the square brackets.
pip install "lm-eval[trust_remote_code]"

lm_eval --model hf \
    --model_args "pretrained=llama-3.1-8b-tr-finetuned,dtype=bfloat16" \
    --tasks mmlu_tr,mukayese,truthfulqa_tr,bbq_tr \
    --device cuda \
    --batch_size auto \
    --output_path eval_results/

# The Cookbook's CI script
# .github/workflows/eval.yml
# - python -m lm_eval ... > eval.json
# - python compare_to_baseline.py eval.json baseline.json
# - if any task regresses by more than 2 points: fail the job and raise an alarm

lm-eval-harness TR suite + CI

2. CI Regression Alarms

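# compare_to_baseline.py
# Fails the CI job when any task's accuracy drops more than `threshold` below the baseline.
import json
import sys

TASKS = ["mmlu_tr", "mukayese", "truthfulqa_tr", "bbq_tr"]
threshold = 0.02   # tolerate drops of up to 0.02 absolute accuracy (2 percentage points)

# Paths come from the CI command line; the defaults match the file names used there.
current_path = sys.argv[1] if len(sys.argv) > 1 else "eval.json"
baseline_path = sys.argv[2] if len(sys.argv) > 2 else "baseline.json"

with open(current_path) as f:
    current = json.load(f)
with open(baseline_path) as f:
    baseline = json.load(f)

failed = False
for task in TASKS:
    delta = current[task]["acc"] - baseline[task]["acc"]
    if delta < -threshold:
        print(f"❌ {task} REGRESSION: {baseline[task]['acc']:.3f} → {current[task]['acc']:.3f} (Δ={delta:.3f})")
        failed = True
    else:
        print(f"✅ {task}: {current[task]['acc']:.3f} (Δ={delta:+.3f})")

# Exit non-zero only after every task has been reported, so the log shows all regressions.
sys.exit(1 if failed else 0)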
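compare_to_baseline.py indexes each file as `current[task]["acc"]`. Result files written by recent lm-eval-harness releases are, to the best of my knowledge, nested under a top-level `results` key with filter-suffixed metric names such as `acc,none`, so a small conversion step may be needed between the two. The sketch below is one way to do that; `flatten_results` and `write_flat` are hypothetical helper names, and the `acc,none` key layout is an assumption you should check against the JSON your lm-eval version actually writes.

```python
import json
from pathlib import Path
from typing import Dict


def flatten_results(raw: dict, metric: str = "acc") -> Dict[str, Dict[str, float]]:
    """Reduce an lm-eval style results dict to {task: {"acc": value}}."""
    flat: Dict[str, Dict[str, float]] = {}
    for task, metrics in raw.get("results", {}).items():
        # Assumption: metrics are keyed as "<metric>,<filter>", e.g. "acc,none".
        # Fall back to the bare metric name in case the file is already flat-ish.
        value = metrics.get(f"{metric},none", metrics.get(metric))
        if isinstance(value, (int, float)):
            flat[task] = {"acc": float(value)}
    return flat


def write_flat(src: str, dst: str) -> None:
    """Read a results file at `src` and write the flattened form to `dst`."""
    raw = json.loads(Path(src).read_text())
    Path(dst).write_text(json.dumps(flatten_results(raw), indent=2))
```

In CI this would run between the eval step and the comparison step, for example `write_flat("<results file under eval_results/>", "eval.json")`. Tasks that report no accuracy value are skipped rather than written as zero, so a missing task surfaces as a `KeyError` in the comparison script instead of a silent pass.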

The Cookbook's production rule:

Run this suite after every FT iteration.
If more than 2 of 3 tasks regress → release blocker.
Improvement on all 4 benchmarks is required before a production deploy.
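The rule above can be sketched as a small gate function. This is a minimal illustration, not the Cookbook's own implementation: `release_gate` is a hypothetical name, the inputs are assumed to be flat `{task: accuracy}` dicts, a "regression" is taken to be a drop larger than the same 0.02 absolute tolerance used in compare_to_baseline.py, and "improvement" is read as a strictly positive delta.

```python
from typing import Dict, List, Union

THRESHOLD = 0.02  # assumed: same absolute tolerance as compare_to_baseline.py


def release_gate(
    current: Dict[str, float],
    baseline: Dict[str, float],
    max_regressions: int = 2,
) -> Dict[str, Union[bool, List[str]]]:
    """Apply the release-blocker and deploy rules to per-task accuracies."""
    deltas = {task: current[task] - baseline[task] for task in baseline}
    # A task counts as regressed when it drops by more than THRESHOLD.
    regressed = sorted(t for t, d in deltas.items() if d < -THRESHOLD)
    return {
        "regressed": regressed,
        # Release blocker: more than `max_regressions` tasks regressed.
        "release_blocked": len(regressed) > max_regressions,
        # Deploy rule: every benchmark must improve on the baseline.
        "deploy_ok": all(d > 0 for d in deltas.values()),
    }
```

A model can pass the release check and still fail the deploy check: a run that is flat or slightly down on one benchmark is not a blocker, but it does not satisfy the "improvement on all 4 benchmarks" condition either.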

✅ Part IX complete

1) Install lm-eval-harness. 2) Run TR-MMLU + Mukayese + TruthfulQA-TR + BBQ-TR. 3) Prove your own model's TR alignment numerically. 4) You are ready for the Cookbook's Part XII (Reasoning Model FT) and Part XVI (Production Operations).
