
LLM Evaluation Benchmarks: MMLU, HELM, MT-Bench, LMSys Arena — Anatomy of Quality Measurement

LLM evaluation frameworks: MMLU (Hendrycks 2020) general knowledge, HELM (Stanford 2022) comprehensive, MT-Bench (Zheng 2023) chat, LMSys Chatbot Arena (community ELO ranking), GPQA (Rein 2023) graduate-level, HumanEval/MBPP code. Turkish benchmarks (TR-MMLU, MUKAYESE). Benchmark contamination concerns and holistic evaluation.

Şükrü Yusuf KAYA
70 min read
Advanced
📊 LLM Evaluation: the mathematics of calling a model 'good'
Is GPT-4 'good'? Is Llama-3 'bad'? Those are abstract claims. But "GPT-4 scores 86% on MMLU, Llama-3-8B scores 66%" is quantifiable. Modern LLM evaluation uses multiple benchmarks: MMLU (general knowledge), HELM (comprehensive), MT-Bench (chat), LMSys Arena (ELO), GPQA (graduate-level), HumanEval (code). For Turkish: TR-MMLU and MUKAYESE. But beware benchmark contamination: if the pre-training data contains the benchmark, scores get inflated. Modern evaluation is holistic and contamination-aware. Seventy minutes from now, you will have a grip on every major benchmark, its practical use for Turkish, and contamination defenses.

Lesson Map (10 Sections)#

  1. Why benchmarks: quantifiable quality assessment
  2. MMLU (Hendrycks 2020): 57-task multiple choice
  3. HELM (Stanford 2022): comprehensive evaluation
  4. MT-Bench (Zheng 2023): multi-turn chat quality
  5. LMSys Chatbot Arena: community ELO ranking
  6. GPQA (Rein 2023): graduate-level science
  7. HumanEval + MBPP: code generation
  8. Turkish benchmarks: TR-MMLU, MUKAYESE
  9. Benchmark contamination: test data leakage
  10. Holistic evaluation: modern best practice

2-7. Major Benchmarks#

2.1 MMLU (Hendrycks 2020)#

'Massive Multitask Language Understanding'. 57 academic subjects (math, history, law, medicine, and more). Format: multiple choice (A/B/C/D), 15K+ questions.
Scoring: accuracy (%). Random baseline: 25% (4 options).
GPT-4: 86%. Llama-3-70B: 82%. Llama-3-8B: 66%. Human experts: 90%+.
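
A minimal sketch of MMLU-style scoring, assuming a hypothetical `ask_model` callable that wraps whatever inference API you use and returns a letter choice; accuracy is just the fraction of exact letter matches.

```python
from typing import Callable

def mmlu_accuracy(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """questions: [{"prompt": "...", "choices": [...4 strings...], "answer": "C"}, ...]
    `ask_model` is a hypothetical stand-in for your inference call."""
    correct = 0
    for q in questions:
        # Standard 4-option prompt: question text followed by lettered choices.
        prompt = q["prompt"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"])
        ) + "\nAnswer:"
        if ask_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)  # random baseline with 4 options is 0.25
```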

2.2 HELM (Stanford 2022)#

'Holistic Evaluation of Language Models'. Comprehensive:
  • Many tasks: classification, QA, summarization, etc.
  • Many metrics: accuracy, calibration, robustness, fairness, bias, efficiency
  • Many models compared
Not a single number, but a multi-dimensional report.
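
HELM's output is a grid of scenarios × metrics rather than one score. A toy illustration of that shape; the scenario names and all values below are made up, not real HELM results:

```python
# Toy HELM-style report: one model, many scenarios, many metrics.
report = {
    "imdb_classification": {"accuracy": 0.94, "calibration_ece": 0.06, "robustness": 0.91},
    "natural_questions":   {"accuracy": 0.61, "calibration_ece": 0.12, "robustness": 0.55},
    "cnn_dailymail_summ":  {"rouge_2": 0.21, "fairness_gap": 0.03, "latency_s": 1.8},
}
for scenario, metrics in report.items():
    row = ", ".join(f"{name}={value}" for name, value in metrics.items())
    print(f"{scenario:22s} {row}")   # one row per scenario, never one number
```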

2.3 MT-Bench (Zheng 2023)#

'Multi-Turn Benchmark'. 80 challenging questions, 2-turn conversations. GPT-4 judges responses.
Domains: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities.
Score: 1-10 (GPT-4 as judge). GPT-4 judging itself: ~9. Llama-3-70B: ~8. GPT-3.5: ~7.
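
The judging step is plain prompting: show the judge model the exchange, ask for a rating in a fixed format, parse it out. A hedged sketch with a hypothetical `judge_model` callable; the real MT-Bench judge prompt is considerably longer, though it does use the `[[rating]]` delimiter parsed here.

```python
import re
from typing import Callable

JUDGE_TEMPLATE = (
    "Rate the assistant's answer on a scale of 1-10 for helpfulness, "
    "accuracy and depth. Reply with 'Rating: [[score]]'.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def mt_bench_score(question: str, answer: str, judge_model: Callable[[str], str]) -> float:
    """`judge_model` is a hypothetical stand-in for a GPT-4 API call."""
    verdict = judge_model(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else -1.0  # -1 = unparseable verdict
```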

2.4 LMSys Chatbot Arena#

Community-driven evaluation. Users compare two anonymous LLM responses and vote for the better one. Rankings use an ELO system (chess-style).
Leaderboard: 100+ models ranked by user preference. GPT-4o ~1320 ELO. Claude 3.5 Sonnet ~1300. Llama-3-70B-Instruct ~1220.
Often considered the most trusted real-world signal: humans evaluate, not GPT-4 judging itself.
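
Each vote is treated as a pairwise game: the winner takes rating points from the loser, scaled by how surprising the result was. A minimal classic Elo update (the Arena leaderboard has since moved to Bradley-Terry-style estimates, but the intuition is the same):

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One pairwise vote. Expected score follows the logistic curve on a 400-point scale."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)   # upsets (low expected_win) move ratings more
    return r_winner + delta, r_loser - delta

# Example: a 1220-rated model beats a 1320-rated one.
print(elo_update(1220.0, 1320.0))  # winner gains ~20 points, loser drops the same
```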

2.5 GPQA (Rein 2023)#

'Graduate-level Google-Proof Q&A'. Multiple-choice questions at PhD level in science (physics, chemistry, biology).
Unique twist: 'Google-proof', meaning the answers are not easily findable via web search. Expert PhDs: 65%. GPT-4: 39%. Claude 3 Opus: 50%. o1: 78% (a massive jump).
Reasoning models excel here; pre-reasoning-era models lag far behind human experts.

2.6 HumanEval + MBPP#

Code generation benchmarks.
  • HumanEval (Chen 2021, OpenAI): 164 Python coding problems
  • MBPP (Austin 2021, Google): 974 Python problems
Metric: pass@1 (single-attempt accuracy). GPT-4o HumanEval: 90%. Llama-3-70B: 81%. DeepSeek-Coder: 85%. o1: 92%.
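
pass@1 is simply the fraction of problems solved on a single attempt; the general pass@k is usually computed with the unbiased estimator from the HumanEval paper (Chen et al. 2021): generate n samples per problem, count the c correct ones, and estimate the chance that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al. 2021): n samples per problem, c of them correct."""
    if n - c < k:   # too few failures to fill k draws, so at least one must pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 140 correct.
print(pass_at_k(200, 140, 1))   # 0.70, matches c/n for k=1
print(pass_at_k(200, 140, 10))  # close to 1.0
```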

2.7 Turkish benchmarks#

  • TR-MMLU: Turkish translation of MMLU (Acıkgöz 2024)
  • MUKAYESE: comprehensive Turkish NLP benchmark (Safaya 2022)
    • NER, sentiment, summarization, machine translation
  • Hugging Face Open LLM Leaderboard Turkish: Turkish sub-leaderboard
Turkish LLM results (TR-MMLU):
  • GPT-4o: 75%
  • Llama-3-70B-Instruct (Turkish fine-tuned): 62%
  • Cosmos-LLaMa-Instruct (Turkish): 57%
  • Trendyol-LLM-7B-base: 43%

9-10. Contamination + Holistic#

9.1 Benchmark contamination#

Problem: the pre-training corpus contains the benchmark itself, so the model is effectively 'cheating'.
Example: GPT-4's pre-training data includes Common Crawl, and Common Crawl contains MMLU question dumps (e.g., via research papers). The model then memorizes answers rather than actually reasoning.

9.2 Detection methods#

  • Membership inference: does the model 'know' a specific question verbatim?
  • Loss difference: do benchmark questions get lower loss than similar non-benchmark text?
  • Held-out test: an eval set that was never made public
Recent evidence suggests most major models are contaminated to some degree, MMLU especially.
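
A sketch of the loss-difference probe using Hugging Face transformers: compute mean negative log-likelihood on benchmark questions versus matched control text, and treat a markedly lower benchmark loss as a contamination signal. The model choice, placeholder texts, and the notion of a 'suspiciously large' gap are all illustrative assumptions; there is no standard threshold.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_nll(texts: list[str], model, tokenizer) -> float:
    """Average per-text negative log-likelihood under a causal LM."""
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())  # mean token NLL
    return sum(losses) / len(losses)

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
benchmark_texts = ["..."]  # verbatim benchmark questions go here
control_texts = ["..."]    # similar but non-benchmark questions
gap = mean_nll(control_texts, model, tokenizer) - mean_nll(benchmark_texts, model, tokenizer)
print(f"loss gap: {gap:.3f}")  # large positive gap hints at memorization
```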

9.3 Defenses#

  • Decontamination: filter benchmark text out of the pre-training corpus (Llama-3 paper); see the n-gram sketch after this list
  • Held-out evaluation: LMSys Arena (real user queries, not a public benchmark)
  • Time-based filtering: use training data from before the benchmark's publication date
  • New benchmarks: GPQA, ARC-AGI, MMLU-Pro (rotated questions)
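
The classic decontamination filter is n-gram overlap: drop any training document that shares a long n-gram with the test set (the GPT-3 paper used 13-grams; the exact n and whitespace tokenization here are simplifying assumptions). A minimal sketch:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word-level n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(corpus: list[str], benchmark: list[str], n: int = 13) -> list[str]:
    """Drop any training document sharing an n-gram with the benchmark."""
    banned = set().union(*(ngrams(q, n) for q in benchmark))
    return [doc for doc in corpus if not (ngrams(doc, n) & banned)]
```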

9.4 Holistic evaluation 2026#

No single benchmark is trusted on its own. Compose several:
  • Academic (MMLU, GPQA): general knowledge
  • Reasoning (MATH, AIME): math problem-solving
  • Code (HumanEval, SWE-Bench): programming
  • Chat (MT-Bench, LMSys Arena): conversation quality
  • Safety (HarmBench): jailbreak resistance
  • Turkish (TR-MMLU, MUKAYESE): localized quality
Report a multi-dimensional scorecard, as in the sketch below.
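
In code, 'report a multi-dimensional scorecard' just means keeping the axes separate instead of collapsing them into one aggregate. All values below are placeholders:

```python
# Holistic scorecard: one row per axis, no single aggregate number.
scorecard = {
    "academic (MMLU)":    0.82,
    "reasoning (MATH)":   0.55,
    "code (HumanEval)":   0.81,
    "chat (Arena ELO)":   1220,
    "safety (HarmBench)": 0.93,
    "Turkish (TR-MMLU)":  0.62,
}
for axis, score in scorecard.items():
    print(f"{axis:22s} {score}")
```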

9.5 Practical guide for Turkish model selection#

For Turkish app development:
  1. TR-MMLU: top score → general capability
  2. MUKAYESE NER/sentiment: domain-specific tasks
  3. LMSys Arena multilingual: user satisfaction
  4. Hand-test: ~50 Turkish queries covering your critical use cases (a tiny harness is sketched below)
No single number suffices.
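
Step 4 is the one most teams skip, and a tiny harness makes it cheap. `query_model` is a hypothetical stand-in for your inference call; the example queries are placeholders.

```python
from typing import Callable

def hand_test(queries: list[str], query_model: Callable[[str], str]) -> None:
    """Run your critical Turkish queries and review the answers by hand."""
    for i, q in enumerate(queries, 1):
        print(f"--- {i}. {q}\n{query_model(q)}\n")

# Placeholder queries; replace with ~50 from your real use cases.
queries = [
    "Bu sözleşme maddesini sade bir Türkçeyle özetle: ...",
    "İstanbul'dan Ankara'ya kargo kaç günde ulaşır?",
]
# hand_test(queries, query_model)  # wire query_model to your model API first
```
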
🎉 Module 21 Complete: Evaluation
LLM evaluation relies on multiple benchmarks: MMLU for general knowledge, HELM for comprehensive coverage, MT-Bench for chat, LMSys Arena for community ELO, GPQA for graduate-level science, HumanEval for code. For Turkish: TR-MMLU + MUKAYESE. Contamination is a real concern, since modern models memorize benchmarks. Defenses: decontamination filtering, held-out tests, new benchmarks. Holistic evaluation is the 2026 standard. Module 21 inventory: 1 lesson, 70 min. Overall curriculum: 22 modules, 93 lessons, ~102 hours.

Module 21 Inventory (Complete)#

| # | Lesson | Duration |
| --- | --- | --- |
| 21.1 | LLM Evaluation Benchmarks | 70 min |
| Total | 1 lesson | 70 min |

Frequently Asked Questions

Q: How much should a small benchmark score difference between two models be trusted?
A: It matters, but be cautious: between contamination and measurement noise, a ~5% differential may not reflect a real difference. For practical decisions, LMSys Arena ELO plus use-case-specific tests are more reliable.

