LLM Evaluation Benchmarks: MMLU, HELM, MT-Bench, LMSys Arena — Anatomy of Quality Measurement
LLM evaluation frameworks: MMLU (Hendrycks 2020) general knowledge, HELM (Stanford 2022) comprehensive, MT-Bench (Zheng 2023) chat, LMSys Chatbot Arena (community ELO ranking), GPQA (Rein 2023) graduate-level, HumanEval/MBPP code. Turkish benchmarks (TR-MMLU, MUKAYESE). Benchmark contamination concern, holistic evaluation.
Şükrü Yusuf KAYA
70 min read
Advanced 📊 LLM Evaluation — the mathematics of calling a model 'good'
Is GPT-4 'good'? Is Llama-3 'bad'? Too abstract. But: GPT-4 scores 86% on MMLU, Llama-3-8B scores 66%. That is quantifiable. Modern LLM evaluation rests on multiple benchmarks: MMLU (general knowledge), HELM (comprehensive), MT-Bench (chat), LMSys Arena (Elo), GPQA (graduate-level), HumanEval (code). For Turkish, TR-MMLU and MUKAYESE. The catch is benchmark contamination: if the pre-training data contains the benchmark, the score is inflated. Modern evaluation is therefore holistic and contamination-aware. Seventy minutes from now you will understand every major benchmark, how to apply them in practice for Turkish, and how to defend against contamination.
Lesson Map (10 Sections)#
- Why benchmarks — quantifiable quality assessment
- MMLU (Hendrycks 2020) — 57 task multi-choice
- HELM (Stanford 2022) — comprehensive evaluation
- MT-Bench (Zheng 2023) — multi-turn chat quality
- LMSys Chatbot Arena — community ELO ranking
- GPQA (Rein 2023) — graduate-level science
- HumanEval + MBPP — code generation
- Turkish benchmarks — TR-MMLU, MUKAYESE
- Benchmark contamination — test data leakage
- Holistic evaluation — modern best practice
2-7. Major Benchmarks#
2.1 MMLU (Hendrycks 2020)#
'Massive Multitask Language Understanding'. 57 academic subjects (math, history, law, medicine).
Format: multiple choice (A/B/C/D), 15K+ questions.
Scoring: accuracy (%). Random-guess baseline: 25% (4 options).
GPT-4: 86%. Llama-3-70B: 82%. Llama-3-8B: 66%. Human experts: 90%+.
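A minimal sketch of how MMLU-style scoring works: compare the model's predicted letter against the gold answer and average per subject. The `ask_model` helper is a hypothetical placeholder for whatever inference call you actually use.

```python
# Minimal sketch of MMLU-style scoring: accuracy over 4-option multiple choice.
from collections import defaultdict

def ask_model(question: str, choices: list[str]) -> str:
    """Placeholder: return the model's chosen letter, e.g. 'A'."""
    raise NotImplementedError

def mmlu_accuracy(items: list[dict]) -> dict:
    """items: [{'subject': 'law', 'question': ..., 'choices': [...], 'answer': 'C'}, ...]"""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for item in items:
        pred = ask_model(item["question"], item["choices"])
        per_subject[item["subject"]][0] += int(pred == item["answer"])
        per_subject[item["subject"]][1] += 1
    # MMLU reports per-subject accuracy plus an overall average of these
    return {s: c / t for s, (c, t) in per_subject.items()}
```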
2.2 HELM (Stanford 2022)#
'Holistic Evaluation of Language Models'. Comprehensive:
- Many tasks: classification, QA, summarization, etc.
- Many metrics: accuracy, calibration, robustness, fairness, bias, efficiency
- Many models compared
Not a single number, but a multi-dimensional report.
2.3 MT-Bench (Zheng 2023)#
'Multi-Turn Benchmark'. 80 challenging questions, each a 2-turn conversation; GPT-4 acts as the judge of the responses.
Domains: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities.
Score: 1-10, assigned by the GPT-4 judge. GPT-4 (judging itself): ~9. Llama-3-70B: ~8. GPT-3.5: ~7.
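A rough sketch of the LLM-as-judge idea behind MT-Bench: a strong judge model is prompted to grade an answer from 1 to 10, and the rating is parsed from its verdict. `judge_llm` and the prompt wording are illustrative assumptions; the official MT-Bench prompts and parsing logic are more elaborate.

```python
# LLM-as-judge sketch: a strong model grades a candidate answer on a 1-10 scale.
import re

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user question on a scale of 1 to 10 and finish with "Rating: [[N]]".

[Question]
{question}

[Assistant's Answer]
{answer}"""

def judge_llm(prompt: str) -> str:
    """Placeholder for a call to a strong judge model (e.g. GPT-4)."""
    raise NotImplementedError

def mt_bench_score(question: str, answer: str) -> int:
    verdict = judge_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", verdict)
    return int(match.group(1)) if match else -1  # -1 = unparseable verdict
```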
2.4 LMSys Chatbot Arena#
Community-driven evaluation. Users compare two anonymous LLM responses and vote for the better one.
Elo rating system (chess-style).
Leaderboard: 100+ models ranked by user preference.
GPT-4o ~1320 Elo. Claude 3.5 Sonnet ~1300. Llama-3-70B-Instruct ~1220.
The most trusted real-world signal: real humans do the judging, not GPT-4 evaluating itself.
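For intuition, here is a sketch of a chess-style Elo update applied to a single pairwise vote. The K-factor and starting ratings are illustrative defaults; the actual Arena leaderboard uses its own fitting procedure (more recently a Bradley-Terry-style model), so treat this as the concept, not their pipeline.

```python
# Chess-style Elo update for one arena-style pairwise vote.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    r_a_new = r_a + k * (s_a - e_a)
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# One vote: the user prefers model A (1200) over model B (1250)
print(elo_update(1200, 1250, a_won=True))  # A gains rating points, B loses them
```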
2.5 GPQA (Rein 2023)#
'Graduate-level Google-Proof Q&A'. Multiple-choice questions at PhD level in science (physics, chemistry, biology).
Unique twist: 'Google-proof' means the answers are not easily findable via web search.
Expert PhDs: 65%. GPT-4: 39%. Claude 3 Opus: 50%. o1: 78% (a massive jump).
Reasoning models excel here; pre-reasoning-era models lag far behind human experts.
2.6 HumanEval + MBPP#
Code generation benchmarks.
- HumanEval (Chen 2021, OpenAI): 164 Python coding problems
- MBPP (Austin 2021, Google): 974 Python problems
Metric: pass@1 (the fraction of problems solved by a single generated attempt); see the estimator sketch below.
GPT-4o HumanEval: 90%. Llama-3-70B: 81%. DeepSeek-Coder: 85%. o1: 92%.
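pass@k is usually computed with the unbiased estimator from the HumanEval paper (Chen et al. 2021): sample n solutions per problem, count the c that pass the unit tests, and estimate 1 - C(n-c, k)/C(n, k). A short sketch:

```python
# Unbiased pass@k estimator (Chen et al. 2021), computed without enumerating subsets.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated pass@k for one problem with n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 3 of them pass the unit tests
print(pass_at_k(n=20, c=3, k=1))   # ≈ 0.15  (this is what pass@1 reports)
print(pass_at_k(n=20, c=3, k=10))  # much higher when 10 attempts are allowed
```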
2.7 Turkish benchmarks#
- TR-MMLU: Turkish translation of MMLU (Acıkgöz 2024)
- MUKAYESE: comprehensive Turkish NLP benchmark (Safaya 2022)
- NER, sentiment, summarization, machine translation
- HuggingFace Open LLM Leaderboard Turkish: Turkish sub-leaderboard
Turkish LLM results (TR-MMLU):
- GPT-4o: 75%
- Llama-3-70B-Instruct (Turkish fine-tuned): 62%
- Cosmos-LLaMa-Instruct (Turkish): 57%
- Trendyol-LLM-7B-base: 43%
9-10. Contamination + Holistic#
9.1 Benchmark contamination#
Problem: the pre-training corpus contains the benchmark, so the model is effectively 'cheating'.
Example: GPT-4's pre-training data includes Common Crawl, and Common Crawl contains dumps of MMLU questions (e.g. inside research papers). The model may then be recalling memorized answers rather than actually reasoning.
9.2 Detection methods#
- Membership inference: does the model 'know' a specific question verbatim?
- Loss difference: do benchmark questions get noticeably lower loss than comparable non-benchmark text? (A rough probe is sketched below.)
- Held-out test: evaluate on a never-published test set.
Recent evidence suggests most major models are contaminated to some degree, MMLU especially.
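A crude version of the loss-difference probe, assuming a Hugging Face causal LM: if verbatim benchmark questions get markedly lower average loss than comparable control text, memorization is likely. The model name, control set, and how large a gap counts as suspicious are assumptions; real detection pipelines are considerably more careful.

```python
# Loss-difference contamination probe: compare mean per-token loss on verbatim
# benchmark questions vs. held-out / paraphrased control text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # any causal LM checkpoint (assumption)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def mean_loss(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()  # mean cross-entropy per token

def contamination_signal(benchmark_texts, control_texts) -> float:
    bench = sum(map(mean_loss, benchmark_texts)) / len(benchmark_texts)
    ctrl = sum(map(mean_loss, control_texts)) / len(control_texts)
    return ctrl - bench  # a large positive gap suggests memorization
```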
9.3 Defenses#
- Decontamination: filter benchmark text out of the pre-training corpus, as reported in the Llama 3 paper (see the n-gram sketch after this list)
- Held-out evaluation: LMSys Arena (real user queries, not a public test set)
- Time-based filtering: train only on data published before the benchmark's release date
- New benchmarks: GPQA, ARC-AGI, MMLU-Pro (refreshed, harder question sets)
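A minimal sketch of n-gram decontamination, roughly in the spirit of published decontamination setups: drop (or flag) any training document that shares a long n-gram with a benchmark question. The window size n=13 and whitespace tokenization are illustrative choices, not an official recipe.

```python
# n-gram decontamination sketch: flag training docs that overlap benchmark questions.
def ngrams(text: str, n: int = 13) -> set[tuple]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_benchmark_index(benchmark_questions: list[str], n: int = 13) -> set:
    index = set()
    for q in benchmark_questions:
        index |= ngrams(q, n)
    return index

def is_contaminated(doc: str, benchmark_index: set, n: int = 13) -> bool:
    # Any shared 13-gram with a benchmark question marks the document for removal
    return not ngrams(doc, n).isdisjoint(benchmark_index)
```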
9.4 Holistic evaluation in 2026#
No single benchmark is trusted on its own. Compose several:
- Academic (MMLU, GPQA): general knowledge
- Reasoning (MATH, AIME): math problem-solving
- Code (HumanEval, SWE-Bench): programming
- Chat (MT-Bench, LMSys Arena): conversation quality
- Safety (HarmBench): jailbreak resistance
- Turkish (TR-MMLU, MUKAYESE): localized
Report a multi-dimensional scorecard; a minimal sketch follows.
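One way to keep the report multi-dimensional in practice: store each axis separately and check per-axis minimums instead of collapsing everything into one average. The axis names, scores, and thresholds below are placeholders.

```python
# Multi-axis scorecard: report every dimension and enforce per-axis minimums.
def meets_requirements(scores: dict[str, float], minimums: dict[str, float]) -> bool:
    """A model 'passes' only if every axis clears its own bar."""
    return all(scores.get(axis, 0.0) >= bar for axis, bar in minimums.items())

def scorecard(model: str, scores: dict[str, float]) -> str:
    rows = [f"  {axis:<12} {score:>6.1f}" for axis, score in scores.items()]
    return "\n".join([f"Scorecard: {model}", *rows])

scores = {"MMLU": 66.0, "GPQA": 31.0, "HumanEval": 62.0,   # placeholder values
          "MT-Bench": 7.8, "TR-MMLU": 57.0}
minimums = {"TR-MMLU": 55.0, "MT-Bench": 7.0}              # application-specific bars
print(scorecard("candidate-llm", scores))
print("meets requirements:", meets_requirements(scores, minimums))
```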
9.5 Practical guidance for Turkish model selection#
For Turkish application development:
- TR-MMLU: the top score signals general capability
- MUKAYESE NER/sentiment: domain-specific tasks
- LMSys Arena multilingual leaderboard: user satisfaction
- Hand-test: ~50 critical Turkish queries drawn from your own use cases (see the harness sketch below)
No single number suffices.
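A minimal harness for that hand-test step, assuming a hypothetical `generate` call for the candidate model: run the critical queries, dump the answers to a CSV, and leave a verdict column for a human reviewer to fill in.

```python
# Hand-test harness: run critical queries and collect answers for manual review.
import csv

def generate(prompt: str) -> str:
    """Placeholder: call the candidate model and return its answer."""
    raise NotImplementedError

def run_hand_test(queries: list[str], out_path: str = "hand_test_results.csv"):
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "answer", "verdict"])  # verdict filled in by a human
        for q in queries:
            writer.writerow([q, generate(q), ""])

# Example use: run_hand_test(critical_queries), where critical_queries holds
# ~50 prompts drawn from your real use cases.
```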
🎉 Module 21 Complete — Evaluation
LLM evaluation relies on multiple benchmarks: MMLU for general knowledge, HELM for comprehensive multi-metric coverage, MT-Bench for chat, LMSys Arena for community Elo, GPQA for graduate-level science, HumanEval for code. For Turkish: TR-MMLU and MUKAYESE. Contamination is a real concern, since modern models memorize benchmarks; defenses include decontamination filtering, held-out tests, and new benchmarks. Holistic evaluation is the 2026 standard. Module 21 inventory: 1 lesson, 70 min. Overall curriculum: 22 modules, 93 lessons, ~102 hours.
Module 21 Inventory (Complete)#
| # | Lesson | Duration |
|---|---|---|
| 21.1 | LLM Evaluation Benchmarks | 70 min |
| Total | 1 lesson | 70 min |
Frequently Asked Questions
Q: How much should I trust a ~5% score gap between two models on the same benchmark?
A: It matters, but be cautious: with contamination and measurement noise, a ~5% differential may not reflect a real capability difference. For practical decisions, LMSys Arena Elo plus use-case-specific tests are more reliable.