
LLM Evaluation Benchmarks: MMLU, HELM, MT-Bench, LMSys Arena — Anatomy of Quality Measurement

LLM evaluation frameworks: MMLU (Hendrycks 2020) general knowledge, HELM (Stanford 2022) comprehensive, MT-Bench (Zheng 2023) chat, LMSys Chatbot Arena (community ELO ranking), GPQA (Rein 2023) graduate-level, HumanEval/MBPP code. Turkish benchmarks (TR-MMLU, MUKAYESE). Benchmark contamination concerns and holistic evaluation.

Şükrü Yusuf KAYA
70 min read
Advanced
📊 LLM Evaluation: the mathematics of calling a model 'good'
Is GPT-4 'good'? Is Llama-3 'bad'? Those are abstract claims. But "GPT-4 scores 86% on MMLU, Llama-3-8B scores 66%" is quantifiable. Modern LLM evaluation uses multiple benchmarks: MMLU (general knowledge), HELM (comprehensive), MT-Bench (chat), LMSys Arena (ELO), GPQA (graduate-level), HumanEval (code). For Turkish: TR-MMLU and MUKAYESE. But beware benchmark contamination: if the pre-training data contains the benchmark, scores get inflated. Modern evaluation is holistic and contamination-aware. Seventy minutes from now, you will have a grip on every major benchmark, its practical use for Turkish, and contamination defenses.

Lesson Map (10 Sections)#

  1. Why benchmarks: quantifiable quality assessment
  2. MMLU (Hendrycks 2020): 57-task multiple choice
  3. HELM (Stanford 2022): comprehensive evaluation
  4. MT-Bench (Zheng 2023): multi-turn chat quality
  5. LMSys Chatbot Arena: community ELO ranking
  6. GPQA (Rein 2023): graduate-level science
  7. HumanEval + MBPP: code generation
  8. Turkish benchmarks: TR-MMLU, MUKAYESE
  9. Benchmark contamination: test data leakage
  10. Holistic evaluation: modern best practice

2-7. Major Benchmarks#

2.1 MMLU (Hendrycks 2020)#

'Massive Multitask Language Understanding'. 57 academic subjects (math, history, law, medicine, and more). Format: multiple choice (A/B/C/D), 15K+ questions.
Scoring: accuracy (%). Random baseline: 25% (4 options).
GPT-4: 86%. Llama-3-70B: 82%. Llama-3-8B: 66%. Human experts: 90%+.
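
A minimal sketch of MMLU-style scoring, assuming a hypothetical `ask_model` callable that wraps whatever inference API you use and returns a letter choice; accuracy is just the fraction of exact letter matches.

```python
from typing import Callable

def mmlu_accuracy(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """questions: [{"prompt": "...", "choices": [...4 strings...], "answer": "C"}, ...]
    `ask_model` is a hypothetical stand-in for your inference call."""
    correct = 0
    for q in questions:
        # Standard 4-option prompt: question text followed by lettered choices.
        prompt = q["prompt"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"])
        ) + "\nAnswer:"
        if ask_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)  # random baseline with 4 options is 0.25
```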

2.2 HELM (Stanford 2022)#

'Holistic Evaluation of Language Models'. Comprehensive:
  • Many tasks: classification, QA, summarization, etc.
  • Many metrics: accuracy, calibration, robustness, fairness, bias, efficiency
  • Many models compared
Not a single number, but a multi-dimensional report.
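
HELM's output is a grid of scenarios × metrics rather than one score. A toy illustration of that shape; the scenario names and all values below are made up, not real HELM results:

```python
# Toy HELM-style report: one model, many scenarios, many metrics.
report = {
    "imdb_classification": {"accuracy": 0.94, "calibration_ece": 0.06, "robustness": 0.91},
    "natural_questions":   {"accuracy": 0.61, "calibration_ece": 0.12, "robustness": 0.55},
    "cnn_dailymail_summ":  {"rouge_2": 0.21, "fairness_gap": 0.03, "latency_s": 1.8},
}
for scenario, metrics in report.items():
    row = ", ".join(f"{name}={value}" for name, value in metrics.items())
    print(f"{scenario:22s} {row}")   # one row per scenario, never one number
```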

2.3 MT-Bench (Zheng 2023)#

'Multi-Turn Benchmark'. 80 challenging questions, 2-turn conversations. GPT-4 judges responses.
Domains: writing, roleplay, reasoning, math, coding, extraction, STEM, humanities.
Score: 1-10 (GPT-4 as judge). GPT-4 judging itself: ~9. Llama-3-70B: ~8. GPT-3.5: ~7.
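
The judging step is plain prompting: show the judge model the exchange, ask for a rating in a fixed format, parse it out. A hedged sketch with a hypothetical `judge_model` callable; the real MT-Bench judge prompt is considerably longer, though it does use the `[[rating]]` delimiter parsed here.

```python
import re
from typing import Callable

JUDGE_TEMPLATE = (
    "Rate the assistant's answer on a scale of 1-10 for helpfulness, "
    "accuracy and depth. Reply with 'Rating: [[score]]'.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def mt_bench_score(question: str, answer: str, judge_model: Callable[[str], str]) -> float:
    """`judge_model` is a hypothetical stand-in for a GPT-4 API call."""
    verdict = judge_model(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else -1.0  # -1 = unparseable verdict
```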

2.4 LMSys Chatbot Arena#

Community-driven evaluation. Users compare two anonymous LLM responses and vote for the better one. Rankings use an ELO system (chess-style).
Leaderboard: 100+ models ranked by user preference. GPT-4o ~1320 ELO. Claude 3.5 Sonnet ~1300. Llama-3-70B-Instruct ~1220.
Often considered the most trusted real-world signal: humans evaluate, not GPT-4 judging itself.
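
Each vote is treated as a pairwise game: the winner takes rating points from the loser, scaled by how surprising the result was. A minimal classic Elo update (the Arena leaderboard has since moved to Bradley-Terry-style estimates, but the intuition is the same):

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One pairwise vote. Expected score follows the logistic curve on a 400-point scale."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)   # upsets (low expected_win) move ratings more
    return r_winner + delta, r_loser - delta

# Example: a 1220-rated model beats a 1320-rated one.
print(elo_update(1220.0, 1320.0))  # winner gains ~20 points, loser drops the same
```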

2.5 GPQA (Rein 2023)#

'Graduate-level Google-Proof Q&A'. Multiple-choice questions at PhD level in science (physics, chemistry, biology).
Unique twist: 'Google-proof', meaning the answers are not easily findable via web search. Expert PhDs: 65%. GPT-4: 39%. Claude 3 Opus: 50%. o1: 78% (a massive jump).
Reasoning models excel here; pre-reasoning-era models lag far behind human experts.

2.6 HumanEval + MBPP#

Code generation benchmarks.
  • HumanEval (Chen 2021, OpenAI): 164 Python coding problems
  • MBPP (Austin 2021, Google): 974 Python problems
Metric: pass@1 (single-attempt accuracy). GPT-4o HumanEval: 90%. Llama-3-70B: 81%. DeepSeek-Coder: 85%. o1: 92%.
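
pass@1 is simply the fraction of problems solved on a single attempt; the general pass@k is usually computed with the unbiased estimator from the HumanEval paper (Chen et al. 2021): generate n samples per problem, count the c correct ones, and estimate the chance that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al. 2021): n samples per problem, c of them correct."""
    if n - c < k:   # too few failures to fill k draws, so at least one must pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 140 correct.
print(pass_at_k(200, 140, 1))   # 0.70, matches c/n for k=1
print(pass_at_k(200, 140, 10))  # close to 1.0
```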

2.7 Turkish benchmarks#

  • TR-MMLU: Turkish translation of MMLU (Acıkgöz 2024)
  • MUKAYESE: comprehensive Turkish NLP benchmark (Safaya 2022)
    • NER, sentiment, summarization, machine translation
  • Hugging Face Open LLM Leaderboard Turkish: Turkish sub-leaderboard
Turkish LLM results (TR-MMLU):
  • GPT-4o: 75%
  • Llama-3-70B-Instruct (Turkish fine-tuned): 62%
  • Cosmos-LLaMa-Instruct (Turkish): 57%
  • Trendyol-LLM-7B-base: 43%

9-10. Contamination + Holistic#

9.1 Benchmark contamination#

Problem: the pre-training corpus contains the benchmark itself, so the model is effectively 'cheating'.
Example: GPT-4's pre-training data includes Common Crawl, and Common Crawl contains MMLU question dumps (e.g., via research papers). The model then memorizes answers rather than actually reasoning.

9.2 Detection methods#

  • Membership inference: does the model 'know' a specific question verbatim?
  • Loss difference: do benchmark questions get lower loss than similar non-benchmark text?
  • Held-out test: an eval set that was never made public
Recent evidence suggests most major models are contaminated to some degree, MMLU especially.
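
A sketch of the loss-difference probe using Hugging Face transformers: compute mean negative log-likelihood on benchmark questions versus matched control text, and treat a markedly lower benchmark loss as a contamination signal. The model choice, placeholder texts, and the notion of a 'suspiciously large' gap are all illustrative assumptions; there is no standard threshold.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_nll(texts: list[str], model, tokenizer) -> float:
    """Average per-text negative log-likelihood under a causal LM."""
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())  # mean token NLL
    return sum(losses) / len(losses)

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
benchmark_texts = ["..."]  # verbatim benchmark questions go here
control_texts = ["..."]    # similar but non-benchmark questions
gap = mean_nll(control_texts, model, tokenizer) - mean_nll(benchmark_texts, model, tokenizer)
print(f"loss gap: {gap:.3f}")  # large positive gap hints at memorization
```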

9.3 Defenses#

  • Decontamination: filter benchmark text out of the pre-training corpus (Llama-3 paper); see the n-gram sketch after this list
  • Held-out evaluation: LMSys Arena (real user queries, not a public benchmark)
  • Time-based filtering: use training data from before the benchmark's publication date
  • New benchmarks: GPQA, ARC-AGI, MMLU-Pro (rotated questions)
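
The classic decontamination filter is n-gram overlap: drop any training document that shares a long n-gram with the test set (the GPT-3 paper used 13-grams; the exact n and whitespace tokenization here are simplifying assumptions). A minimal sketch:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word-level n-grams of a text, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(corpus: list[str], benchmark: list[str], n: int = 13) -> list[str]:
    """Drop any training document sharing an n-gram with the benchmark."""
    banned = set().union(*(ngrams(q, n) for q in benchmark))
    return [doc for doc in corpus if not (ngrams(doc, n) & banned)]
```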

9.4 Holistic evaluation 2026#

No single benchmark is trusted on its own. Compose several:
  • Academic (MMLU, GPQA): general knowledge
  • Reasoning (MATH, AIME): math problem-solving
  • Code (HumanEval, SWE-Bench): programming
  • Chat (MT-Bench, LMSys Arena): conversation quality
  • Safety (HarmBench): jailbreak resistance
  • Turkish (TR-MMLU, MUKAYESE): localized quality
Report a multi-dimensional scorecard, as in the sketch below.
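
In code, 'report a multi-dimensional scorecard' just means keeping the axes separate instead of collapsing them into one aggregate. All values below are placeholders:

```python
# Holistic scorecard: one row per axis, no single aggregate number.
scorecard = {
    "academic (MMLU)":    0.82,
    "reasoning (MATH)":   0.55,
    "code (HumanEval)":   0.81,
    "chat (Arena ELO)":   1220,
    "safety (HarmBench)": 0.93,
    "Turkish (TR-MMLU)":  0.62,
}
for axis, score in scorecard.items():
    print(f"{axis:22s} {score}")
```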

9.5 Practical guide for Turkish model selection#

For Turkish app development:
  1. TR-MMLU: top score → general capability
  2. MUKAYESE NER/sentiment: domain-specific tasks
  3. LMSys Arena multilingual: user satisfaction
  4. Hand-test: ~50 Turkish queries covering your critical use cases (a tiny harness is sketched below)
No single number suffices.
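
Step 4 is the one most teams skip, and a tiny harness makes it cheap. `query_model` is a hypothetical stand-in for your inference call; the example queries are placeholders.

```python
from typing import Callable

def hand_test(queries: list[str], query_model: Callable[[str], str]) -> None:
    """Run your critical Turkish queries and review the answers by hand."""
    for i, q in enumerate(queries, 1):
        print(f"--- {i}. {q}\n{query_model(q)}\n")

# Placeholder queries; replace with ~50 from your real use cases.
queries = [
    "Bu sözleşme maddesini sade bir Türkçeyle özetle: ...",
    "İstanbul'dan Ankara'ya kargo kaç günde ulaşır?",
]
# hand_test(queries, query_model)  # wire query_model to your model API first
```
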
🎉 Module 21 Complete: Evaluation
LLM evaluation relies on multiple benchmarks: MMLU for general knowledge, HELM for comprehensive coverage, MT-Bench for chat, LMSys Arena for community ELO, GPQA for graduate-level science, HumanEval for code. For Turkish: TR-MMLU + MUKAYESE. Contamination is a real concern, since modern models memorize benchmarks. Defenses: decontamination filtering, held-out tests, new benchmarks. Holistic evaluation is the 2026 standard. Module 21 inventory: 1 lesson, 70 min. Overall curriculum: 22 modules, 93 lessons, ~102 hours.

Module 21 Inventory (Complete)#

| # | Lesson | Duration |
| --- | --- | --- |
| 21.1 | LLM Evaluation Benchmarks | 70 min |
| Total | 1 lesson | 70 min |

Frequently Asked Questions

Q: How much should a small benchmark score difference between two models be trusted?
A: It matters, but be cautious: between contamination and measurement noise, a ~5% differential may not reflect a real difference. For practical decisions, LMSys Arena ELO plus use-case-specific tests are more reliable.

