Reasoning Eval: AIME 2024/2025 + MATH-500 + GPQA-Diamond + LiveCodeBench
The standard eval suite for reasoning models: AIME 2024 (30 problems from the American Invitational Mathematics Examination), AIME 2025 (the newer 30-problem set), MATH-500 (500 high-school competition problems), GPQA-Diamond (graduate-level science Q&A), and LiveCodeBench (monthly-refreshed coding problems). Covers the gap between pass@1 and majority voting (majority@64), plus the cookbook's standard eval pipeline.
Şükrü Yusuf KAYA
26 min read
Advanced
1. Reasoning Benchmark Table (early 2026)
| Benchmark | Size | Domain | Pass@1 (cookbook 8B baseline) | Pass@1 R1-Distill-8B | Pass@1 R1 671B |
|---|---|---|---|---|---|
| AIME 2024 | 30 | competition math | 5.6 | 28.5 | 79.8 |
| AIME 2025 | 30 | competition math | 4.2 | 24.3 | 76.5 |
| MATH-500 | 500 | high-school math | 47.2 | 78.1 | 97.3 |
| GPQA-Diamond | 198 | grad science | 25.7 | 36.4 | 71.5 |
| LiveCodeBench v5 | 400+ | recent coding | 18.5 | 36.5 | 65.9 |
Pass@1 vs Majority@64:
- Pass@1: greedy decoding, a single answer per problem
- Majority@64: 64 samples per problem, the most frequent answer wins (test-time compute)
R1 671B on AIME 2024: pass@1 = 79.8, majority@64 = 86.7 (about +7 points).
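The two metrics above can be compared with a few lines of Python. This is a minimal sketch, not the cookbook's pipeline: the function names, the `normalize` helper, and the toy answers are all hypothetical, and real graders use much stricter answer matching. It treats the first stored answer of each problem as the single-answer decode.

```python
from collections import Counter

def normalize(answer: str) -> str:
    # Hypothetical normalization: trim whitespace and leading zeros ("073" -> "73").
    return answer.strip().lstrip("0") or "0"

def pass_at_1(samples_per_problem, references):
    """Percent of problems whose first answer matches the reference."""
    hits = sum(
        normalize(samples[0]) == normalize(ref)
        for samples, ref in zip(samples_per_problem, references)
    )
    return 100.0 * hits / len(references)

def majority_at_k(samples_per_problem, references, k):
    """Percent of problems whose most frequent answer among the first k samples is correct."""
    hits = 0
    for samples, ref in zip(samples_per_problem, references):
        votes = Counter(normalize(a) for a in samples[:k])
        # On a tie, most_common returns the answer that was seen first.
        top_answer, _ = votes.most_common(1)[0]
        hits += top_answer == normalize(ref)
    return 100.0 * hits / len(references)

# Toy data (made up for illustration): 3 problems, 3 sampled answers each.
toy_samples = [["204", "204", "113"], ["25", "33", "33"], ["7", "8", "9"]]
toy_refs = ["204", "33", "10"]

print(f"pass@1     = {pass_at_1(toy_samples, toy_refs):.1f}")
print(f"majority@3 = {majority_at_k(toy_samples, toy_refs, 3):.1f}")
```

In the toy data the second problem is wrong on the first answer but right after voting, which is the kind of case that produces the gap between the two metrics.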
The cookbook's certification thresholds (8B model):
- AIME 2024 pass@1 ≥ 20 (R1-Distill level)
- MATH-500 pass@1 ≥ 70
- GPQA-Diamond pass@1 ≥ 30
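The thresholds above can be checked mechanically. The sketch below is an assumption about how such a gate might look, not the cookbook's own code: `THRESHOLDS`, `certification_report`, and `is_certified` are hypothetical names, and the two score dictionaries simply reuse the pass@1 values from the table in section 1.

```python
# Certification thresholds from the list above (pass@1, in percent).
THRESHOLDS = {"AIME 2024": 20.0, "MATH-500": 70.0, "GPQA-Diamond": 30.0}

def certification_report(scores):
    """Map each required benchmark to True/False; a missing score counts as a failure."""
    report = {}
    for bench, minimum in THRESHOLDS.items():
        score = scores.get(bench)
        report[bench] = score is not None and score >= minimum
    return report

def is_certified(scores):
    """A model is certified only if every threshold is met."""
    return all(certification_report(scores).values())

# pass@1 values copied from the benchmark table.
baseline_8b = {"AIME 2024": 5.6, "MATH-500": 47.2, "GPQA-Diamond": 25.7}
r1_distill_8b = {"AIME 2024": 28.5, "MATH-500": 78.1, "GPQA-Diamond": 36.4}

for name, scores in [("cookbook 8B baseline", baseline_8b), ("R1-Distill-8B", r1_distill_8b)]:
    print(name, certification_report(scores), "certified:", is_certified(scores))
```

Counting a missing benchmark as a failure is a deliberate choice here, so that a partial eval run cannot pass the gate by omission.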
✅ Part XII complete
1) Evaluate your trained reasoning model on the 4 benchmarks.
2) Observe the gap between pass@1 and majority@8.
3) Next up: Part XIII, Custom Kernels & Triton.