Reasoning Eval: AIME 2024/2025 + MATH-500 + GPQA-Diamond + LiveCodeBench

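The standard eval suite for a reasoning model: AIME 2024 (30 problems from the American Invitational Mathematics Examination, a qualifier for the USA Math Olympiad), AIME 2025 (the newer set), MATH-500 (500 high-school competition problems), GPQA-Diamond (graduate-level science Q&A), and LiveCodeBench (coding problems, refreshed monthly). This part covers the gap between pass@1 and majority voting (majority@64) and the cookbook's standard eval pipeline.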

Şükrü Yusuf KAYA

26-minute read

26.06.2026

Advanced


1. Reasoning Benchmark Table (early 2026)

Benchmark	Problems	Domain	Pass@1 % (cookbook 8B baseline)	Pass@1 % (R1-Distill-8B)	Pass@1 % (R1 671B)
AIME 2024	30	competition math	5.6	28.5	79.8
AIME 2025	30	competition math	4.2	24.3	76.5
MATH-500	500	high-school math	47.2	78.1	97.3
GPQA-Diamond	198	grad science	25.7	36.4	71.5
LiveCodeBench v5	400+	recent coding	18.5	36.5	65.9
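To make the table easier to work with, here is a minimal sketch that holds the same rows as data and derives two quantities from them: how many points a single problem is worth, and the gain of R1-Distill-8B over the cookbook baseline. The `Benchmark` class and its field names are hypothetical, not part of the cookbook. The LiveCodeBench size is left as `None` because the table only gives "400+".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Benchmark:
    """One row of the benchmark table (hypothetical container)."""
    name: str
    size: Optional[int]   # None when the table gives no exact count
    domain: str
    baseline: float       # pass@1 %, cookbook 8B baseline
    distill: float        # pass@1 %, R1-Distill-8B
    r1: float             # pass@1 %, R1 671B

    @property
    def points_per_problem(self) -> Optional[float]:
        # On a small set, one problem moves the score by 100 / size points.
        return None if self.size is None else 100.0 / self.size

    @property
    def distill_gain(self) -> float:
        # Gap between R1-Distill-8B and the baseline, in points.
        return round(self.distill - self.baseline, 1)


SUITE = [
    Benchmark("AIME 2024", 30, "competition math", 5.6, 28.5, 79.8),
    Benchmark("AIME 2025", 30, "competition math", 4.2, 24.3, 76.5),
    Benchmark("MATH-500", 500, "high-school math", 47.2, 78.1, 97.3),
    Benchmark("GPQA-Diamond", 198, "grad science", 25.7, 36.4, 71.5),
    Benchmark("LiveCodeBench v5", None, "recent coding", 18.5, 36.5, 65.9),
]

for b in SUITE:
    step = b.points_per_problem
    step_text = "n/a" if step is None else f"{step:.2f}"
    print(f"{b.name:<18} gain={b.distill_gain:>5}  points/problem={step_text}")
```

With 30 problems, each AIME problem is worth about 3.33 points, so small differences between models on AIME correspond to very few problems.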

Pass@1 vs Majority@64:

Pass@1: greedy decoding, a single answer per problem
Majority@64: 64 samples per problem, scored on the most frequent answer (spends extra test-time compute)

R1 671B on AIME 2024: pass@1 = 79.8, majority@64 = 86.7 (+6.9 points).
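The two metrics can be sketched in a few lines. This is a minimal illustration, not the cookbook's pipeline: the function names are hypothetical, answers are compared as exact strings (a real pipeline would normalize them first), and the toy data uses 4 samples per problem instead of 64, with the first sample standing in for the greedy answer.

```python
from collections import Counter


def pass_at_1(answers, references):
    """Percentage of problems whose single answer equals the reference."""
    correct = sum(a == r for a, r in zip(answers, references))
    return 100.0 * correct / len(references)


def majority_answer(samples):
    """Most frequent answer; on a tie, the answer seen first wins."""
    return Counter(samples).most_common(1)[0][0]


def majority_at_k(sampled_answers, references, k):
    """Score the majority vote over the first k samples of each problem."""
    voted = [majority_answer(samples[:k]) for samples in sampled_answers]
    return pass_at_1(voted, references)


# Toy data (hypothetical): 3 problems, 4 sampled answers each.
references = ["204", "25", "73"]
sampled_answers = [
    ["204", "204", "210", "204"],  # majority is correct
    ["24", "25", "25", "25"],      # first answer wrong, majority correct
    ["70", "71", "73", "72"],      # no agreement, vote falls back to "70"
]

first_answers = [samples[0] for samples in sampled_answers]
print(round(pass_at_1(first_answers, references), 1))                 # 33.3
print(round(majority_at_k(sampled_answers, references, k=4), 1))      # 66.7
```

The second problem shows where the gain comes from: a single answer can be wrong while most samples agree on the right one. The third shows the limit: voting does not help when the samples never converge.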

The cookbook's certification thresholds (8B model):

AIME 2024 pass@1 ≥ 20 (R1-Distill level)
MATH-500 pass@1 ≥ 70
GPQA-Diamond pass@1 ≥ 30
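The thresholds above can be checked mechanically. The following is a minimal sketch with hypothetical names (`THRESHOLDS`, `certification_report`, `is_certified`). It assumes scores arrive as a dict of pass@1 percentages keyed by benchmark name, and it treats a missing score as a failure.

```python
# Minimum pass@1 (%) per benchmark, taken from the list above.
THRESHOLDS = {
    "AIME 2024": 20.0,
    "MATH-500": 70.0,
    "GPQA-Diamond": 30.0,
}


def certification_report(scores):
    """Map each required benchmark to True/False; a missing score fails."""
    report = {}
    for bench, minimum in THRESHOLDS.items():
        score = scores.get(bench)
        report[bench] = score is not None and score >= minimum
    return report


def is_certified(scores):
    """A model passes only if every threshold is met."""
    return all(certification_report(scores).values())


# Scores copied from the benchmark table.
baseline_8b = {"AIME 2024": 5.6, "MATH-500": 47.2, "GPQA-Diamond": 25.7}
r1_distill_8b = {"AIME 2024": 28.5, "MATH-500": 78.1, "GPQA-Diamond": 36.4}

print(is_certified(baseline_8b))    # False
print(is_certified(r1_distill_8b))  # True
```

Returning the per-benchmark report alongside the overall verdict makes it easier to see which threshold a model misses.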

✅ Part XII complete

1) Evaluate your trained reasoning model on 4 benchmarks.
2) Compare pass@1 against majority@8 and note the gap.
3) Next part: Part XIII, Custom Kernels & Triton.
