Reasoning Eval: AIME 2024/2025 + MATH-500 + GPQA-Diamond + LiveCodeBench

Standard evaluation suite for reasoning models: AIME 2024 (30 problems from the American Invitational Mathematics Examination, the qualifying exam for the USA Math Olympiad), AIME 2025 (new), MATH-500 (500 high-school competition problems), GPQA-Diamond (graduate-level science Q&A), and LiveCodeBench (refreshed monthly). Covers the difference between pass@1 and majority voting (majority@64), and the Cookbook's standard eval pipeline.

Şükrü Yusuf KAYA

26 min read

6/26/2026

Advanced

1. Reasoning Benchmark Table (early 2026)

Benchmark	Problems	Domain	Pass@1 % (Cookbook 8B baseline)	Pass@1 % (R1-Distill-8B)	Pass@1 % (R1 671B)
AIME 2024	30	competition math	5.6	28.5	79.8
AIME 2025	30	competition math	4.2	24.3	76.5
MATH-500	500	high-school math	47.2	78.1	97.3
GPQA-Diamond	198	grad science	25.7	36.4	71.5
LiveCodeBench v5	400+	recent coding	18.5	36.5	65.9
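The benchmark sizes in the table matter when reading the scores: on a 30-problem set, one problem is worth about 3.3 points. The sketch below is a rough illustration, assuming each problem is an independent pass/fail trial scored from a single answer (a simplification, not something the Cookbook states). The helper names are hypothetical.

```python
import math

# Problem counts taken from the table above. LiveCodeBench is left out
# because the table only gives "400+" rather than an exact count.
SIZES = {"AIME 2024": 30, "AIME 2025": 30, "MATH-500": 500, "GPQA-Diamond": 198}


def points_per_problem(n_problems: int) -> float:
    """How many percentage points a single problem contributes."""
    return 100.0 / n_problems


def pass1_std_error(score_pct: float, n_problems: int) -> float:
    """Binomial standard error of a pass@1 score, in percentage points.

    Assumption: problems are independent Bernoulli trials with the same
    success probability, which is only an approximation for real benchmarks.
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)


if __name__ == "__main__":
    for name, n in SIZES.items():
        print(f"{name}: one problem = {points_per_problem(n):.2f} points")
```

Under these assumptions, small differences on AIME are much noisier than the same differences on MATH-500, simply because AIME has far fewer problems.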

Pass@1 vs Majority@64:

Pass@1: greedy decoding, a single answer per problem
Majority@64: 64 samples per problem, with the most frequent answer taken as final (test-time compute)

R1 671B on AIME 2024: pass@1 = 79.8, majority@64 = 86.7 (+6.9 points).
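The two metrics can be sketched in a few lines. This is a minimal illustration, not the Cookbook's pipeline: it assumes answers have already been extracted and normalized to strings, and it uses the first sample of each problem as a stand-in for the greedy answer. The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Most frequent answer among the samples; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def pass_at_1(samples: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Accuracy in percent using only the first answer per problem.

    Assumption: samples[i][0] stands in for the greedy-decoded answer.
    """
    hits = sum(s[0] == g for s, g in zip(samples, gold))
    return 100.0 * hits / len(gold)


def majority_at_k(samples: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Accuracy in percent after majority voting over the first k samples."""
    hits = sum(majority_vote(s[:k]) == g for s, g in zip(samples, gold))
    return 100.0 * hits / len(gold)


if __name__ == "__main__":
    # Toy data: 3 problems, 4 sampled answers each (purely illustrative).
    samples = [["42", "41", "42", "42"], ["7", "9", "9", "9"], ["3", "3", "5", "5"]]
    gold = ["42", "9", "5"]
    print(f"pass@1     = {pass_at_1(samples, gold):.1f}")
    print(f"majority@4 = {majority_at_k(samples, gold, 4):.1f}")
```

In the toy data, voting rescues the second problem (the first sample is wrong but the majority is right), while the third problem shows that a tie can still resolve to the wrong answer.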

Cookbook certification thresholds (8B model):

AIME 2024 pass@1 ≥ 20 (R1-Distill level)
MATH-500 pass@1 ≥ 70
GPQA-Diamond pass@1 ≥ 30
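The thresholds above can be checked mechanically once the scores are in. The sketch below is one possible way to do it; the dictionary layout and the function name are assumptions for illustration, and the sample scores are the 8B rows from the benchmark table.

```python
# Thresholds copied from the list above (pass@1, in percent).
THRESHOLDS = {"AIME 2024": 20.0, "MATH-500": 70.0, "GPQA-Diamond": 30.0}


def check_certification(scores: dict) -> dict:
    """Map each thresholded benchmark to True/False.

    A benchmark missing from `scores` counts as a failure.
    """
    return {
        name: scores.get(name, float("-inf")) >= minimum
        for name, minimum in THRESHOLDS.items()
    }


if __name__ == "__main__":
    # Scores from the table: Cookbook 8B baseline and R1-Distill-8B.
    baseline = {"AIME 2024": 5.6, "MATH-500": 47.2, "GPQA-Diamond": 25.7}
    distill = {"AIME 2024": 28.5, "MATH-500": 78.1, "GPQA-Diamond": 36.4}
    for label, scores in (("baseline", baseline), ("R1-Distill-8B", distill)):
        result = check_certification(scores)
        status = "PASS" if all(result.values()) else "FAIL"
        print(label, status, result)
```

With the table's numbers, the 8B baseline misses all three thresholds and R1-Distill-8B clears all three.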

✅ Part XII complete

1) Evaluate the trained reasoning model on the 4 benchmarks. 2) Compare pass@1 against majority@8 to see the gap. 3) Next: Part XIII, Custom Kernels & Triton.
