
Chinchilla Scaling Laws (2022): Hoffmann et al. — 1:1 Param:Data Revolution

Hoffmann et al.'s 2022 paper 'Training Compute-Optimal Large Language Models' corrected Kaplan: the Kaplan recipe was biased toward undertrained models. The Chinchilla recipe scales N and D equally with compute (~20 tokens per parameter). The 70B Chinchilla beats the 280B Gopher (Hoffmann). Llama-3 is Chinchilla-aware. The new compute-optimal formula, plus the post-Chinchilla overtraining trend.

Şükrü Yusuf KAYA
70 min read
Advanced
Chinchilla Scaling Laws (2022): Hoffmann et al. — The 1:1 Param:Data Revolution
🐀 Chinchilla — the paper that turned the industry around
March 2022. DeepMind — Hoffmann, Borgeaud, Mensch et al. — published 'Training Compute-Optimal Large Language Models'. The bombshell: the Kaplan recommendation was wrong. Modern large models (GPT-3, Gopher, PaLM) were undertrained. The optimal allocation scales parameter count and token count together (N and D grow 1:1 with compute, at roughly 20 tokens per parameter). The 70B Chinchilla beats the 280B Gopher with 4× fewer parameters. Llama-3 is Chinchilla-aware: 70B params, 15T tokens. The post-2022 industry shift: 'data > params'. Seventy minutes from now you will have grasped Chinchilla's mathematical anatomy, how its methodology differs, and its implications for modern LLM training.

Lesson Map (10 Sections)#

  1. Hoffmann methodology — how the approach differs from Kaplan's
  2. Training 400+ models — careful experimental design
  3. Compute-optimal frontier — N and D scale 1:1
  4. The math — the new power law
  5. Chinchilla 70B vs Gopher 280B — head-to-head
  6. Retrofitting modern models — GPT-3 and co. were undertrained
  7. Llama-3, Chinchilla-aware — 70B params / 15T tokens
  8. Post-Chinchilla overtraining — the DeepSeek-V3-style overtraining trend
  9. Implications — the industry changed course
  10. For Turkish — what the Chinchilla recipe means in practice

1-4. Chinchilla Methodology#

1.1 Difference from Kaplan#

Kaplan 2020:
  • Learning-rate schedule and training horizon not tuned to each token budget
  • Compute-optimal allocation inferred largely by extrapolation
  • Resulting models often undertrained (data-limited)
Hoffmann 2022:
  • Vary BOTH N and D simultaneously
  • 400+ training runs
  • 70M to 16B params, 5B to 500B tokens
  • Systematically explore the (N, D) space

1.2 Key finding#

Given compute C:
Optimal N* ∝ C^0.5 and optimal D* ∝ C^0.5 — N and D grow together (the '1:1' scaling), with a roughly constant ratio of about 20 tokens per parameter (a 1:20 param-to-token ratio).
In other words, D ≈ 20 × N. For 70B params that means ≈ 1.4T tokens.
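A minimal sketch of this rule of thumb in Python, assuming the standard C ≈ 6·N·D FLOP approximation used in §1.4 below; the example budget is Chinchilla's:

```python
def chinchilla_rule_of_thumb(C, tokens_per_param=20):
    """Split a FLOP budget C into (params, tokens) using C ≈ 6*N*D and D ≈ 20*N."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    N = (C / (6 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

# Chinchilla's budget, ~5.9e23 FLOPs -> roughly 70B params and 1.4T tokens
N, D = chinchilla_rule_of_thumb(5.9e23)
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```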

1.3 Updated power law#

L(N, D) = E + A / N^α + B / D^β
where E ≈ 1.69 (irreducible loss), A ≈ 406, B ≈ 410, α ≈ 0.34, β ≈ 0.28.
The exponents differ from Kaplan's and fit the observed loss surface better.
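The fitted loss surface is easy to encode directly; a small sketch using the rounded constants quoted above:

```python
# Chinchilla parametric loss fit: L(N, D) = E + A/N^alpha + B/D^beta
E, A, B = 1.69, 406.0, 410.0
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted pretraining loss for N params trained on D tokens."""
    return E + A / N**ALPHA + B / D**BETA

print(chinchilla_loss(70e9, 1.4e12))  # ~1.94 with these rounded constants
```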

1.4 Compute-optimal allocation#

Minimize L subject to C = 6 N D. Lagrangian → critical points:
N* = G × (C / 6)^a, where a = β / (α + β) ≈ 0.5
D* = G⁻¹ × (C / 6)^b, where b = α / (α + β) ≈ 0.5
with G = (αA / (βB))^{1/(α+β)}.
N and D scale equally with compute.
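A sketch of this closed form, using the fitted constants from the loss above. Note that this particular parametric fit lands on a somewhat smaller N (and more tokens per parameter) than the 20:1 rule of thumb — the paper's counting-based approaches are the ones that give a ≈ b ≈ 0.5 — so treat the printed numbers as illustrative:

```python
# Closed-form compute-optimal allocation from the fitted loss
# L(N, D) = E + A/N^alpha + B/D^beta, minimized subject to C = 6*N*D.
A, B = 406.0, 410.0
ALPHA, BETA = 0.34, 0.28

a = BETA / (ALPHA + BETA)   # ~0.45, close to 0.5
b = ALPHA / (ALPHA + BETA)  # ~0.55, close to 0.5
G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))

def optimal_allocation(C):
    """(N*, D*) minimizing the fitted loss for a FLOP budget C."""
    return G * (C / 6) ** a, (1 / G) * (C / 6) ** b

N_opt, D_opt = optimal_allocation(5.9e23)   # Chinchilla's budget
print(f"N* ≈ {N_opt:.2e} params, D* ≈ {D_opt:.2e} tokens")
# This fit gives a smaller N and more tokens than the 20-tokens-per-param rule.
```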

1.5 Chinchilla model#

Hoffmann trained Chinchilla:
  • N = 70B params
  • D = 1.4T tokens
  • C = ~5.9 × 10^23 FLOPs
Compare Gopher (DeepMind 2021):
  • N = 280B params (4x bigger)
  • D = 300B tokens (5x less)
  • C = same as Chinchilla
Same compute, and Chinchilla wins across the evaluated benchmarks — showing Gopher was undertrained.
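Plugging both configurations into the fitted loss from §1.3 illustrates the head-to-head result (an illustration with the rounded constants, not the paper's measured benchmark numbers):

```python
# Compare predicted loss at (roughly) equal compute: Chinchilla vs Gopher
E, A, B, ALPHA, BETA = 1.69, 406.0, 410.0, 0.34, 0.28
loss = lambda N, D: E + A / N**ALPHA + B / D**BETA

chinchilla = loss(70e9, 1.4e12)   # 70B params, 1.4T tokens -> ~1.94
gopher     = loss(280e9, 300e9)   # 280B params, 300B tokens -> ~1.99
print(f"Chinchilla: {chinchilla:.3f}  Gopher: {gopher:.3f}")
# The smaller, longer-trained model is predicted to reach lower loss.
```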

1.6 Retrofitting other 2020-2021 models#

Using Chinchilla framework:
  • GPT-3 175B: optimal D ≈ 3.5T (actual 300B → undertrained 12x)
  • PaLM 540B: optimal D ≈ 11T (actual 780B → undertrained 14x)
  • By this measure, all of the era's major LLMs were undertrained — see the sketch below
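A small sketch of the retrofit arithmetic, taking 20 tokens per parameter as the optimality target:

```python
def undertraining_factor(params, actual_tokens, tokens_per_param=20):
    """How far short of Chinchilla-optimal data a model was trained."""
    optimal_tokens = tokens_per_param * params
    return optimal_tokens, optimal_tokens / actual_tokens

for name, n, d in [("GPT-3 175B", 175e9, 300e9), ("PaLM 540B", 540e9, 780e9)]:
    opt, factor = undertraining_factor(n, d)
    print(f"{name}: optimal ≈ {opt:.1e} tokens, undertrained ~{factor:.0f}x")
```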

1.7 Llama applications#

  • Llama-2 7B: 2T tokens (~286:1 — already overtrained relative to Chinchilla)
  • Llama-2 70B: 2T tokens (~29:1 — close to Chinchilla-optimal)
  • Llama-3 8B: 15T tokens (~1875:1 — heavily overtrained)
  • Llama-3 70B: 15T tokens (~214:1 — overtrained)
Llama-3 deliberately overtrains for inference efficiency: a smaller model trained on more data is more cost-effective to serve.

7-10. Post-Chinchilla Era#

7.1 'Overtraining' trend#

2023-2024 industry shift:
  • Llama-3 8B: 15T tokens (~1875:1)
  • Mistral 7B: reportedly ~8T tokens (~1100:1)
  • Qwen 2: 7-15T tokens — well past Chinchilla-optimal
Why overtrain (vs. Chinchilla-optimal) — see the sketch after this list:
  • Inference cost: smaller model cheaper to serve
  • Train-time compute: amortized over many users
  • Quality keeps improving even past Chinchilla-optimal
New motto: 'small model, lots of data, deploy efficiently'.
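A toy model of why this pays off at serving time. All numbers below are illustrative assumptions (served-token volume, the two configurations), not figures from this lesson: training compute is ≈ 6·N·D, serving costs ≈ 2·N FLOPs per generated token, so lifetime cost shifts toward the smaller model as served volume grows.

```python
def lifetime_flops(params, train_tokens, served_tokens):
    """Very rough lifetime compute: 6*N*D for training + 2*N per served token."""
    train = 6 * params * train_tokens
    inference = 2 * params * served_tokens
    return train + inference

served = 1e13  # assumption: 10T tokens generated over the deployment lifetime
big_chinchilla    = lifetime_flops(70e9, 1.4e12, served)  # Chinchilla-optimal 70B
small_overtrained = lifetime_flops(8e9, 15e12, served)    # overtrained 8B (Llama-3-like)
print(f"70B @ 1.4T: {big_chinchilla:.2e} FLOPs total")
print(f" 8B @ 15T : {small_overtrained:.2e} FLOPs total")
# Ignores quality differences; the point is how serving volume shifts the balance.
```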

7.2 DeepSeek-V3 overtraining frontier#

DeepSeek-V3 (2024): 14.8T tokens for a 671B-param MoE (37B active). The Chinchilla calculation changes for MoE — compute scales with the sparse (active) parameters, not the total.
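A hedged sketch of that arithmetic — applying the 6·N·D approximation to the active parameter count is an assumption for illustration:

```python
# MoE compute with the 6*N*D approximation applied to *active* params
active_params = 37e9
tokens = 14.8e12
train_flops = 6 * active_params * tokens          # ~3.3e24 FLOPs
tokens_per_active_param = tokens / active_params  # ~400 tokens per active param
print(f"{train_flops:.2e} FLOPs, {tokens_per_active_param:.0f} tokens/active param")
```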

7.3 Mosaic + DBRX scaling#

MosaicML research (2024): papers extending scaling laws to account for inference economics — the optimal allocation shifts toward smaller, longer-trained models when expected serving volume is high.

7.4 Scaling implications for Turkish#

  • 7B Turkish model: Chinchilla-optimal ≈ 140B Turkish tokens (20:1)
  • Overtraining (Llama-3 style): 5-10T Turkish tokens — not available; roughly 30 GB of raw Turkish text is only ~5B tokens
  • Practical path: the Turkish corpus is limited → a small Turkish-only model, or a multilingual base with Turkish mixed in (see the sketch below)
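A back-of-the-envelope sketch of the corpus constraint. The bytes-per-token value is an assumption chosen to match the ~5B-token estimate above; it varies with the tokenizer:

```python
def max_chinchilla_model(corpus_bytes, bytes_per_token=6.0, tokens_per_param=20):
    """Largest Chinchilla-optimal model a monolingual corpus can support."""
    tokens = corpus_bytes / bytes_per_token
    return tokens, tokens / tokens_per_param

tokens, max_params = max_chinchilla_model(30e9)  # ~30 GB of raw Turkish text
print(f"~{tokens:.1e} tokens -> Chinchilla-optimal up to ~{max_params:.1e} params")
# ~5e9 tokens supports only a ~250M-param Chinchilla-optimal model;
# anything larger needs multilingual data, repetition, or synthetic data.
```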

7.5 Industry direction (2026)#

  • Compute getting cheaper
  • Data scarcity becoming bottleneck
  • Synthetic data + curation > raw scale
  • Quality > quantity (a theme even before Chinchilla, even more so after)

7.6 Llama-3-405B reality#

Llama-3-405B: 405B params, 15T tokens — a ~37:1 ratio, near the Chinchilla regime. The largest openly released dense model. Compute cost estimated at $500M+. Further scaling is increasingly an economic question.
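A rough sanity check of the scale involved. The per-GPU throughput and utilization below are assumptions for illustration, not reported figures:

```python
# Rough training-compute estimate for Llama-3-405B
params, tokens = 405e9, 15e12
train_flops = 6 * params * tokens          # ~3.6e25 FLOPs

# Assumptions: H100-class GPU at ~1e15 peak BF16 FLOP/s, ~40% utilization (MFU)
effective_flops_per_gpu = 1e15 * 0.40
gpu_hours = train_flops / effective_flops_per_gpu / 3600
print(f"{train_flops:.2e} FLOPs  ->  ~{gpu_hours:.1e} GPU-hours")
# Tens of millions of GPU-hours; the dollar figure depends on hardware pricing.
```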
✅ Lesson 12.2 Summary — Chinchilla
Hoffmann 2022 Chinchilla: a refutation of Kaplan. Optimal allocation: D ≈ 20 × N (about 20 tokens per parameter). The era's big models (GPT-3, PaLM) were undertrained. The 70B Chinchilla beats the 280B Gopher. Llama-3's strategy: overtraining past Chinchilla (for inference efficiency). DeepSeek-V3, Mistral and Qwen follow the same pattern. The practical upshot for Turkish: corpus scarcity → small model + multilingual mix. Lesson 12.3 is the capstone — planning your own training compute budget.

Next Lesson: Compute Budget Planning Capstone#

Lesson 12.3: plan your own LLM training budget — a compute estimator, Chinchilla-aware allocation, and a cost calculator (GPU-hours, $).

Frequently Asked Questions

Why train past Chinchilla-optimal? Inference economics: a smaller model trained on more tokens can beat a larger Chinchilla-optimal model in practice, because inference is far cheaper (fewer parameters) and the training cost is amortized over many users. Over a deployment's lifetime, inference compute typically exceeds training compute.

