
Chinchilla Scaling Laws (2022): Hoffmann et al. — 1:1 Param:Data Revolution

Hoffmann et al.'s 2022 paper 'Training Compute-Optimal Large Language Models' corrected Kaplan: the Kaplan recipe was biased toward undertrained models. The Chinchilla recipe scales N and D equally with compute (~20 tokens per parameter). The 70B Chinchilla beats the 280B Gopher (Hoffmann). Llama-3 is Chinchilla-aware. The new compute-optimal formula, plus the post-Chinchilla overtraining trend.

Şükrü Yusuf KAYA
70 min read
Advanced
Chinchilla Scaling Laws (2022): Hoffmann et al. — The 1:1 Param:Data Revolution
🐀 Chinchilla — the paper that turned the industry around
March 2022. DeepMind — Hoffmann, Borgeaud, Mensch et al. — published 'Training Compute-Optimal Large Language Models'. The bombshell: the Kaplan recommendation was wrong. Modern large models (GPT-3, Gopher, PaLM) were undertrained. The optimal allocation scales parameter count and token count together (N and D grow 1:1 with compute, at roughly 20 tokens per parameter). The 70B Chinchilla beats the 280B Gopher with 4× fewer parameters. Llama-3 is Chinchilla-aware: 70B params, 15T tokens. The post-2022 industry shift: 'data > params'. Seventy minutes from now you will have grasped Chinchilla's mathematical anatomy, how its methodology differs, and its implications for modern LLM training.

Lesson Map (10 Sections)#

  1. Hoffmann methodology — how the approach differs from Kaplan's
  2. Training 400+ models — careful experimental design
  3. Compute-optimal frontier — N and D scale 1:1
  4. The math — the new power law
  5. Chinchilla 70B vs Gopher 280B — head-to-head
  6. Retrofitting modern models — GPT-3 and co. were undertrained
  7. Llama-3, Chinchilla-aware — 70B params / 15T tokens
  8. Post-Chinchilla overtraining — the DeepSeek-V3-style overtraining trend
  9. Implications — the industry changed course
  10. For Turkish — what the Chinchilla recipe means in practice

1-4. Chinchilla Methodology#

1.1 Difference from Kaplan#

Kaplan 2020:
  • Learning-rate schedule and training horizon not tuned to each token budget
  • Compute-optimal allocation inferred largely by extrapolation
  • Resulting models often undertrained (data-limited)
Hoffmann 2022:
  • Vary BOTH N and D simultaneously
  • 400+ training runs
  • 70M to 16B params, 5B to 500B tokens
  • Systematically explore the (N, D) space

1.2 Key finding#

Given compute C:
Optimal N* ∝ C^0.5 and optimal D* ∝ C^0.5 — N and D grow together (the '1:1' scaling), with a roughly constant ratio of about 20 tokens per parameter (a 1:20 param-to-token ratio).
In other words, D ≈ 20 × N. For 70B params that means ≈ 1.4T tokens.
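A minimal sketch of this rule of thumb in Python, assuming the standard C ≈ 6·N·D FLOP approximation used in §1.4 below; the example budget is Chinchilla's:

```python
def chinchilla_rule_of_thumb(C, tokens_per_param=20):
    """Split a FLOP budget C into (params, tokens) using C ≈ 6*N*D and D ≈ 20*N."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    N = (C / (6 * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

# Chinchilla's budget, ~5.9e23 FLOPs -> roughly 70B params and 1.4T tokens
N, D = chinchilla_rule_of_thumb(5.9e23)
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```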

1.3 Updated power law#

L(N, D) = E + A / N^α + B / D^β
where E ≈ 1.69 (irreducible loss), A ≈ 406, B ≈ 410, α ≈ 0.34, β ≈ 0.28.
The exponents differ from Kaplan's and fit the observed loss surface better.
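The fitted loss surface is easy to encode directly; a small sketch using the rounded constants quoted above:

```python
# Chinchilla parametric loss fit: L(N, D) = E + A/N^alpha + B/D^beta
E, A, B = 1.69, 406.0, 410.0
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted pretraining loss for N params trained on D tokens."""
    return E + A / N**ALPHA + B / D**BETA

print(chinchilla_loss(70e9, 1.4e12))  # ~1.94 with these rounded constants
```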

1.4 Compute-optimal allocation#

Minimize L subject to C = 6 N D. Lagrangian → critical points:
N* = G × (C / 6)^a, where a = β / (α + β) ≈ 0.5
D* = G⁻¹ × (C / 6)^b, where b = α / (α + β) ≈ 0.5
with G = (αA / (βB))^{1/(α+β)}.
N and D scale equally with compute.
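A sketch of this closed form, using the fitted constants from the loss above. Note that this particular parametric fit lands on a somewhat smaller N (and more tokens per parameter) than the 20:1 rule of thumb — the paper's counting-based approaches are the ones that give a ≈ b ≈ 0.5 — so treat the printed numbers as illustrative:

```python
# Closed-form compute-optimal allocation from the fitted loss
# L(N, D) = E + A/N^alpha + B/D^beta, minimized subject to C = 6*N*D.
A, B = 406.0, 410.0
ALPHA, BETA = 0.34, 0.28

a = BETA / (ALPHA + BETA)   # ~0.45, close to 0.5
b = ALPHA / (ALPHA + BETA)  # ~0.55, close to 0.5
G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))

def optimal_allocation(C):
    """(N*, D*) minimizing the fitted loss for a FLOP budget C."""
    return G * (C / 6) ** a, (1 / G) * (C / 6) ** b

N_opt, D_opt = optimal_allocation(5.9e23)   # Chinchilla's budget
print(f"N* ≈ {N_opt:.2e} params, D* ≈ {D_opt:.2e} tokens")
# This fit gives a smaller N and more tokens than the 20-tokens-per-param rule.
```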

1.5 Chinchilla model#

Hoffmann trained Chinchilla:
  • N = 70B params
  • D = 1.4T tokens
  • C = ~5.9 × 10^23 FLOPs
Compare Gopher (DeepMind 2021):
  • N = 280B params (4x bigger)
  • D = 300B tokens (5x less)
  • C = same as Chinchilla
Same compute, and Chinchilla wins across the evaluated benchmarks — showing Gopher was undertrained.
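Plugging both configurations into the fitted loss from §1.3 illustrates the head-to-head result (an illustration with the rounded constants, not the paper's measured benchmark numbers):

```python
# Compare predicted loss at (roughly) equal compute: Chinchilla vs Gopher
E, A, B, ALPHA, BETA = 1.69, 406.0, 410.0, 0.34, 0.28
loss = lambda N, D: E + A / N**ALPHA + B / D**BETA

chinchilla = loss(70e9, 1.4e12)   # 70B params, 1.4T tokens -> ~1.94
gopher     = loss(280e9, 300e9)   # 280B params, 300B tokens -> ~1.99
print(f"Chinchilla: {chinchilla:.3f}  Gopher: {gopher:.3f}")
# The smaller, longer-trained model is predicted to reach lower loss.
```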

1.6 Retrofitting other 2020-2021 models#

Using Chinchilla framework:
  • GPT-3 175B: optimal D ≈ 3.5T (actual 300B → undertrained 12x)
  • PaLM 540B: optimal D ≈ 11T (actual 780B → undertrained 14x)
  • By this measure, all of the era's major LLMs were undertrained — see the sketch below
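A small sketch of the retrofit arithmetic, taking 20 tokens per parameter as the optimality target:

```python
def undertraining_factor(params, actual_tokens, tokens_per_param=20):
    """How far short of Chinchilla-optimal data a model was trained."""
    optimal_tokens = tokens_per_param * params
    return optimal_tokens, optimal_tokens / actual_tokens

for name, n, d in [("GPT-3 175B", 175e9, 300e9), ("PaLM 540B", 540e9, 780e9)]:
    opt, factor = undertraining_factor(n, d)
    print(f"{name}: optimal ≈ {opt:.1e} tokens, undertrained ~{factor:.0f}x")
```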

1.7 Llama applications#

  • Llama-2 7B: 2T tokens (~286:1 — already overtrained relative to Chinchilla)
  • Llama-2 70B: 2T tokens (~29:1 — close to Chinchilla-optimal)
  • Llama-3 8B: 15T tokens (~1875:1 — heavily overtrained)
  • Llama-3 70B: 15T tokens (~214:1 — overtrained)
Llama-3 deliberately overtrains for inference efficiency: a smaller model trained on more data is more cost-effective to serve.

7-10. Post-Chinchilla Era#

7.1 'Overtraining' trend#

2023-2024 industry shift:
  • Llama-3 8B: 15T tokens (~1875:1)
  • Mistral 7B: reportedly ~8T tokens (~1100:1)
  • Qwen 2: 7-15T tokens — well past Chinchilla-optimal
Why overtrain (vs. Chinchilla-optimal) — see the sketch after this list:
  • Inference cost: smaller model cheaper to serve
  • Train-time compute: amortized over many users
  • Quality keeps improving even past Chinchilla-optimal
New motto: 'small model, lots of data, deploy efficiently'.
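A toy model of why this pays off at serving time. All numbers below are illustrative assumptions (served-token volume, the two configurations), not figures from this lesson: training compute is ≈ 6·N·D, serving costs ≈ 2·N FLOPs per generated token, so lifetime cost shifts toward the smaller model as served volume grows.

```python
def lifetime_flops(params, train_tokens, served_tokens):
    """Very rough lifetime compute: 6*N*D for training + 2*N per served token."""
    train = 6 * params * train_tokens
    inference = 2 * params * served_tokens
    return train + inference

served = 1e13  # assumption: 10T tokens generated over the deployment lifetime
big_chinchilla    = lifetime_flops(70e9, 1.4e12, served)  # Chinchilla-optimal 70B
small_overtrained = lifetime_flops(8e9, 15e12, served)    # overtrained 8B (Llama-3-like)
print(f"70B @ 1.4T: {big_chinchilla:.2e} FLOPs total")
print(f" 8B @ 15T : {small_overtrained:.2e} FLOPs total")
# Ignores quality differences; the point is how serving volume shifts the balance.
```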

7.2 DeepSeek-V3 overtraining frontier#

DeepSeek-V3 (2024): 14.8T tokens for a 671B-param MoE (37B active). The Chinchilla calculation changes for MoE — compute scales with the sparse (active) parameters, not the total.
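A hedged sketch of that arithmetic — applying the 6·N·D approximation to the active parameter count is an assumption for illustration:

```python
# MoE compute with the 6*N*D approximation applied to *active* params
active_params = 37e9
tokens = 14.8e12
train_flops = 6 * active_params * tokens          # ~3.3e24 FLOPs
tokens_per_active_param = tokens / active_params  # ~400 tokens per active param
print(f"{train_flops:.2e} FLOPs, {tokens_per_active_param:.0f} tokens/active param")
```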

7.3 Mosaic + DBRX scaling#

MosaicML research (2024): papers extending scaling laws to account for inference economics — the optimal allocation shifts toward smaller, longer-trained models when expected serving volume is high.

7.4 Scaling implications for Turkish#

  • 7B Turkish model: Chinchilla-optimal ≈ 140B Turkish tokens (20:1)
  • Overtraining (Llama-3 style): 5-10T Turkish tokens — not available; roughly 30 GB of raw Turkish text is only ~5B tokens
  • Practical path: the Turkish corpus is limited → a small Turkish-only model, or a multilingual base with Turkish mixed in (see the sketch below)
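A back-of-the-envelope sketch of the corpus constraint. The bytes-per-token value is an assumption chosen to match the ~5B-token estimate above; it varies with the tokenizer:

```python
def max_chinchilla_model(corpus_bytes, bytes_per_token=6.0, tokens_per_param=20):
    """Largest Chinchilla-optimal model a monolingual corpus can support."""
    tokens = corpus_bytes / bytes_per_token
    return tokens, tokens / tokens_per_param

tokens, max_params = max_chinchilla_model(30e9)  # ~30 GB of raw Turkish text
print(f"~{tokens:.1e} tokens -> Chinchilla-optimal up to ~{max_params:.1e} params")
# ~5e9 tokens supports only a ~250M-param Chinchilla-optimal model;
# anything larger needs multilingual data, repetition, or synthetic data.
```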

7.5 Industry direction (2026)#

  • Compute getting cheaper
  • Data scarcity becoming bottleneck
  • Synthetic data + curation > raw scale
  • Quality > quantity (a theme even before Chinchilla, even more so after)

7.6 Llama-3-405B reality#

Llama-3-405B: 405B params, 15T tokens — a ~37:1 ratio, near the Chinchilla regime. The largest openly released dense model. Compute cost estimated at $500M+. Further scaling is increasingly an economic question.
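A rough sanity check of the scale involved. The per-GPU throughput and utilization below are assumptions for illustration, not reported figures:

```python
# Rough training-compute estimate for Llama-3-405B
params, tokens = 405e9, 15e12
train_flops = 6 * params * tokens          # ~3.6e25 FLOPs

# Assumptions: H100-class GPU at ~1e15 peak BF16 FLOP/s, ~40% utilization (MFU)
effective_flops_per_gpu = 1e15 * 0.40
gpu_hours = train_flops / effective_flops_per_gpu / 3600
print(f"{train_flops:.2e} FLOPs  ->  ~{gpu_hours:.1e} GPU-hours")
# Tens of millions of GPU-hours; the dollar figure depends on hardware pricing.
```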
✅ Lesson 12.2 Summary — Chinchilla
Hoffmann 2022 Chinchilla: a refutation of Kaplan. Optimal allocation: D ≈ 20 × N (about 20 tokens per parameter). The era's big models (GPT-3, PaLM) were undertrained. The 70B Chinchilla beats the 280B Gopher. Llama-3's strategy: overtraining past Chinchilla (for inference efficiency). DeepSeek-V3, Mistral and Qwen follow the same pattern. The practical upshot for Turkish: corpus scarcity → small model + multilingual mix. Lesson 12.3 is the capstone — planning your own training compute budget.

Next Lesson: Compute Budget Planning Capstone#

Lesson 12.3: plan your own LLM training budget — a compute estimator, Chinchilla-aware allocation, and a cost calculator (GPU-hours, $).

Frequently Asked Questions

Why train past Chinchilla-optimal? Inference economics: a smaller model trained on more tokens can beat a larger Chinchilla-optimal model in practice, because inference is far cheaper (fewer parameters) and the training cost is amortized over many users. Over a deployment's lifetime, inference compute typically exceeds training compute.

