Chinchilla Scaling Laws (2022): Hoffmann et al. — Parameters and Data Scale Together
Hoffmann et al.'s 2022 paper 'Training Compute-Optimal Large Language Models' corrected Kaplan: the Kaplan fits were biased by undertrained models. The Chinchilla recipe: scale N and D equally with compute, which works out to roughly 20 training tokens per parameter. The 70B Chinchilla model beats the 280B Gopher at the same compute (Hoffmann). Llama-3 is Chinchilla-aware. This lesson covers the new compute-optimal formula and the post-Chinchilla overtraining trend.
Şükrü Yusuf KAYA
70-minute read
Advanced 🐀 Chinchilla — the paper that turned the industry around
March 2022. DeepMind — Hoffmann, Borgeaud, Mensch et al. — published the paper 'Training Compute-Optimal Large Language Models'. The bombshell: Kaplan's recommendation was wrong. Modern large models (GPT-3, Gopher, PaLM) were undertrained. The optimal allocation scales parameter count and token count equally with compute, landing at roughly 20 tokens per parameter. The 70B Chinchilla model beats the 280B Gopher with 4× fewer parameters. Llama-3 is Chinchilla-aware: 70B params, 15T tokens. The post-2022 industry shift: 'data > params'. Seventy minutes from now you will have grasped Chinchilla's mathematical anatomy, how its methodology differs from Kaplan's, and its implications for modern LLM training.
Lesson Map (10 Sections)#
- Hoffmann methodology — how the approach differs from Kaplan's
- Training 400+ models — careful experimental design
- The compute-optimal frontier — N and D scale equally
- The math — a new power law
- Chinchilla 70B vs. Gopher 280B — head-to-head
- Retrofitting modern models — GPT-3 and friends were undertrained
- Llama-3 is Chinchilla-aware — 70B params / 15T tokens
- Post-Chinchilla overtraining — the DeepSeek-V3 overtraining trend
- Implications — the industry changed direction
- For Turkish — what the Chinchilla recipe means in practice
1-4. Chinchilla Methodology#
1.1 How it differs from Kaplan#
Kaplan 2020:
- Fixed model size, vary data
- Compute-optimal relation largely inferred by extrapolation
- Models often undertrained (data-limited)
Hoffmann 2022:
- Vary BOTH N and D simultaneously
- 400+ training runs
- 70M to 16B params, 5B to 500B tokens
- Truly explore (N, D) space
1.2 Key finding#
Given compute C:
Optimal N* ∝ C^0.5 and optimal D* ∝ C^0.5, so the ratio D*/N* stays roughly constant — at about 20 training tokens per parameter.
In other words: D ≈ 20 · N. For 70B params, that means 1.4T tokens.
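In code, the rule of thumb is trivial — a minimal sketch, assuming the headline 20:1 ratio (the paper's fitted optimum drifts slightly with scale):

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens for a given parameter count (D ≈ 20 · N)."""
    return tokens_per_param * n_params

print(f"{chinchilla_tokens(70e9):.2e}")  # 1.40e+12 → 1.4T tokens for a 70B model
```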
1.3 Updated power law#
L(N, D) = E + A / N^α + B / D^β
E ≈ 1.69 (irreducible loss)
A ≈ 406, B ≈ 410
α ≈ 0.34, β ≈ 0.28
The exponents differ from Kaplan's, and once models are trained to completion they reflect the actual relationship better.
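As a sanity check, a minimal sketch of the fitted loss with the rounded constants above (rounded, so the result is only approximate):

```python
E, A, B = 1.69, 406.0, 410.0   # rounded Chinchilla fit constants
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss fit from Hoffmann et al.: L(N, D) = E + A/N^α + B/D^β."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Evaluate at the Chinchilla point: N = 70B, D = 1.4T → roughly 1.94 nats
print(round(chinchilla_loss(70e9, 1.4e12), 3))
```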
1.4 Compute-optimal allocation#
Minimize L subject to C = 6 N D. Lagrangian → critical points:
N* = G · (C / 6)^a,  where a = β / (α + β) ≈ 0.5
D* = G⁻¹ · (C / 6)^b,  where b = α / (α + β) ≈ 0.5
N and D scale equally with compute.
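A minimal sketch of this allocation, assuming the exponents are exactly 0.5 and folding the 20:1 token ratio into the constant (with C = 6·N·D and D = r·N, solving gives N* = √(C / 6r)):

```python
def optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget C = 6·N·D under the constraint D = r·N."""
    n_star = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_star, tokens_per_param * n_star

# Chinchilla's approximate budget: ~5.88e23 FLOPs
n, d = optimal_allocation(5.88e23)
print(f"N* ≈ {n:.2e} params, D* ≈ {d:.2e} tokens")  # ≈ 7.0e10 params, 1.4e12 tokens
```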
1.5 Chinchilla model#
Hoffmann trained Chinchilla:
- N = 70B params
- D = 1.4T tokens
- C = ~5.9 × 10^23 FLOPs
Compare Gopher (DeepMind 2021):
- N = 280B params (4x bigger)
- D = 300B tokens (≈5× fewer)
- C = roughly the same as Chinchilla
At matched compute, Chinchilla wins across the benchmark suite — proof that Gopher was undertrained.
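The "same compute" claim is easy to check with the standard C ≈ 6·N·D training-FLOPs approximation (a sketch; the paper's accounting is more detailed, and the two budgets agree only to within ~15%):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard dense-transformer approximation: C ≈ 6 · N · D FLOPs."""
    return 6 * n_params * n_tokens

chinchilla = train_flops(70e9, 1.4e12)   # ≈ 5.88e23 FLOPs
gopher     = train_flops(280e9, 300e9)   # ≈ 5.04e23 FLOPs
print(f"Chinchilla: {chinchilla:.2e}  Gopher: {gopher:.2e}")
```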
1.6 Retrofitting other 2020-2021 models#
Using Chinchilla framework:
- GPT-3 175B: optimal D ≈ 3.5T (actual 300B → ~12× undertrained)
- PaLM 540B: optimal D ≈ 10.8T (actual 780B → ~14× undertrained)
- Every major LLM of the era was undertrained
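A quick sketch of the retrofit arithmetic, assuming the 20-tokens-per-parameter rule:

```python
def undertrained_factor(n_params: float, actual_tokens: float,
                        tokens_per_param: float = 20.0) -> float:
    """How far below the Chinchilla-optimal token count a model was trained."""
    return (tokens_per_param * n_params) / actual_tokens

print(round(undertrained_factor(175e9, 300e9), 1))  # GPT-3 175B → ~11.7x
print(round(undertrained_factor(540e9, 780e9), 1))  # PaLM 540B  → ~13.8x
```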
1.7 Llama applications#
- Llama-2 7B: 2T tokens (~286:1 — already overtrained by Chinchilla's standard)
- Llama-2 70B: 2T tokens (~29:1 — close to Chinchilla-optimal)
- Llama-3 8B: 15T tokens (~1875:1 — heavily overtrained)
- Llama-3 70B: 15T tokens (~214:1 — overtrained)
Llama-3 deliberately overtrains for inference efficiency: a smaller model trained on more data is more cost-effective to serve.
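The ratios above, computed from the token counts reported in the Llama 2 and Llama 3 releases (2T and 15T respectively):

```python
# (params, training tokens) per the Llama 2 / Llama 3 releases
models = {
    "Llama-2 7B":  (7e9,  2e12),
    "Llama-2 70B": (70e9, 2e12),
    "Llama-3 8B":  (8e9,  15e12),
    "Llama-3 70B": (70e9, 15e12),
}
for name, (n, d) in models.items():
    print(f"{name}: ~{d / n:.0f} tokens/param")
```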
7-10. Post-Chinchilla Era#
7.1 The 'overtraining' trend#
The 2023-2024 industry shift:
- Llama-3 8B: 15T tokens (~1900:1)
- Mistral 7B: reportedly ~8T tokens (>1100:1; the training set was never disclosed)
- Qwen 2: 7-15T tokens, well past Chinchilla-optimal
Why overtrain (vs. the Chinchilla optimum):
- Inference cost: a smaller model is cheaper to serve
- Training compute is a one-time cost, amortized over many users
- Quality keeps improving even past the Chinchilla-optimal token count
The new motto: 'small model, lots of data, deploy efficiently'.
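The amortization argument can be made concrete with the standard FLOP approximations (≈6·N per token for training, ≈2·N per token for a forward pass). The serving volume below is purely illustrative, not a real deployment figure:

```python
def lifetime_flops(n_params: float, train_tokens: float, serve_tokens: float) -> float:
    """Rough lifetime total: 6·N·D for training + 2·N per token served at inference."""
    return 6 * n_params * train_tokens + 2 * n_params * serve_tokens

SERVED = 1e13  # hypothetical: 10T tokens served over the model's lifetime

small = lifetime_flops(8e9, 15e12, SERVED)    # overtrained Llama-3-8B-style recipe
big   = lifetime_flops(70e9, 1.4e12, SERVED)  # Chinchilla-optimal 70B recipe
print(f"8B total: {small:.2e} FLOPs   70B total: {big:.2e} FLOPs")
```

At a large enough serving volume, the overtrained small model's higher training bill is dwarfed by its cheaper forward passes.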
7.2 DeepSeek-V3 and the overtraining frontier#
DeepSeek-V3 (2024): 14.8T tokens for a 671B-param MoE (37B active).
The Chinchilla arithmetic differs for MoE models — compute scales with the active, not total, parameter count.
7.3 Mosaic + DBRX scaling#
MosaicML research (2024) argues that the optimal allocation also depends on inference economics.
7.4 Scaling implications for Turkish#
- A 7B Turkish model: ~140B Turkish tokens would be Chinchilla-optimal (20:1)
- Overtraining it would want 5-10T Turkish tokens — which do not exist; a ~30 GB corpus is only about 5B tokens
- Practical upshot: the Turkish corpus is limited → a small Turkish-focused model built on a multilingual base mix
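The corpus-size-to-tokens estimate above can be sketched as follows; the ~6 bytes/token figure is an assumption for Turkish text under a multilingual BPE tokenizer, not a measured value:

```python
def est_tokens(corpus_gb: float, bytes_per_token: float = 6.0) -> float:
    """Rough corpus size → token count (bytes/token is a hand-wavy assumption)."""
    return corpus_gb * 1e9 / bytes_per_token

print(f"30 GB ≈ {est_tokens(30):.0e} tokens")  # ≈ 5e+09, i.e. ~5B tokens
```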
7.5 Industry direction (2026)#
- Compute getting cheaper
- Data scarcity becoming bottleneck
- Synthetic data + curation > raw scale
- Quality > quantity (already a theme before Chinchilla, even more so after)
7.6 The Llama-3-405B reality#
Llama-3-405B: 405B params, 15T tokens — a ratio of ~37:1, not far above Chinchilla's 20:1.
The largest publicly documented dense model. Training compute cost is unofficially estimated at $500M+. How much further scale makes sense is ultimately an economic question.
✅ Lesson 12.2 Summary — Chinchilla
Hoffmann's 2022 Chinchilla paper refuted Kaplan. Optimal allocation: D ≈ 20 × N (about 20 training tokens per parameter). The big modern models (GPT-3, PaLM) were undertrained. The 70B Chinchilla beats the 280B Gopher. Llama-3's strategy: overtrain past Chinchilla for inference efficiency. DeepSeek-V3, Mistral, and Qwen follow the same pattern. The practical upshot for Turkish: corpus scarcity → small model + multilingual mix. Lesson 12.3 is the capstone — planning your own training compute budget.
Next Lesson: Compute Budget Planning Capstone#
Lesson 12.3: plan your own LLM training budget — a compute estimator, Chinchilla-aware allocation, and a cost calculator (GPU-hours, $).
Frequently Asked Questions
Why deploy a small overtrained model instead of a Chinchilla-optimal large one? Inference economics: a model with far fewer parameters is much cheaper per forward pass, the one-time training cost is amortized over many users, and over a deployed model's lifetime inference compute typically exceeds training compute.