
Kaplan Scaling Laws (2020): Power Law Anatomy of LLM Performance — Compute, Data, Param Triangle

Anatomy of the Kaplan et al. 2020 paper: LLM loss follows power laws in compute (C), parameters (N), and data (D). Why the log-log plot is linear, the optimal allocation formula, the 'bigger is better' claim, and how GPT-3 (175B) was built on it. Limitations and the subsequent Chinchilla refutation.

Şükrü Yusuf KAYA
65 min read
Advanced
Kaplan Scaling Laws (2020): The Power Law Anatomy of LLM Performance — The Compute, Data, Param Triangle
📈 Kaplan 2020 — the day the LLM's 'law' was published
January 2020. Kaplan, McCandlish, Henighan, and team at OpenAI publish 'Scaling Laws for Neural Language Models'. A 28-page paper that shaped the future of AI. The claim: LLM loss follows a power law. More compute, more parameters, more data → lower loss. Predictable, smooth, with no asymptote in sight. The paper's direct offspring: GPT-3 (175B params, ~$5M of compute). The 'just go bigger' philosophy. 65 minutes from now, you will have a deep grasp of Kaplan's mathematical formulas, the log-log plot intuition, and how the scaling laws shaped the industry trajectory from GPT-3 to Llama-3.

Lesson Map (10 Sections)#

  1. Pre-2020 LLM scaling intuition: was there a limit to going bigger?
  2. Kaplan 2020 setup: experimental methodology
  3. Power law equations: L(C), L(N), L(D)
  4. Log-log plots: why they are linear
  5. Optimal allocation: given compute, how to split N and D
  6. Compute-optimal frontier: Kaplan's original recommendation
  7. GPT-3 connection: the origin of 175B params
  8. Limitations: small-model bias, undertrained models
  9. Loss → quality: how it maps to downstream tasks
  10. Follow-up papers: a preview of the Chinchilla revolution

2-4. Kaplan Math#

2.1 Setup#

The Kaplan team ran LLM training runs varying N (parameters), D (data), and C (compute), and measured validation loss.
Key question: how does the loss L depend on these three variables?

2.2 Power law formula#

L(N) = (N_c / N)^α_N
L(D) = (D_c / D)^α_D
L(C) = (C_c / C)^α_C
Values (as reported in Kaplan 2020):
  • α_N ≈ 0.076 (parameter scaling exponent)
  • α_D ≈ 0.095 (data scaling exponent)
  • α_C ≈ 0.050 (compute scaling exponent)
  • N_c, D_c, C_c — model-class-specific constants
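A minimal sketch of the three single-variable laws in Python. The constants below are roughly the magnitudes of the paper's fits; they are model-class-specific and depend on tokenization and what counts as compute, so treat them as illustrative:

```python
# The three single-variable power laws from Kaplan 2020, as plain functions.
# Constants are roughly the paper's fitted magnitudes: illustrative, not universal.

ALPHA_N, ALPHA_D, ALPHA_C = 0.076, 0.095, 0.050
N_C = 8.8e13   # non-embedding parameters
D_C = 5.4e13   # tokens
C_C = 3.1e8    # PF-days (note the unit: petaflop/s-days, not raw FLOPs)

def loss_from_params(n: float) -> float:
    """L(N): data and compute assumed effectively unlimited."""
    return (N_C / n) ** ALPHA_N

def loss_from_data(d: float) -> float:
    """L(D): a large model trained with early stopping."""
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_pf_days: float) -> float:
    """L(C): loss achievable with compute-optimal training."""
    return (C_C / c_pf_days) ** ALPHA_C

print(f"L(N=175e9) ≈ {loss_from_params(175e9):.2f}")  # GPT-3-scale N
```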

2.3 Power law intuition#

L ∝ N^(−α_N). Doubling params → loss × 2^(−0.076) ≈ 0.95 × original: a ~5% loss reduction per doubling.
Scaling up further:
  • 10x: 10^(−0.076) ≈ 0.84
  • 100x: ≈ 0.70
  • 1000x: ≈ 0.59
Diminishing returns, but no plateau.
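The multipliers above are one line of arithmetic to verify:

```python
# Loss multiplier when scaling params by k, under L ∝ N^(-0.076).
for k in (2, 10, 100, 1000):
    print(f"{k:>5}x params -> loss x {k ** -0.076:.2f}")
# 2x -> 0.95, 10x -> 0.84, 100x -> 0.70, 1000x -> 0.59
```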

2.4 Log-log plot#

Taking logs of L = (N_c / N)^α_N gives:
log L = −α_N · log N + const
On log-log axes a power law is therefore a straight line, which is what makes extrapolation predictable.
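This linearity is what makes the law useful in practice: fit a line to a few cheap runs, read the slope as −α, and extrapolate. A sketch with numpy, where the 'measurements' are synthetic (generated from the law itself plus noise) just to show the mechanics:

```python
import numpy as np

# Synthetic loss measurements following L = (N_c / N)^alpha, with mild noise.
rng = np.random.default_rng(0)
alpha_true, n_c = 0.076, 8.8e13
n = np.logspace(6, 9, 8)   # model sizes from 1M to 1B params
loss = (n_c / n) ** alpha_true * np.exp(rng.normal(0.0, 0.005, n.size))

# Power law  =>  log L = -alpha * log N + const: a linear fit in log-log space.
slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
print(f"fitted alpha ≈ {-slope:.3f}")   # should recover ~0.076

# Extrapolate the fitted line two orders of magnitude past the data.
n_big = 1e11
pred = np.exp(intercept + slope * np.log(n_big))
print(f"predicted loss at N = 1e11: {pred:.3f}")
```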

2.5 Combined formula#

Kaplan also fits a single joint law in both variables (Eq. 1.5 in the paper):
L(N, D) = [ (N_c / N)^(α_N / α_D) + D_c / D ]^α_D
It recovers L(N) as D → ∞ and L(D) as N → ∞, and captures the penalty for training a large model on too little data.
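A direct implementation of the joint law, reusing the illustrative constants from above, to compare two allocations with the same compute budget C = 6·N·D:

```python
# Kaplan's joint law L(N, D); constants are illustrative magnitudes as before.
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13

def loss(n: float, d: float) -> float:
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# Same compute (6 * N * D = 3.15e23 FLOPs), split two ways:
print(f"{loss(175e9, 300e9):.2f}")   # GPT-3-like: big model, modest data (~1.73)
print(f"{loss(10e9, 5.25e12):.2f}")  # small model, lots of data (~2.00)
```

Under Kaplan's fit, the parameter-heavy split wins, which is the quantitative heart of the 'bigger is better' claim.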

2.6 Compute = 6 × N × D (rough)#

For a dense transformer, training compute is approximately:
C ≈ 6 × N × D (FLOPs)
The factor 6 is ~2 FLOPs per parameter per token in the forward pass (one multiply-accumulate) plus ~4 in the backward pass. It ignores attention and other overhead, but is accurate enough for compute-optimal analysis.
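Plugging in GPT-3's numbers as a worked example:

```python
# Standard training-FLOPs estimate for a dense transformer: C ≈ 6 * N * D.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

c = train_flops(175e9, 300e9)                # GPT-3: 175B params, 300B tokens
print(f"{c:.2e} FLOPs")                      # ~3.15e+23
print(f"{c / (1e15 * 86400):.0f} PF-days")   # ~3646 petaflop/s-days
```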

5-7. Compute-Optimal Allocation#

5.1 Given C, optimize N and D#

Fixed compute budget C. The question: how much of it goes to N and how much to D?

5.2 Kaplan recommendation#

A simplified additive parametrization (with irreducible loss L_∞):
L(N, D) = L_∞ + (N_c / N)^α_N + (D_c / D)^α_D
Optimizing subject to C = 6 N D, Kaplan's conclusion: scale N much faster than D. Their frontier fits give N_opt ∝ C^0.73 and D_opt ∝ C^0.27, so for 10x compute, scale N by ~5.4x and D by ~1.9x.
Note that the raw exponents point the other way: α_D (0.095) > α_N (0.076), i.e. per doubling, data buys slightly more loss reduction than parameters. The parameter-heavy recommendation instead comes from Kaplan's compute-frontier analysis, which evaluated large models stopped far short of convergence; that methodological choice is exactly what Chinchilla later corrected (Lesson 12.2).
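A sketch of the Kaplan-style allocation rule using the frontier exponents above; the scaling is expressed relative to a reference run, so no fitted constants are needed:

```python
# Kaplan-style compute-optimal allocation: N_opt ∝ C^0.73, D_opt ∝ C^0.27.
# Relative form: multiply the reference run's N and D by these factors.
A_N, A_D = 0.73, 0.27   # frontier exponents from Kaplan's fits

def scale_allocation(c_ratio: float) -> tuple[float, float]:
    """Given c_ratio times the reference compute, how much to grow N and D."""
    return c_ratio ** A_N, c_ratio ** A_D

for c_ratio in (10, 100, 1000):
    n_mult, d_mult = scale_allocation(c_ratio)
    print(f"{c_ratio:>5}x compute -> {n_mult:6.1f}x params, {d_mult:4.1f}x data")
# 10x -> ~5.4x / ~1.9x,  100x -> ~28.8x / ~3.5x,  1000x -> ~155x / ~6.5x
```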

5.3 GPT-3 origin#

Following Kaplan, OpenAI scaled GPT-3 to 175B params while keeping the data modest (300B tokens).
  • N = 175B (large)
  • D = 300B tokens (modest)
  • Compute: ~ $5M
Result: GPT-3 (Brown et al. 2020). Strong few-shot learning, validating the 'bigger is better' thesis.

5.4 GPT-3 success#

A breakthrough in few-shot performance, and the start of the trillion-dollar industry trajectory.

5.5 Pre-Chinchilla era pattern#

OpenAI, Google, and DeepMind, 2020-2022: very large models, modest data.
  • GPT-3: 175B params / 300B tokens
  • PaLM: 540B / 780B
  • Gopher: 280B / 300B
All undertrained by Chinchilla's standard (Lesson 12.2); the sketch below quantifies by how much.
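A quick yardstick for 'undertrained': compare each run's tokens-per-parameter ratio with the ~20:1 rule of thumb that Chinchilla later derived (the rule itself is a Lesson 12.2 topic; it is used here only as a reference point):

```python
# Tokens per parameter for the pre-Chinchilla flagships vs. the ~20:1
# compute-optimal rule of thumb from Hoffmann et al. 2022.
runs = {
    "GPT-3":  (175e9, 300e9),
    "PaLM":   (540e9, 780e9),
    "Gopher": (280e9, 300e9),
}
CHINCHILLA_RATIO = 20  # approx. tokens per parameter at the compute-optimum

for name, (n_params, n_tokens) in runs.items():
    ratio = n_tokens / n_params
    deficit = CHINCHILLA_RATIO / ratio
    print(f"{name:>7}: {ratio:4.1f} tokens/param -> ~{deficit:.0f}x less data than optimal")
```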
✅ Lesson 12.1 Summary — Kaplan Scaling Laws
Kaplan 2020: LLM loss follows a power law, L ∝ N^(−0.076). Log-log plots are linear, so extrapolation is predictable. GPT-3's origin: 175B params, ~$5M of compute, following the Kaplan recommendation. Pre-Chinchilla pattern: very large models, modest data. Diminishing returns but no plateau: 'bigger is better'. Limitations: undertrained models, small-model bias. In Lesson 12.2 we turn to the Chinchilla refutation.

Next Lesson: The Chinchilla 2022 Revolution#

Lesson 12.2: Hoffmann et al. 2022, 'Training Compute-Optimal Large Language Models', corrected Kaplan: the models were undertrained, and N and D should be scaled together, roughly 1:1. Plus Llama-3's Chinchilla-aware training.

Frequently Asked Questions

What did Kaplan get wrong? The methodology: every run used a fixed learning-rate schedule and token budget regardless of model size, which biased the fits toward undertrained models. Chinchilla 2022 corrected this. The Kaplan paper's mathematical framework is right, but the optimal allocation calculation was wrong.
