
Kaplan Scaling Laws (2020): Power Law Anatomy of LLM Performance — Compute, Data, Param Triangle

Anatomy of the Kaplan et al. 2020 paper: LLM loss follows power laws in compute (C), parameters (N), and data (D). Why the log-log plot is linear, the optimal allocation formula, the 'bigger is better' claim, and how GPT-3 (175B) was built on it. Limitations and the subsequent Chinchilla refutation.

Şükrü Yusuf KAYA
65 min read
Advanced
Kaplan Scaling Laws (2020): The Power Law Anatomy of LLM Performance — The Compute, Data, Param Triangle
📈 Kaplan 2020 — the day the LLM's 'law' was published
January 2020. Kaplan, McCandlish, Henighan, and team at OpenAI publish 'Scaling Laws for Neural Language Models'. A 28-page paper that shaped the future of AI. The claim: LLM loss follows a power law. More compute, more parameters, more data → lower loss. Predictable, smooth, with no asymptote in sight. The paper's direct offspring: GPT-3 (175B params, ~$5M of compute). The 'just go bigger' philosophy. 65 minutes from now, you will have a deep grasp of Kaplan's mathematical formulas, the log-log plot intuition, and how the scaling laws shaped the industry trajectory from GPT-3 to Llama-3.

Lesson Map (10 Sections)#

  1. Pre-2020 LLM scaling intuition: was there a limit to going bigger?
  2. Kaplan 2020 setup: experimental methodology
  3. Power law equations: L(C), L(N), L(D)
  4. Log-log plots: why they are linear
  5. Optimal allocation: given compute, how to split N and D
  6. Compute-optimal frontier: Kaplan's original recommendation
  7. GPT-3 connection: the origin of 175B params
  8. Limitations: small-model bias, undertrained models
  9. Loss → quality: how it maps to downstream tasks
  10. Follow-up papers: a preview of the Chinchilla revolution

2-4. Kaplan Math#

2.1 Setup#

The Kaplan team ran LLM training runs varying N (parameters), D (data), and C (compute), and measured validation loss.
Key question: how does the loss L depend on these three variables?

2.2 Power law formula#

L(N) = (N_c / N)^α_N
L(D) = (D_c / D)^α_D
L(C) = (C_c / C)^α_C
Values (as reported in Kaplan 2020):
  • α_N ≈ 0.076 (parameter scaling exponent)
  • α_D ≈ 0.095 (data scaling exponent)
  • α_C ≈ 0.050 (compute scaling exponent)
  • N_c, D_c, C_c — model-class-specific constants
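A minimal sketch of the three single-variable laws in Python. The constants below are roughly the magnitudes of the paper's fits; they are model-class-specific and depend on tokenization and what counts as compute, so treat them as illustrative:

```python
# The three single-variable power laws from Kaplan 2020, as plain functions.
# Constants are roughly the paper's fitted magnitudes: illustrative, not universal.

ALPHA_N, ALPHA_D, ALPHA_C = 0.076, 0.095, 0.050
N_C = 8.8e13   # non-embedding parameters
D_C = 5.4e13   # tokens
C_C = 3.1e8    # PF-days (note the unit: petaflop/s-days, not raw FLOPs)

def loss_from_params(n: float) -> float:
    """L(N): data and compute assumed effectively unlimited."""
    return (N_C / n) ** ALPHA_N

def loss_from_data(d: float) -> float:
    """L(D): a large model trained with early stopping."""
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_pf_days: float) -> float:
    """L(C): loss achievable with compute-optimal training."""
    return (C_C / c_pf_days) ** ALPHA_C

print(f"L(N=175e9) ≈ {loss_from_params(175e9):.2f}")  # GPT-3-scale N
```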

2.3 Power law intuition#

L ∝ N^(−α_N). Doubling params → loss × 2^(−0.076) ≈ 0.95 × original: a ~5% loss reduction per doubling.
Scaling up further:
  • 10x: 10^(−0.076) ≈ 0.84
  • 100x: ≈ 0.70
  • 1000x: ≈ 0.59
Diminishing returns, but no plateau.
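The multipliers above are one line of arithmetic to verify:

```python
# Loss multiplier when scaling params by k, under L ∝ N^(-0.076).
for k in (2, 10, 100, 1000):
    print(f"{k:>5}x params -> loss x {k ** -0.076:.2f}")
# 2x -> 0.95, 10x -> 0.84, 100x -> 0.70, 1000x -> 0.59
```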

2.4 Log-log plot#

Taking logs of L = (N_c / N)^α_N gives:
log L = −α_N · log N + const
On log-log axes a power law is therefore a straight line, which is what makes extrapolation predictable.
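This linearity is what makes the law useful in practice: fit a line to a few cheap runs, read the slope as −α, and extrapolate. A sketch with numpy, where the 'measurements' are synthetic (generated from the law itself plus noise) just to show the mechanics:

```python
import numpy as np

# Synthetic loss measurements following L = (N_c / N)^alpha, with mild noise.
rng = np.random.default_rng(0)
alpha_true, n_c = 0.076, 8.8e13
n = np.logspace(6, 9, 8)   # model sizes from 1M to 1B params
loss = (n_c / n) ** alpha_true * np.exp(rng.normal(0.0, 0.005, n.size))

# Power law  =>  log L = -alpha * log N + const: a linear fit in log-log space.
slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
print(f"fitted alpha ≈ {-slope:.3f}")   # should recover ~0.076

# Extrapolate the fitted line two orders of magnitude past the data.
n_big = 1e11
pred = np.exp(intercept + slope * np.log(n_big))
print(f"predicted loss at N = 1e11: {pred:.3f}")
```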

2.5 Combined formula#

Kaplan also fits a single joint law in both variables (Eq. 1.5 in the paper):
L(N, D) = [ (N_c / N)^(α_N / α_D) + D_c / D ]^α_D
It recovers L(N) as D → ∞ and L(D) as N → ∞, and captures the penalty for training a large model on too little data.
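A direct implementation of the joint law, reusing the illustrative constants from above, to compare two allocations with the same compute budget C = 6·N·D:

```python
# Kaplan's joint law L(N, D); constants are illustrative magnitudes as before.
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13

def loss(n: float, d: float) -> float:
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# Same compute (6 * N * D = 3.15e23 FLOPs), split two ways:
print(f"{loss(175e9, 300e9):.2f}")   # GPT-3-like: big model, modest data (~1.73)
print(f"{loss(10e9, 5.25e12):.2f}")  # small model, lots of data (~2.00)
```

Under Kaplan's fit, the parameter-heavy split wins, which is the quantitative heart of the 'bigger is better' claim.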

2.6 Compute = 6 × N × D (rough)#

For a dense transformer, training compute is approximately:
C ≈ 6 × N × D (FLOPs)
The factor 6 is ~2 FLOPs per parameter per token in the forward pass (one multiply-accumulate) plus ~4 in the backward pass. It ignores attention and other overhead, but is accurate enough for compute-optimal analysis.
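Plugging in GPT-3's numbers as a worked example:

```python
# Standard training-FLOPs estimate for a dense transformer: C ≈ 6 * N * D.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

c = train_flops(175e9, 300e9)                # GPT-3: 175B params, 300B tokens
print(f"{c:.2e} FLOPs")                      # ~3.15e+23
print(f"{c / (1e15 * 86400):.0f} PF-days")   # ~3646 petaflop/s-days
```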

5-7. Compute-Optimal Allocation#

5.1 Given C, optimize N and D#

Fixed compute budget C. The question: how much of it goes to N and how much to D?

5.2 Kaplan recommendation#

A simplified additive parametrization (with irreducible loss L_∞):
L(N, D) = L_∞ + (N_c / N)^α_N + (D_c / D)^α_D
Optimizing subject to C = 6 N D, Kaplan's conclusion: scale N much faster than D. Their frontier fits give N_opt ∝ C^0.73 and D_opt ∝ C^0.27, so for 10x compute, scale N by ~5.4x and D by ~1.9x.
Note that the raw exponents point the other way: α_D (0.095) > α_N (0.076), i.e. per doubling, data buys slightly more loss reduction than parameters. The parameter-heavy recommendation instead comes from Kaplan's compute-frontier analysis, which evaluated large models stopped far short of convergence; that methodological choice is exactly what Chinchilla later corrected (Lesson 12.2).
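A sketch of the Kaplan-style allocation rule using the frontier exponents above; the scaling is expressed relative to a reference run, so no fitted constants are needed:

```python
# Kaplan-style compute-optimal allocation: N_opt ∝ C^0.73, D_opt ∝ C^0.27.
# Relative form: multiply the reference run's N and D by these factors.
A_N, A_D = 0.73, 0.27   # frontier exponents from Kaplan's fits

def scale_allocation(c_ratio: float) -> tuple[float, float]:
    """Given c_ratio times the reference compute, how much to grow N and D."""
    return c_ratio ** A_N, c_ratio ** A_D

for c_ratio in (10, 100, 1000):
    n_mult, d_mult = scale_allocation(c_ratio)
    print(f"{c_ratio:>5}x compute -> {n_mult:6.1f}x params, {d_mult:4.1f}x data")
# 10x -> ~5.4x / ~1.9x,  100x -> ~28.8x / ~3.5x,  1000x -> ~155x / ~6.5x
```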

5.3 GPT-3 origin#

Following Kaplan, OpenAI scaled GPT-3 to 175B params while keeping the data modest (300B tokens).
  • N = 175B (large)
  • D = 300B tokens (modest)
  • Compute: ~ $5M
Result: GPT-3 (Brown et al. 2020). Strong few-shot learning, validating the 'bigger is better' thesis.

5.4 GPT-3 success#

A breakthrough in few-shot performance, and the start of the trillion-dollar industry trajectory.

5.5 Pre-Chinchilla era pattern#

OpenAI, Google, and DeepMind, 2020-2022: very large models, modest data.
  • GPT-3: 175B params / 300B tokens
  • PaLM: 540B / 780B
  • Gopher: 280B / 300B
All undertrained by Chinchilla's standard (Lesson 12.2); the sketch below quantifies by how much.
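A quick yardstick for 'undertrained': compare each run's tokens-per-parameter ratio with the ~20:1 rule of thumb that Chinchilla later derived (the rule itself is a Lesson 12.2 topic; it is used here only as a reference point):

```python
# Tokens per parameter for the pre-Chinchilla flagships vs. the ~20:1
# compute-optimal rule of thumb from Hoffmann et al. 2022.
runs = {
    "GPT-3":  (175e9, 300e9),
    "PaLM":   (540e9, 780e9),
    "Gopher": (280e9, 300e9),
}
CHINCHILLA_RATIO = 20  # approx. tokens per parameter at the compute-optimum

for name, (n_params, n_tokens) in runs.items():
    ratio = n_tokens / n_params
    deficit = CHINCHILLA_RATIO / ratio
    print(f"{name:>7}: {ratio:4.1f} tokens/param -> ~{deficit:.0f}x less data than optimal")
```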
✅ Lesson 12.1 Summary — Kaplan Scaling Laws
Kaplan 2020: LLM loss follows a power law, L ∝ N^(−0.076). Log-log plots are linear, so extrapolation is predictable. GPT-3's origin: 175B params, ~$5M of compute, following the Kaplan recommendation. Pre-Chinchilla pattern: very large models, modest data. Diminishing returns but no plateau: 'bigger is better'. Limitations: undertrained models, small-model bias. In Lesson 12.2 we turn to the Chinchilla refutation.

Next Lesson: The Chinchilla 2022 Revolution#

Lesson 12.2: Hoffmann et al. 2022, 'Training Compute-Optimal Large Language Models', corrected Kaplan: the models were undertrained, and N and D should be scaled together, roughly 1:1. Plus Llama-3's Chinchilla-aware training.

Frequently Asked Questions

What did Kaplan get wrong? The methodology: every run used a fixed learning-rate schedule and token budget regardless of model size, which biased the fits toward undertrained models. Chinchilla 2022 corrected this. The Kaplan paper's mathematical framework is right, but the optimal allocation calculation was wrong.
