Do scaling laws work the same for Turkish?

Generally yes, but the **constants differ**. The power-law exponents (α, β) appear to be largely language-agnostic, while the **irreducible loss E** and the **scale constants A and B** depend on the language and corpus. For Turkish: (1) **E is likely higher**, because Turkish has more complex morphology and is therefore less predictable; (2) **A and B** have to be fit to the characteristics of a Turkish corpus. In practice, Chinchilla's 20:1 token-to-parameter ratio is **approximately** right for Turkish as well; only the **baseline loss starts higher**. Modules 14 (Pretrain Data) and 58 (Turkish NLP) cover this in detail.

Is 60%+ MFU practical or a marketing claim?

Practical but **very hard**. Published figures at that level are rare: the Llama 3 paper, for example, reports 38-43% BF16 MFU on its H100 clusters. Reaching 60%+ takes (1) a well-tuned **3D parallelism** configuration, (2) **custom CUDA kernels** on the hot path, (3) **communication overlapped** with compute, and (4) native **mixed-precision** ops. For most Turkish companies, 30-40% MFU is already considered good; 60%+ requires a frontier-level engineering team.

Is $100M training cost real or frontier lab hype?

Real. GPT-4 is rumored to have cost ~$60-100M in compute alone, and GPT-5 likely $300M+. A frontier training run means a 25K-50K H100 cluster running for months at $3-4 per GPU-hour. On top of that come infrastructure ($50M+ for the data center), salaries ($50M+ for the team), and data acquisition and curation ($20M+), for a **total of $500M+** for a GPT-5-class model. That is why only 5-7 labs operate at the frontier (OpenAI, Anthropic, DeepMind, xAI, Mistral, DeepSeek, Meta). The open-weight Llama 4 represents a $1B+ investment by Meta.

Are scaling laws dead with diminishing returns?

No, they are **evolving**. Pure parameter scaling shows diminishing returns, but: (1) **data quality scaling** is still strong (synthetic data, curation); (2) **compute scaling for reasoning** has opened a new axis through test-time compute (o1, R1); (3) **architecture innovations** (MoE, MLA) increase effective scale; (4) **multimodal scaling** trains on text, image, and audio jointly. Modern scaling laws are **multi-dimensional**. In 2026, a pure "scale up N" strategy yields marginal gains, but coordinated multi-axis scaling remains the frontier approach.

Practical scaling advice for Turkish startups?

Three tiers: (1) **Bootstrap** ($1-10K): fine-tune Llama 3 8B on Turkish instruction data. (2) **Seed-funded** ($50-200K): continued pretraining of Llama 3 8B on 100B Turkish tokens. (3) **Series A** ($1M+): a custom Turkish pretrain of a 3-8B model. **Recommendation**: start with tier 1 and scale up after finding product-market fit. Don't try to compete with frontier labs; optimize for a niche. TÜBİTAK ARDEB grants in Türkiye fall in a suitable $500K-2M range. Module 60 (Türkiye AI Ecosystem) has the details.

Scaling Laws Intuition: Kaplan, Chinchilla, Post-Chinchilla — Mathematical Planning of LLM Training

Complete analysis of the mathematical foundations of LLM training: Kaplan 2020 power laws, Chinchilla 2022 compute-optimal theorem, post-Chinchilla over-training (Llama 3 approach), inference-aware scaling (Sardana 2023), μP hyperparameter transfer, FLOP calculation, MFU optimization.

Şükrü Yusuf KAYA

60 min read

5/13/2026

Advanced


💰 The budget math of LLM training

Should you train a 1B model on 10T tokens, or a 10B model on 1T tokens? GPT-3, Llama, and Chinchilla each answered this question differently, and a wrong answer wastes millions of dollars. This lesson teaches the scientific foundations of LLM training and lays the groundwork for Module 16 (Scaling Laws) and Module 17 (Pre-training Compute). After 60 minutes, you will be able to plan an LLM training budget mathematically.

Lesson Map (Detailed)

What are scaling laws, and why were they discovered?
Kaplan 2020 power laws: the basic formulas
Chinchilla 2022: revising Kaplan
Post-Chinchilla: the Llama 3 over-training paradigm
Inference-aware scaling (Sardana 2023)
FLOP calculation: the 6ND approximation and the full formula
MFU (Model FLOPs Utilization)
μP parameterization: hyperparameter transfer
Compute budget worksheet
A Turkish LLM budget example
Diminishing returns: the limits of scaling

1. What Are Scaling Laws, and Why Were They Discovered?

Scaling laws are the mathematical relationship between model performance (loss) and model size (N), training data (D), and compute (C).

Absent from classical ML

In classical ML, the answer to "is more data or a bigger model always better?" depended on context. For neural networks, the early evidence brought some surprises and some skepticism.

Kaplan 2020 changed that

Kaplan et al. at OpenAI, in "Scaling Laws for Neural Language Models" (January 2020), showed empirically that language-model loss follows a power law in N, D, and C.

Why does it matter?

Budget planning: how do you train the best possible model with a $1M budget?
Trade-off: parameters or data?
Frontier prediction: forecasting what a GPT-5 might look like
Scientific paradigm: neural networks scale predictably, which was a surprising discovery

"Scaling is all you need"

Sutton's Bitter Lesson (Module 3.2) plus scaling laws became the dominant philosophy of 2020-2024. As of 2025, diminishing returns have become visible, but scaling is still the main engine.

2. Kaplan 2020: The Basic Formulas

Kaplan, McCandlish, Henighan et al., "Scaling Laws for Neural Language Models", 2020.

Main findings

The test loss L is fit by the following power laws:

a) As a function of N (data and compute not limiting)

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}

α_N ≈ 0.076

N_c ≈ 8.8 × 10^13

b) As a function of D (parameters and compute not limiting)

L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}

α_D ≈ 0.095

D_c ≈ 5.4 × 10^13

c) As a function of C (model size and data chosen optimally for the compute budget)

L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}

α_C ≈ 0.050

Combined formula

When both N and D are finite:

L(N, D) \approx \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
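As a rough numerical check, the fits above can be evaluated directly. The sketch below is a minimal Python transcription that uses only the constants quoted in this section; the function names are illustrative and not taken from the paper.

```python
# Fitted constants quoted above (Kaplan et al., 2020).
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13


def kaplan_loss_n(n_params):
    """L(N): loss when data and compute are not limiting."""
    return (N_C / n_params) ** ALPHA_N


def kaplan_loss_d(n_tokens):
    """L(D): loss when parameters and compute are not limiting."""
    return (D_C / n_tokens) ** ALPHA_D


def kaplan_loss(n_params, n_tokens):
    """Combined L(N, D) for finite model size and finite data."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


# The combined formula reduces to L(N) when data is effectively unlimited,
# and to L(D) when the model is effectively unlimited.
print(kaplan_loss(1e9, 1e30), kaplan_loss_n(1e9))
print(kaplan_loss(1e30, 1e11), kaplan_loss_d(1e11))
```

The two printed pairs should agree closely, which is a quick way to confirm that the combined formula is consistent with the single-variable laws.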

Compute-optimal allocation (Kaplan)

Given a compute budget C, the optimal N and D scale as:

N_{opt} \propto C^{0.73}

D_{opt} \propto C^{0.27}

In other words, scale N much faster than D. According to Kaplan, when compute grows, making the model bigger pays off more than adding data.

The design of GPT-3

GPT-3 (2020) was built following Kaplan: 175B parameters and 300B tokens (parameter:token ≈ 1:1.7).

The problem

Chinchilla later argued that Kaplan's allocation was wrong.

3. Chinchilla 2022: Correcting Kaplan

Hoffmann, Borgeaud, Mensch et al. (DeepMind), "Training Compute-Optimal Large Language Models", 2022.

Main claim

Kaplan's allocation was off. As compute grows, N and D should be scaled in equal proportion.

Optimal ratio: N : D ≈ 1 : 20 (parameter : token)

Empirical basis

The Chinchilla team trained 400+ models (70M-16B parameters, 5B-500B tokens) and found the optimal allocation for each compute budget. The result differed dramatically from Kaplan's.

New formulas

L(N, D) \approx E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

A ≈ 406.4

B ≈ 410.7

E ≈ 1.69

α ≈ 0.34

β ≈ 0.28

E is the irreducible loss: the natural entropy of the data (the Shannon lower bound).

Compute-optimal

Optimizing N and D jointly gives:

N_{opt} \propto C^{0.50}

D_{opt} \propto C^{0.50}

Equal scaling! This is very different from Kaplan's 0.73/0.27.
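The practical difference between the two sets of exponents is easiest to see for a concrete jump in compute. The sketch below asks how much N and D should grow when the budget grows 10x, using only the exponents quoted in this lesson; the helper name is illustrative.

```python
def growth_factors(compute_multiplier, n_exponent, d_exponent):
    """How much N and D grow when compute grows by `compute_multiplier`."""
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent


kaplan_n, kaplan_d = growth_factors(10, 0.73, 0.27)
chinchilla_n, chinchilla_d = growth_factors(10, 0.50, 0.50)

print(f"Kaplan:     N x{kaplan_n:.2f}, D x{kaplan_d:.2f}")
print(f"Chinchilla: N x{chinchilla_n:.2f}, D x{chinchilla_d:.2f}")
```

In both cases the two factors multiply back to 10 (because C is proportional to N × D), but Kaplan puts most of the extra compute into parameters while Chinchilla splits it evenly.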

The Chinchilla model

The resulting Chinchilla 70B is smaller than GPT-3 175B but was trained on far more data (1.4T vs 300B tokens). It outperforms GPT-3 with far fewer parameters.

Llama 1, 2, and 3 follow the same strategy.

Who was right?

Chinchilla. Subsequent work (Llama 3, and reportedly GPT-4) has borne it out. Kaplan's hyperparameters (especially the learning-rate decay schedule) were sub-optimal, which is why the N-D allocation came out wrong.

4. Post-Chinchilla: Llama 3 Over-Training

Chinchilla's compute-optimal rule says D ≈ 20N. Llama 3 8B exceeds this by roughly 94x: 15T tokens, against a Chinchilla optimum of about 160B tokens.

Why over-train?

Inference-aware thinking:

Chinchilla compute-optimal: assumes training is the expensive part and inference is cheap
Practical reality: training happens once (e.g. $50M)
                   inference happens billions of times ($50M+ per year)

Total cost = training + inference. If inference dominates:

Use a smaller model (faster inference)
Train it on more data (to compensate)

Llama 3.1 8B example

Chinchilla optimum: 8B × 20 = 160B tokens
Llama 3.1 actual: 15T tokens (94x more!)
Result: Llama 3.1 8B performs close to Llama 2 70B (with ~9x fewer parameters)
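The over-training factor is a one-line calculation. A minimal sketch, assuming the 20 tokens-per-parameter rule of thumb as the baseline (the function name is illustrative):

```python
def overtraining_factor(n_params, n_tokens, tokens_per_param=20):
    """Ratio of actual training tokens to the Chinchilla-optimal token count."""
    chinchilla_tokens = tokens_per_param * n_params
    return n_tokens / chinchilla_tokens


factor = overtraining_factor(n_params=8e9, n_tokens=15e12)
tokens_per_parameter = 15e12 / 8e9

print(f"over-training factor: {factor:.1f}x")
print(f"tokens per parameter: {tokens_per_parameter:.0f}")
```

For the 8B / 15T configuration this gives about 94x, or roughly 1,875 tokens per parameter instead of 20.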

Cost trade-off

Llama 2 70B inference: $7-10 per million tokens
Llama 3 8B inference: $0.5-1 per million tokens

Same usage: ~10x cheaper inference

10x cheaper inference × billions of calls >>> 5x more expensive training (paid once).

Sardana 2023 formalizes this

Sardana et al., "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws" (2023).

Include the inference cost in the objective:

D_critical: the optimal D for a model of size N at a given inference volume
Result: D_critical >> 20N in modern production settings

Practical message

For a production model, raise Chinchilla's 20:1 ratio to 50-100:1. If inference dominates, go smaller and train longer.
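The trade-off can be sketched in FLOPs rather than dollars. The example below assumes roughly 6ND FLOPs for training and roughly 2N FLOPs per token at inference (forward pass only), and asks at what lifetime inference volume a small over-trained model becomes cheaper overall than a larger Chinchilla-style one. The 70B / 1.4T comparison model is a hypothetical stand-in, and treating the two models as equal in quality is an assumption made for illustration.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Training FLOPs (~6ND) plus inference FLOPs (~2N per token served)."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens


def break_even_inference_tokens(small, large):
    """Inference volume at which the small model's extra training pays off.

    `small` and `large` are (n_params, train_tokens) tuples.
    """
    (n_small, d_small), (n_large, d_large) = small, large
    extra_training = 6 * (n_small * d_small - n_large * d_large)
    saving_per_token = 2 * (n_large - n_small)
    return extra_training / saving_per_token


small = (8e9, 15e12)     # 8B parameters, 15T tokens (over-trained)
large = (70e9, 1.4e12)   # hypothetical 70B model at 20 tokens per parameter

break_even = break_even_inference_tokens(small, large)
print(f"break-even at about {break_even:.2e} inference tokens")
```

Beyond the break-even volume, every additional token served makes the smaller model the cheaper choice, which is the intuition behind pushing the ratio well past 20:1 for production models.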

5. FLOP Calculation: The 6ND Approximation

The classic formula for estimating the compute cost of LLM training:

C_{training} \approx 6 \times N \times D

C: training FLOPs.

N: parameter count.

D: training token count.

6: constant factor (forward 2 + backward 4).

Why 6?

Forward pass: ~2 FLOPs per parameter (multiply + add)
Backward pass: ~2x the forward pass (chain-rule operations)
Total: 2 + 4 = 6 FLOPs per parameter per token

Llama 3 8B example

N = 8e9 parameters
D = 15e12 tokens
C = 6 × 8e9 × 15e12 = 7.2e23 FLOPs

Assume an H100 sustains ~1.5e15 FLOP/s in BF16 (an optimistic figure, above the dense BF16 peak of roughly 1e15). Naively:

training_time = 7.2e23 / 1.5e15 = 4.8e8 seconds ≈ 15.2 years on 1 GPU

With 8,000 H100s: about 17 hours. (Meta used a ~25,000-H100 cluster.)
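The same arithmetic as a small script. The 1.5e15 FLOP/s throughput is the lesson's round figure and assumes perfect utilization, so the results are a lower bound on wall-clock time rather than a realistic estimate.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def training_flops(n_params, n_tokens):
    """6ND approximation: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


flops = training_flops(n_params=8e9, n_tokens=15e12)

per_gpu_flops_per_s = 1.5e15              # assumed sustained throughput per H100
seconds_on_one_gpu = flops / per_gpu_flops_per_s
years_on_one_gpu = seconds_on_one_gpu / SECONDS_PER_YEAR
hours_on_8000_gpus = seconds_on_one_gpu / 8000 / 3600

print(f"total FLOPs:        {flops:.2e}")
print(f"one GPU:            {years_on_one_gpu:.1f} years")
print(f"8,000 GPUs (ideal): {hours_on_8000_gpus:.1f} hours")
```

Perfect linear scaling across 8,000 GPUs is also an idealization; communication overhead is one of the reasons real runs take longer.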

A more precise formula

Count only non-embedding parameters and add the attention term:

C \approx 6 \times N_{non-embed} \times D + 12 \times L \times d_{model} \times T \times D

The second term accounts for attention's quadratic dependence on sequence length (L = number of layers, d_model = hidden size, T = context length).
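To see how much the attention term matters, the sketch below evaluates both terms for an 8B-class shape. The shape values (32 layers, hidden size 4096, 8K context, about 7B non-embedding parameters) are approximate, illustrative assumptions rather than figures from this lesson.

```python
def training_flops_with_attention(n_non_embed, n_tokens, n_layers, d_model, context_len):
    """Return (dense_term, attention_term) of the more precise FLOP formula."""
    dense = 6 * n_non_embed * n_tokens
    attention = 12 * n_layers * d_model * context_len * n_tokens
    return dense, attention


dense, attention = training_flops_with_attention(
    n_non_embed=7e9,    # assumed non-embedding parameter count
    n_tokens=15e12,
    n_layers=32,        # assumed
    d_model=4096,       # assumed
    context_len=8192,   # assumed
)
attention_share = attention / dense

print(f"dense term:      {dense:.2e} FLOPs")
print(f"attention term:  {attention:.2e} FLOPs")
print(f"attention/dense: {attention_share:.0%}")
```

At short contexts the attention term is a small correction; it grows linearly with T per token, so at long contexts it is no longer negligible next to 6ND.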

Converting to cost

An H100-hour costs roughly $2-4 in the cloud. Pretraining cost for Llama 3 8B under the naive estimate above (~17 hours × 8,000 GPUs × $2.5):

training_cost = 17 × 8000 × 2.5 ≈ $340K

This assumes perfect utilization. Add realistic MFU, infrastructure, salaries, debugging, and multiple restarts, and the real cost lands around $5-10M.

GPT-4-class models (rumored): $60-100M training cost.

6. MFU: Model FLOPs Utilization

A GPU's spec sheet lists peak FLOPs (H100 SXM: 1979 TFLOPS BF16 with structured sparsity, roughly half that for dense matmuls). Real workloads use only a fraction of that peak.

MFU formula

\text{MFU} = \frac{\text{achieved FLOPs}}{\text{peak FLOPs}}
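In practice MFU is usually estimated from measured training throughput. A minimal sketch, assuming the 6ND approximation for the model FLOPs; the throughput number is hypothetical, and the per-GPU peak of 989e12 FLOP/s is the dense BF16 figure for an H100 SXM, chosen here as an assumption.

```python
def mfu(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    achieved = 6 * n_params * tokens_per_second   # 6 FLOPs per parameter per token
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak


utilization = mfu(
    n_params=8e9,
    tokens_per_second=500_000,    # hypothetical measured cluster throughput
    n_gpus=64,
    peak_flops_per_gpu=989e12,    # assumed dense BF16 peak per GPU
)
print(f"MFU: {utilization:.1%}")
```

Because the peak sits in the denominator, the choice of peak figure (dense vs sparse, BF16 vs FP8) changes the reported MFU substantially, so it is worth stating which one was used.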

Typical values

Workload	MFU
Naive PyTorch eager	15-25%
With FlashAttention	35-45%
Megatron-LM	45-55%
Best in class (frontier-lab claims)	55-65%

Why is MFU < 100%?

Memory bandwidth bottleneck: loading matrices is slow
Communication: all-reduce in distributed training
Kernel launch overhead: many small ops
Numeric precision: BF16 vs FP16 vs FP8

MFU optimization

Modern infrastructure (Module 17):

FlashAttention: up to ~2x throughput
Mixed precision: 2-3x
3D parallelism: scalability for large models
Custom kernels (Triton): gains on edge cases
torch.compile: an extra 10-20%

The Turkish perspective

H100 clusters are still rare in Türkiye. In practice, teams rent from cloud providers such as Modal, Runpod, and Lambda Labs. Monitoring MFU is critical because you pay $2-4 for every GPU-hour.

7. μP Parameterization: Hyperparameter Transfer

Yang, Hu et al. 2022, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer". Microsoft Research.

Problem

Hyperparameters (lr, init) that are optimal for a small model break down on a large one. Why? The variance of the forward pass changes with scale, so the optimal learning rate has to scale as well.

μP (maximal update parameterization)

Make initialization and learning rate scale-aware. The result: hyperparameters found to be optimal on a small model remain optimal on the large model.

In practice

1. Run a hyperparameter sweep on a small model (e.g. 1B)
2. Find the optimal lr and init, e.g. lr = 3e-4, std = 0.02
3. Scale up to 70B WITH μP scaling
4. The same hyperparameters work!
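What "μP scaling" means in step 3 can be illustrated for one parameter group. The sketch below applies a commonly cited simplified rule for hidden weight matrices trained with Adam: learning rate shrinks as 1/width and init standard deviation as 1/sqrt(width), relative to the base model. It is an assumption-laden illustration, not the `mup` library's API; embeddings, the output layer, and the attention scale follow different rules that are omitted here, and the widths are hypothetical.

```python
import math


def mup_hidden_hparams(base_lr, base_std, base_width, target_width):
    """Width-scaled lr and init std for hidden weight matrices (simplified, Adam).

    Assumes lr ~ 1/width and init std ~ 1/sqrt(width) relative to the base model.
    """
    width_mult = target_width / base_width
    return {
        "lr": base_lr / width_mult,
        "init_std": base_std / math.sqrt(width_mult),
    }


# Hyperparameters tuned on a narrow proxy model (hypothetical widths).
scaled = mup_hidden_hparams(base_lr=3e-4, base_std=0.02,
                            base_width=1024, target_width=8192)
print(scaled)
```

The point of the design is that the sweep is done once at the base width, and the values for the wide model are derived rather than re-tuned.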

Advantages

Massive savings: a 1B sweep costs ~$1K, a 70B sweep ~$10M. With μP, the 1B sweep is enough.
Predictable scaling: a clean transfer from research to production

Adoption

Microsoft (the original lab)
Some open-source projects (Cerebras, with Cerebras-GPT)
Most frontier labs probably use it but do not disclose it

Limitations

Implementation is complex (init + lr + attention scale)
Real savings only at large scale; overhead for small models
Module 17 (Distributed Training) covers the details

8. Compute Budget Calculation: A Worksheet

How do you plan the compute budget for an LLM project?

Step 1: Set the target performance

Loss target: e.g., L < 2.0 on Turkish data
Benchmark: e.g., MMLU >70%, GSM8K >50%
Application: e.g., customer support

Step 2: Decide on the model size

Chinchilla optimum: given C, N_opt = sqrt(C/120), which follows from C = 6ND with D = 20N. Post-Chinchilla (Llama 3 style): N_opt = sqrt(C/500) or smaller.

Step 3: Data needs

Compute D = C / (6N).

Or: Chinchilla → 20N, Llama 3 → 50-100N.
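Steps 2 and 3 can be combined into one helper. Substituting D = r·N into C = 6ND gives N = sqrt(C / (6r)), so r = 20 reproduces the sqrt(C/120) rule and sqrt(C/500) corresponds to roughly 83 tokens per parameter. A minimal sketch (the function name and the example budget are illustrative):

```python
import math


def allocate(compute_flops, tokens_per_param=20):
    """Split a FLOP budget into (N, D), given C = 6ND and D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


budget = 7.2e23   # example budget: the 6ND cost of an 8B model on 15T tokens

n_chinchilla, d_chinchilla = allocate(budget, tokens_per_param=20)
n_overtrained, d_overtrained = allocate(budget, tokens_per_param=100)

print(f"20:1  -> N = {n_chinchilla:.2e}, D = {d_chinchilla:.2e}")
print(f"100:1 -> N = {n_overtrained:.2e}, D = {d_overtrained:.2e}")
```

For the same budget, the 20:1 rule suggests a model of roughly 77B parameters, while a 100:1 ratio suggests roughly 35B parameters trained on more than twice as many tokens.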

Step 4: GPU-hour estimation

N = 8B, D = 15T tokens
C = 6 × 8e9 × 15e12 = 7.2e23 FLOPs

H100 at 40% MFU:
   sustained = 1.5e15 × 0.4 = 6e14 FLOP/s
   GPU-hours = 7.2e23 / 6e14 / 3600 ≈ 3.3e5 = 330,000 GPU-hours

Step 5: Cost

$3 per H100-hour (cloud)
total = 330,000 × $3 ≈ $1M
plus: storage, networking, salaries, debugging → 2-3x
realistic budget: $2-3M for Llama 3 8B-style training
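Steps 4 and 5 as a reusable calculation. The defaults mirror the lesson's assumptions (1.5e15 FLOP/s per H100, 40% MFU, $3 per GPU-hour, 2-3x overhead); all of them are inputs to adjust, not measured values.

```python
def gpu_hours(n_params, n_tokens, peak_flops=1.5e15, mfu=0.4):
    """GPU-hours needed under the 6ND approximation at a given MFU."""
    sustained = peak_flops * mfu                  # FLOP/s actually achieved per GPU
    return 6 * n_params * n_tokens / sustained / 3600


def budget_range(hours, price_per_hour=3.0, overhead=(2, 3)):
    """Raw compute cost plus a (low, high) range after overhead multipliers."""
    compute_cost = hours * price_per_hour
    return compute_cost, compute_cost * overhead[0], compute_cost * overhead[1]


hours = gpu_hours(n_params=8e9, n_tokens=15e12)
compute_cost, low, high = budget_range(hours)

print(f"GPU-hours:    {hours:,.0f}")
print(f"compute cost: ${compute_cost:,.0f}")
print(f"realistic:    ${low:,.0f} to ${high:,.0f}")
```

Halving the assumed MFU doubles every number downstream, which is why the MFU assumption deserves as much scrutiny as the GPU price.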

Excel/Sheets template

Parameter	Value	Formula
Target N	8e9	input
Token ratio	100	post-Chinchilla
D	8e11	N × ratio
FLOPs	3.84e22	6 × N × D
H100 hours @ 40% MFU	17,778	FLOPs / (1.5e15 × 0.4 × 3600)
Cost @ $3/hr	$53K	hours × $3
Real cost (3x)	$160K	× 3

This is the same 8B model with far less data. Llama 3 8B's 15T tokens is about 19x more, which puts the budget at $3M+.

9. Turkish LLM Budget Example

Scenario: training an 8B model for Turkish customer support.

Approach 1: From-scratch pretrain

N = 8B
D = 4T tokens (a Turkish-rich corpus; 20T is hard to reach for Turkish)
FLOPs = 6 × 8e9 × 4e12 = 1.9e23
H100 hours @ 40% MFU: ~89,000
Cost @ $3/hr: ~$267K
Real budget: $800K-1M

Not practical: in reality this calls for an enterprise-level budget of $5M+. It suits Trendyol or another large company.

Approach 2: Continued pretrain (on top of Llama 3)

Base: Llama 3.1 8B (Meta has already spent $3-5M)
Continued pretrain: 100B Turkish tokens
FLOPs = 6 × 8e9 × 1e11 = 4.8e21
H100 hours @ 40% MFU: 2,222
Cost: $7K
Real budget: $25-50K

Reasonable for a funded startup. Most Turkish LLMs take this route.

Approach 3: Fine-tune

Base: Llama 3.1 8B
Fine-tune: 1M Turkish instruction examples
FLOPs: ~1e19 (very small)
H100 hours: ~5
Cost: $15-20
Real budget: $500 (engineering + eval)

Ideal for a quick POC. This is where most Turkish companies start.

Comparison

Approach	Cost	Duration	Quality
From-scratch	$800K-1M	3-6 months	Best (ideal)
Continued pretrain	$25-50K	2-4 weeks	Good
Fine-tune only	$500	1-3 days	Enough for a domain

Module 19 (Fine-tuning Decisions) and capstone C12 (TurkInstruct-100K) cover this topic in detail.
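The raw compute figures for the three approaches can be reproduced side by side. The sketch below uses the lesson's assumptions (1.5e15 FLOP/s per H100, 40% MFU, $3 per GPU-hour) and takes the fine-tuning FLOPs as the rough 1e19 estimate quoted above; the dictionary keys are illustrative labels.

```python
SUSTAINED_FLOPS = 1.5e15 * 0.4    # assumed FLOP/s per H100 at 40% MFU
PRICE_PER_HOUR = 3.0              # assumed cloud price in USD

scenario_flops = {
    "from_scratch": 6 * 8e9 * 4e12,        # 8B params, 4T tokens
    "continued_pretrain": 6 * 8e9 * 1e11,  # 8B params, 100B tokens
    "fine_tune": 1e19,                     # rough estimate from the text
}


def hours_and_cost(flops):
    """Convert a FLOP count into (H100 hours, raw compute cost in USD)."""
    hours = flops / SUSTAINED_FLOPS / 3600
    return hours, hours * PRICE_PER_HOUR


results = {name: hours_and_cost(flops) for name, flops in scenario_flops.items()}

for name, (hours, cost) in results.items():
    print(f"{name:20s} {hours:>12,.1f} h   ${cost:>12,.0f}")
```

These are compute-only numbers; the "real budget" rows in the lesson add engineering, evaluation, and failed runs on top.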

10. Diminishing Returns: The Limits of Scaling

2020-2024: "scale = improvement" was undeniable. 2024-2026: diminishing returns became evident.

Empirical observation

GPT-3 (175B) → GPT-4 (rumored ~1.5T): large improvement
GPT-4 → GPT-4 Turbo: moderate
GPT-4 Turbo → GPT-4o: small
GPT-4o → GPT-5 (auto-routing): mixed

The reason: scaling alone now yields marginal returns. While compute grows multiplicatively, accuracy improves only incrementally.
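The shape of these diminishing returns already follows from a power-law loss with an irreducible floor. The sketch below uses the Chinchilla constants quoted earlier in this lesson (E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28) together with the 20:1 rule of thumb, and prints the predicted loss for each 10x step in compute. It illustrates the functional form only and is not a claim about any specific model.

```python
import math

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss_at_budget(compute_flops, tokens_per_param=20):
    """Predicted loss when C = 6ND is split with D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


budgets = [1e21, 1e22, 1e23, 1e24, 1e25]
losses = [loss_at_budget(c) for c in budgets]
gains = [before - after for before, after in zip(losses, losses[1:])]

for c, loss in zip(budgets, losses):
    print(f"C = {c:.0e}  predicted loss = {loss:.3f}")
print("loss reduction per 10x compute:", [round(g, 3) for g in gains])
```

Each additional 10x of compute buys a smaller loss reduction than the previous one, and the loss can never drop below E.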

Why?

Data quality saturation: Common Crawl is largely exhausted
Architectural limits: the Transformer paradigm has its limits
Loss landscape: models are approaching the irreducible loss
Diminishing value of extra dimensions: going to 1T parameters adds little

The frontier's new axes

Reasoning (test-time compute): o1, R1
Multimodal: text + image + audio + video
Agentic: tool use, planning
Data quality: synthetic, curated
Architecture innovation: MoE, MLA, state-space

These axes are orthogonal to scale. They are taking over from pure scaling.

2026 reality

"Scaling still matters, but it is no longer enough on its own. Modern progress is a combination of scaling + architecture + reasoning + data quality."

Module 11 (Modern Architectures) and Module 25 (Reasoning) cover these in detail.

11. Mini Exercises

Compute calculation: 70B parameters, 2T tokens. What are the FLOPs and the H100-hours at 50% MFU?
Kaplan vs Chinchilla: $10M of compute. What are N and D according to Kaplan? According to Chinchilla?
Llama 3.1 over-training: 8B parameters, 15T tokens. How many times over the Chinchilla ratio is this?
MFU improvement: how much does going from 25% to 50% MFU reduce training time?
Turkish scenario: a 1B Turkish model trained from scratch. What is a realistic budget?

What Did We Learn in This Lesson?

✓ Kaplan 2020: the basic power laws, N >> D scaling
✓ Chinchilla 2022: N:D = 1:20 compute-optimal, equal scaling
✓ Post-Chinchilla: Llama 3's 50-100:1 over-training
✓ Inference-aware scaling (Sardana 2023)
✓ FLOP calculation: the 6ND formula and the attention term
✓ MFU optimization: from 25% to 55%, the modern best
✓ μP parameterization: hyperparameter transfer
✓ Compute budget planning: a step-by-step spreadsheet
✓ Turkish LLM budgets: from-scratch vs continued pretraining vs fine-tuning
✓ Diminishing returns: scaling alone is no longer enough

Module 4 progress: lesson 6 of 8

Next Lesson

4.7 Emergent Capabilities: Real, or a Measurement Artifact? The GPT-3 paper claimed 'emergent abilities at scale'; Schaeffer 2023, 'Are Emergent Abilities a Mirage?', later challenged that claim. Metric design, threshold effects, and the modern empirical view. Which abilities are truly emergent?

Frequently Asked Questions

Was Kaplan wrong?

Not completely, only **partially**. Kaplan correctly identified the power-law structure (loss as a power-law function of N, D, and C). What was wrong was the optimal N-D allocation, caused by a sub-optimal learning-rate schedule. Chinchilla fixed this with more careful hyperparameter choices. **Practical message**: models designed according to Kaplan between 2020 and 2022 (GPT-3, OPT, BLOOM) ended up 'under-trained'. With the same compute, a smaller model trained on more data would have been better. Llama 1 and its successors learned this lesson.
