If I do manual `.bfloat16()` cast inside PyTorch's autocast, will it clash?

No clash, but it is usually **unnecessary**. autocast picks the dtype per operation automatically. A manual cast brings three costs: (1) Redundant conversions, when autocast casts the tensor again for the next op. (2) A performance hit from the extra memory copy. (3) Bugs, when the wrong dtype reaches an operator that autocast would have kept in FP32. In practice, trust the autocast region and cast manually only when **disabling autocast** or **forcing precision in a special layer**. Modern code treats autocast as set-and-forget.

How much faster is FP8 than BF16 on H100?

**Theoretical 2x**, practical ~1.4-1.7x. H100 dense Tensor Core throughput is ~989 TFLOPS in BF16 and twice that, ~2 PFLOPS, in FP8. Real workloads are limited by memory bandwidth, activation handling, and precision overhead, so they land 20-30% below the theoretical figure. The DeepSeek-V3 paper reports roughly 1.5x in practice. If 1.4-1.7x becomes mainstream, the compute cost of a standard pretraining run drops by roughly 30-40% (1 − 1/1.4 ≈ 29%, 1 − 1/1.7 ≈ 41%).

Is there an accuracy difference between AdamW8bit and AdamW?

Empirically **negligible** (Dettmers et al., 2022). For 7B-70B models the accuracy difference is under 0.5%. The memory savings are significant (about 38% in this lesson's 70B accounting). It is widely used when fine-tuning open models such as Mistral and Llama 3. In practice: start with AdamW8bit and switch to classical AdamW if you see instability. It is commonly combined with Module 21 (PEFT).

Why is computing loss in FP32 important?

Cross-entropy loss involves a **log + sum** over the vocabulary. With a 200K+ vocabulary, BF16's 8 bits of precision cannot resolve the small probabilities inside that sum, and log(softmax) becomes numerically unstable. PyTorch's `F.cross_entropy` is autocast-aware: even if the input is bf16, the **internal computation is FP32**. If you write it manually: `loss = -F.log_softmax(logits.float(), dim=-1).gather(...)`. In practice, PyTorch's built-in losses are safe under autocast; when writing your own loss, don't forget the FP32 cast.
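To make the reduction problem concrete, here is a toy pure-Python illustration. It emulates BF16 rounding with `struct` and sums 200,000 equal terms, as a log-sum-exp over a uniform vocabulary would. The helper `to_bf16` and the one-add-at-a-time accumulation are assumptions made for illustration; real GPU kernels accumulate differently, so treat this as a sketch of the failure mode rather than of PyTorch's behavior.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest BF16 value (round-to-nearest-even).

    Hypothetical helper: BF16 is emulated as the top 16 bits of a float32.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def logsumexp_bf16(logits):
    """Log-sum-exp with every intermediate value rounded to BF16."""
    m = max(logits)
    acc = 0.0
    for z in logits:
        acc = to_bf16(acc + to_bf16(math.exp(z - m)))
    return to_bf16(m + math.log(acc))


def logsumexp_ref(logits):
    """Reference log-sum-exp in Python floats (stand-in for FP32 accumulation)."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


vocab_size = 200_000
logits = [0.0] * vocab_size  # uniform distribution over the vocabulary

low = logsumexp_bf16(logits)
ref = logsumexp_ref(logits)
print(f"BF16 accumulation: {low:.3f}, reference: {ref:.3f}")
```

In this emulation the BF16 accumulator stops growing once it reaches 256, because 257 is not representable and rounds back to 256. The resulting log-sum-exp is far below the reference value of log(200000), which is the kind of error an FP32 reduction avoids.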

What's the most common mistake in mixed precision LLM fine-tuning?

Top 3: (1) **FP16 + new GPU**: using FP16 + GradScaler on A100/H100 when BF16 is better. Typical mistake of engineers following old tutorials. (2) **Manual half-casting loss**: writing `loss.half()` causes instability. autocast handles it. (3) **Missing master weights**: when writing your own optimizer, no FP32 master = cumulative precision loss. In modern frameworks (DeepSpeed, FSDP) default is correct — manual tweaking is risky.

Would you try FP8 in your own fine-tune in 2026?

**Not yet for personal/small projects.** Reasons: (1) **Hardware**: H100 or newer is required, and hourly rental is expensive in Türkiye ($3-4). (2) **Tooling**: the NVIDIA Transformer Engine library has a learning curve, and debugging is complex. (3) **Quality risk**: hidden quality drops in edge cases. (4) **Maintainability**: the setup needs revisiting after 6 months. **Recommendation**: stay on BF16 and hold off on FP8 in production until 2027. FP8 only makes sense at **frontier scale** (custom pretraining with many billions of parameters).

Mixed Precision Training in Production: A Detailed Guide to BF16, FP16, and FP8


Lesson Map

Why is mixed precision necessary?
autocast: region semantics
GradScaler: dynamic scale factor
BF16 vs FP16 production decision matrix
FP8 native training (DeepSeek-V3 case study)
Master weights — FP32 backup pattern
Loss spike investigation
Gradient norm monitoring
Optimizer states precision
Production checklist

1. Why Is Mixed Precision Necessary?

Problems with pure FP32

Memory: 4 bytes per parameter. A 70B model needs 280GB for the weights alone; with activations, gradients, and optimizer states the total approaches 1TB.
Compute: NVIDIA H100 delivers ~67 TFLOPS in FP32 versus ~989 TFLOPS in BF16, a 15x difference.

Problems with pure FP16/BF16

Range/precision limits: gradients for some parameters underflow (FP16 loses precision below ~6.1e-5 and flushes to zero below ~6e-8)
Weights without an FP32 master copy: updates smaller than the gap between adjacent representable values are rounded away

Mixed precision as the solution

Forward + backward: BF16 (or FP16)
Weight + optimizer state: FP32
Loss accumulation: FP32
Layer norm, softmax: FP32 (numerical stability)
Matrix multiplications: BF16 (fast on Tensor Cores)

This is a best-of-both-worlds pattern and the standard for modern LLM pretraining.

2. autocast: Region Semantics

PyTorch's torch.amp.autocast context manager automatically lowers selected operations to half precision.

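import torch
from torch.amp import autocast

# model, criterion, optimizer, x, and y are assumed to be defined and on the GPU
with autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
    loss = criterion(out, y)

# backward and the optimizer step run outside the autocast region
loss.backward()
optimizer.step()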

Which ops get autocast?

PyTorch keeps an internal op list:

Lowered to half precision (BF16)

Matrix multiplication (matmul, addmm)
Convolution
Linear layer
RNN cell

Kept in FP32 (numerical stability)

Softmax
Layer normalization
Log, Exp
Sum, mean (reductions)
Loss functions (cross-entropy)

Why this choice?

Matmul and conv are Tensor Core friendly (very fast in BF16). Reductions (softmax, layernorm) need FP32 precision because sums risk overflow or underflow.

Custom decisions

with autocast(device_type="cuda", dtype=torch.bfloat16):
    # Auto-downcast
    h = self.linear(x)

    # Force FP32 (custom override)
    with autocast(device_type="cuda", enabled=False):
        h_fp32 = self.special_op(h.float())

    # Continue in BF16
    out = self.output(h_fp32.to(torch.bfloat16))

autocast vs explicit dtype

autocast (recommended): works per operation, and PyTorch picks the dtype. Explicit bfloat16 cast: manual and error-prone (reasoning about dtypes operator by operator is hard).

Enabling TF32

PyTorch 2.0+:

torch.set_float32_matmul_precision("high")

This enables TF32 (the Tensor Core FP32 mode on Ampere and newer GPUs). It typically gives a 2-3x speedup on FP32 matmuls with no other code changes.

3. GradScaler: Dynamic Scale Factor

Needed only for FP16; BF16 does not need it.

Why does FP16 need it?

FP16 range: ±65,504. Minimum normal value: ~6.1e-5. Subnormals extend down to ~6e-8.

In LLM training, gradients typically fall between 1e-3 and 1e-7. The small end of that range loses precision or underflows to 0 in FP16, so the affected parameters get no update and learning stalls.

How GradScaler solves it

loss × scale → scaled_loss
scaled_loss.backward() → scaled_gradients
scaler.unscale_(optimizer) → gradients (in FP32)
optimizer.step() → normal update

The loss is multiplied by a large scale (e.g. 65536), which shifts the gradients into the range FP16 can represent. The gradients are unscaled before the optimizer step, so the update itself is unchanged.
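The effect of scaling can be checked without a GPU. The sketch below uses the standard library's half-precision `struct` format to round values to FP16; `to_fp16` and the gradient value `1e-8` are hypothetical choices made for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8   # hypothetical gradient below FP16's smallest subnormal (~6e-8)
scale = 65536.0    # 2**16

unscaled = to_fp16(true_grad)          # stored directly in FP16
scaled = to_fp16(true_grad * scale)    # stored after loss scaling
recovered = scaled / scale             # unscaled again in higher precision

print(f"without scaling: {unscaled}")
print(f"with scaling:    {recovered:.3e}")
```

Without scaling the value rounds to zero; with scaling it survives the FP16 round trip with only a small relative error.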

Dynamic scale

GradScaler adapts the scale:

If gradients overflow (inf/nan): decrease the scale and skip the step
After N consecutive successful steps: increase the scale

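import torch
from torch.amp import GradScaler, autocast  # torch.cuda.amp.GradScaler is deprecated in recent releases

scaler = GradScaler("cuda")  # initial scale 65536

# model, criterion, optimizer, and dataloader are assumed to be defined
for batch, target in dataloader:
    optimizer.zero_grad()

    with autocast(device_type="cuda", dtype=torch.float16):
        out = model(batch)
        loss = criterion(out, target)

    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.unscale_(optimizer)         # unscale in place so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)             # skips optimizer.step() if inf/nan gradients were found
    scaler.update()                    # adjust the scale for the next iteration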
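The adaptive policy itself is small enough to model in plain Python. The class below is a toy sketch, not PyTorch's implementation: `ToyDynamicScaler` and its method names are hypothetical, and the default factors (halve on overflow, double after a clean streak) are assumptions chosen to resemble common settings.

```python
import math


class ToyDynamicScaler:
    """Minimal model of a dynamic loss-scale policy (not PyTorch's implementation)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads) -> bool:
        """Return True if the optimizer step should run, False if it is skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale *= self.backoff_factor   # overflow: shrink the scale
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            self.scale *= self.growth_factor    # long clean streak: grow the scale
            self._good_steps = 0
        return True


scaler = ToyDynamicScaler(growth_interval=3)    # short interval so the demo is visible
history = []
for grads in ([0.1, 0.2], [float("inf"), 0.1], [0.1], [0.2], [0.3]):
    stepped = scaler.update(grads)
    history.append((stepped, scaler.scale))
print(history)
```

In the demo, the second step overflows, so it is skipped and the scale is halved; three clean steps later the scale doubles back.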

Why is it not needed for BF16?

BF16 range: ±3.4e38 (same as FP32). Underflow almost never occurs, so GradScaler is unnecessary.

Summary

FP16 + GradScaler: for older hardware (Volta V100)
BF16 (no GradScaler): modern hardware (A100, H100, B200)

4. BF16 vs FP16: Production Decision Matrix

Criterion	BF16	FP16
Exponent bits	8	5
Mantissa bits	7	10
Max value	~3.4e38	65,504
Min normal	~1.2e-38	6.1e-5
Precision	~10⁻²	~10⁻³
Tensor Core support	A100+	Volta+
GradScaler needed	No	Yes
LLM pretrain	✓ Standard	✗ Niche
Vision/CV	✓	✓ (legacy)
Older hardware (V100)	✗	✓
Stability	High	Moderate
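The numeric rows of this table follow directly from the exponent and mantissa bit counts. As a quick check, the sketch below derives them for any IEEE-style format that reserves the top exponent code for inf/NaN; the function name `ieee_like_stats` is hypothetical.

```python
def ieee_like_stats(exp_bits: int, man_bits: int) -> dict:
    """Max value, min normal, and epsilon for an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "max": (2 - 2.0 ** -man_bits) * 2.0 ** bias,   # largest finite value
        "min_normal": 2.0 ** (1 - bias),               # smallest normal value
        "epsilon": 2.0 ** -man_bits,                   # gap between 1.0 and the next value
    }


bf16 = ieee_like_stats(exp_bits=8, man_bits=7)
fp16 = ieee_like_stats(exp_bits=5, man_bits=10)

for name, s in (("BF16", bf16), ("FP16", fp16)):
    print(f"{name}: max={s['max']:.4g} "
          f"min_normal={s['min_normal']:.3g} eps={s['epsilon']:.3g}")
```

The epsilon values (2⁻⁷ for BF16, 2⁻¹⁰ for FP16) are where the table's ~10⁻² and ~10⁻³ precision figures come from.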

Decision rule

Hardware	Workload	Recommended
H100, B200, B100	LLM pretrain	BF16
A100	LLM pretrain	BF16
V100 (Volta)	Anything	FP16 (no BF16 support)
T4 (Turing)	Inference	FP16
RTX 4090/5090	LLM fine-tune	BF16
Apple M-series	LLM inference	FP16 or MLX FP16

Usage at frontier labs

Meta Llama 3, 4: BF16
OpenAI GPT-4, GPT-5: BF16 (rumored)
Anthropic Claude: BF16 (not publicly confirmed)
DeepSeek-V3: FP8 (new)
Mistral, Qwen: BF16

The modern standard is BF16, with FP8 in niche large-scale cases.

5. FP8 Native Training: DeepSeek-V3 Case Study

Hopper (H100, 2022) and Blackwell (B100/B200, 2025) added native FP8 support.

FP8 formats

E4M3: 4 exponent + 3 mantissa bits, max 448. Used for the forward pass.
E5M2: 5 exponent + 2 mantissa bits, max 57,344. Used for the backward pass (gradients).

H100 FP8 throughput is 2x BF16, about 2 PFLOPS.

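DeepSeek-V3 strategy (December 2024)

DeepSeek-V3 (671B parameters, MoE) was pretrained natively in FP8. Details from the paper:

Fine-grained FP8 quantization: a separate scale per small group of elements (1x128 tiles for activations, 128x128 blocks for weights)
Mixed FP8/BF16: precision-critical operations (embedding, output head, MoE gating, normalization, attention) stay in BF16 or FP32; the remaining GEMMs run in FP8
Custom FP8 GEMM kernels: built in-house for this fine-grained scheme rather than taken from a stock NVIDIA Transformer Engine recipe
Online scaling: each scale is computed from the current tile or block at every training step, with no delayed history
Higher-precision accumulation: partial GEMM sums are promoted to FP32 at intervals to compensate for FP8's narrow range and precision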
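The idea behind per-block scaling can be sketched in a few lines of plain Python. This is a simplified illustration and not DeepSeek's kernel code: `block_scales` is a hypothetical helper, the input is a flat list split into 128-element blocks, and each scale is chosen so that the block's largest magnitude maps to the E4M3 maximum.

```python
E4M3_MAX = 448.0  # largest finite E4M3 value


def block_scales(values, block_size=128, fmt_max=E4M3_MAX):
    """Return one scale per block so that each block's max |x| maps to fmt_max."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scales.append(fmt_max / amax if amax > 0 else 1.0)
    return scales


# Two blocks of small activations, with a single large outlier in the second block.
activations = [0.01] * 256
activations[200] = 50.0

per_block = block_scales(activations)
per_tensor = block_scales(activations, block_size=len(activations))
print("per-block scales:", per_block)
print("per-tensor scale:", per_tensor)
```

With a single per-tensor scale, the outlier forces a small scale on every element, pushing the 0.01 values toward the bottom of FP8's range. With per-block scales, only the block that contains the outlier pays that price.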

Practical usage

The NVIDIA Transformer Engine library:

import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,  # E4M3 forward, E5M2 backward
    amax_history_len=16,
    amax_compute_algo="max",
)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = te_model(inputs)

Adoption status

DeepSeek-V3: native FP8 (production)
Llama 4 (rumored): partly FP8
Meta MI300X experiments: FP8 trials
Most labs: in the evaluation stage as of late 2025 / early 2026

Why hasn't everyone switched?

Numerical stability edge cases
Tooling immaturity: DeepSpeed/FSDP integration is new
Hardware: no use case without H100 or newer
Quality risk: most projects cannot tolerate a quality loss of 5% or more

The view from Türkiye

H100 clusters are still rare in Türkiye. FP8 adoption there is niche and not at Meta/DeepSeek scale, but it should reach the frontier by around 2027.

Module 17 (Distributed Training) and Module 32 (Quantization) cover this in detail.

6. Master Weights: FP32 Backup Pattern

In mixed precision, master weights are kept in FP32.

Why?

A typical weight update is tiny:

W_new = W_old - lr × grad
       = W_old - 1e-4 × 1e-3
       = W_old - 1e-7

BF16 relative precision is ~10⁻². For a weight of magnitude 0.1, adjacent BF16 values are about 5e-4 apart, so a 1e-7 update is rounded away and lost.

FP32 master weights

The optimizer holds an FP32 copy of the weights:

forward: bf16(W_master) → compute
backward: produce grad
optimizer: W_master_fp32 -= lr × grad_fp32

Then, at every step, a fresh BF16 copy is made with

W_bf16 = W_master.to(bfloat16)
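The lost-update effect is easy to reproduce with an emulation. The sketch below rounds to BF16 and FP32 using `struct`; the helpers `to_bf16` and `to_fp32`, the starting weight 0.1, and the constant gradient are assumptions made for illustration.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest BF16 value, emulated as the top 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped bits
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


lr, grad, steps = 1e-4, 1e-3, 1000   # each update is lr * grad = 1e-7

start_bf16 = to_bf16(0.1)
w_bf16_only = start_bf16             # weight stored and updated directly in BF16
w_master = to_fp32(0.1)              # FP32 master copy

for _ in range(steps):
    w_bf16_only = to_bf16(w_bf16_only - lr * grad)
    w_master = to_fp32(w_master - lr * grad)

w_forward = to_bf16(w_master)        # fresh BF16 copy for the next forward pass
print(f"BF16-only weight: {w_bf16_only!r} (started at {start_bf16!r})")
print(f"FP32 master:      {w_master!r}")
print(f"BF16 copy of master: {w_forward!r}")
```

The BF16-only weight never moves, because every 1e-7 update is rounded away. The FP32 master accumulates the updates, and the BF16 copy used in the forward pass changes once the accumulated change crosses a BF16 rounding boundary.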

Memory cost

Weight: BF16 (2 bytes) + FP32 master (4 bytes) = 6 bytes/param
70B model: 70B × 6 = 420GB for weights (BF16 alone is 140GB)

That is an extra 280GB. In practice the master copy usually lives alongside the optimizer state, which is already kept in FP32 (the Adam moments).

Optimization

8-bit optimizer (bitsandbytes): the optimizer states are stored in 8 bits with block-wise quantization, while weights and gradients keep their usual precision.

import bitsandbytes as bnb
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4)

This cuts Adam state memory by 75% (2 bytes per parameter instead of 8), which is about 38% of the total in the Section 9 accounting. It is commonly combined with Module 21 (PEFT).

7. Loss Spike Investigation

Loss spikes are common in mixed precision training. An investigation playbook:

Symptoms

The loss curve is smooth, then SUDDENLY jumps 10x
Or: the loss becomes NaN
Or: the gradient norm explodes

Investigation steps

1. Gradient norm log

total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.norm().item() ** 2
total_norm = total_norm ** 0.5
wandb.log({"grad_norm": total_norm})

If grad_norm rises dramatically before the spike → exploding gradient.

2. Per-layer activation stats

def hook(module, input, output):
    print(f"{module.__class__.__name__}: "
          f"out_mean={output.mean():.4f} out_std={output.std():.4f} "
          f"out_max={output.abs().max():.4f}")

for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, nn.LayerNorm)):
        module.register_forward_hook(hook)

The layer where the activation pattern breaks down is where the problem is.

3. Is mixed precision the cause?

# Repeat the same run in FP32 (torch.amp.autocast requires device_type)
with autocast(device_type="cuda", enabled=False):
    out = model(x)
    loss = criterion(out, y)

If the spike does not appear in FP32 → it is mixed precision related.

4. Bad batch?

Save the batch that preceded the spike and inspect it separately. Sometimes it is a corrupted document or an outlier.

5. Learning rate?

The LR may be too high. Halve it and test.

Modern spike remedies

Gradient clipping: max_norm=1.0 is already standard
Warm-up extension: lengthen the linear warmup
Lower β₂: AdamW β₂=0.95 → 0.9 (Module 1.8)
Skip bad batch: drop the batch seen at the spike
Switch to BF16 (if you are on FP16)
Selective FP32: force some layers to FP32

Module 17 expands on this in the context of distributed training.

8. Gradient Norm Monitoring

A must-have in production training.

What to track

metrics = {}

# Total gradient norm (clip_grad_norm_ returns the norm measured before clipping)
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
metrics["grad_norm"] = total_norm.item()

lr = optimizer.param_groups[0]["lr"]  # read the current LR instead of hard-coding 3e-4
grads = [(name, p) for name, p in model.named_parameters() if p.grad is not None]

# Per-layer gradient norm and update-to-weight ratio
for name, p in grads:
    grad_norm = p.grad.norm().item()
    metrics[f"grad/{name}"] = grad_norm
    # lr * grad is the plain SGD step; it only approximates an AdamW update
    metrics[f"update_ratio/{name}"] = lr * grad_norm / (p.norm().item() + 1e-8)

# Gradient distribution stats
metrics["grad_max"] = max(p.grad.abs().max().item() for _, p in grads)
metrics["grad_min"] = min(p.grad.abs().min().item() for _, p in grads)

# A single wandb.log call keeps every metric on the same step
wandb.log(metrics)

Healthy ranges

Total grad norm: 0.1 - 5.0 (after warmup)
Update-to-weight ratio: ~1e-3 to 1e-2
Per-layer variation: highest ratio / lowest ratio < 100x

Anomaly thresholds

grad_norm > 100: a spike is likely soon
grad_norm > 1000: a spike is in progress
NaN grad: skip the whole step and call optimizer.zero_grad()
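These thresholds translate into a small helper that a training loop can call after computing the total gradient norm. The function name, the labels, and the treatment of inf the same way as NaN are assumptions for illustration.

```python
import math


def classify_grad_norm(grad_norm: float,
                       warn_threshold: float = 100.0,
                       spike_threshold: float = 1000.0) -> str:
    """Map a total gradient norm to an action label using the thresholds above."""
    if math.isnan(grad_norm) or math.isinf(grad_norm):
        return "skip_step"       # non-finite gradients: skip the step, zero the grads
    if grad_norm > spike_threshold:
        return "spike_active"
    if grad_norm > warn_threshold:
        return "spike_risk"
    return "ok"


for value in (0.5, 250.0, 5000.0, float("nan")):
    print(value, "->", classify_grad_norm(value))
```

A loop could log the label on every step and skip the optimizer update whenever it receives `"skip_step"`.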

Modern recommendation

Automatic alerting in production: grad_norm > 50 → email/Slack → human review.

Module 18 (Mini-LLM Pretraining Workshop) shows this in practice.

9. Optimizer States Precision

Adam/AdamW keeps 2 states per parameter (1st moment, 2nd moment).

Memory breakdown (70B model)

Component	Precision	Memory
Weights	BF16	140GB
Master weights	FP32	280GB
Gradients	BF16	140GB
Adam m	FP32	280GB
Adam v	FP32	280GB
Total		1.12 TB

Plus activations + KV cache + overhead = ~1.5-2 TB total.
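The table's arithmetic can be reproduced for any model size with a short helper. The function name and the `adam_state_bytes` parameter are hypothetical; 1 GB is taken as 1e9 bytes, and activations and other overhead are left out.

```python
def training_memory_gb(n_params: int, adam_state_bytes: int = 4) -> dict:
    """Per-component memory in GB (1 GB = 1e9 bytes) for mixed precision AdamW."""
    bytes_per_param = {
        "weights_bf16": 2,
        "master_fp32": 4,
        "grads_bf16": 2,
        "adam_m": adam_state_bytes,
        "adam_v": adam_state_bytes,
    }
    breakdown = {k: n_params * b / 1e9 for k, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


fp32_states = training_memory_gb(70_000_000_000)
int8_states = training_memory_gb(70_000_000_000, adam_state_bytes=1)
saving = 1 - int8_states["total"] / fp32_states["total"]

print(f"FP32 Adam states:  {fp32_states['total']:.0f} GB")
print(f"8-bit Adam states: {int8_states['total']:.0f} GB ({saving:.1%} saved)")
```

Setting `adam_state_bytes=1` models optimizer states stored in 8 bits, which is where the roughly 38% total saving comes from.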

Optimization

AdamW + 8-bit optimizer

optimizer = bitsandbytes.optim.AdamW8bit(...)
# Adam m/v are stored in 8 bits (1 byte per parameter each)
# Memory: 140GB (weight) + 280GB (master) + 140GB (grad) + 70GB + 70GB = 700GB
# 38% savings

Lion (single-state alternative to Adam)

The Lion optimizer (Module 1.8) uses a single state: only m, no v. That saves 25% of memory.

Shampoo / K-FAC (advanced)

Block-diagonal Hessian approximation. Each step is more expensive, but fewer steps are needed.

In practice

For LLM pretraining:

Memory tight → Lion or AdamW8bit
Stability priority → AdamW (classic)
Production proven → AdamW (Llama 3, Qwen, and Mistral all use it)

Module 17 (Distributed Training) covers this in more detail with ZeRO.

10. Production Mixed Precision Checklist

Before launching an LLM training run:

Pre-flight

Hardware compatibility: A100/H100/B200 → BF16 OK, V100 → FP16 + GradScaler
PyTorch version: 2.2+ recommended (2.4+ for FP8 support)
NCCL version: latest (mixed precision NCCL allreduce)
CUDA version: 12.0+ for H100
DeepSpeed/FSDP config: bf16=True is set
Optimizer states: has an 8-bit alternative been evaluated?
Master weights: confirm FP32
Loss function: confirm FP32 cast
TF32 enable: torch.set_float32_matmul_precision("high")

Training-time monitoring

Gradient norm logging (every step or every N steps)
Loss curve plot (real-time)
Per-layer activation stats (every 1000 steps)
NaN detection: torch.isnan(loss) every step
Checkpoint frequency: frequent enough to recover after a spike
Alert system: grad_norm > 50 → email/Slack
GPU memory peak: track (OOM avoidance)

Spike response

Skip bad batch mechanism
Auto-resume from last good checkpoint
Halve the LR after 3+ spikes within N steps
Switch to FP32 flag (last resort)
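The learning-rate rule in the spike response list can be sketched as a small stateful policy. The class below is one possible reading of "3+ spikes within N steps"; the name `SpikeLrPolicy`, the window of 1000 steps, and resetting the counter after each cut are assumptions for illustration.

```python
from collections import deque


class SpikeLrPolicy:
    """Halve the learning rate once `max_spikes` spikes fall within `window` steps."""

    def __init__(self, lr: float, max_spikes: int = 3, window: int = 1000):
        self.lr = lr
        self.max_spikes = max_spikes
        self.window = window
        self._spike_steps = deque()

    def observe(self, step: int, is_spike: bool) -> float:
        """Record one training step and return the learning rate to use next."""
        if is_spike:
            self._spike_steps.append(step)
        # Forget spikes that are older than the window.
        while self._spike_steps and step - self._spike_steps[0] >= self.window:
            self._spike_steps.popleft()
        if len(self._spike_steps) >= self.max_spikes:
            self.lr *= 0.5
            self._spike_steps.clear()   # start counting again after the cut
        return self.lr


policy = SpikeLrPolicy(lr=3e-4, max_spikes=3, window=1000)
events = [(100, True), (400, True), (2000, True), (2100, True), (2200, True)]
for step, spike in events:
    lr = policy.observe(step, spike)
print(lr)
```

In the demo, the two early spikes expire before the third arrives, so the rate is halved only once, after the three spikes clustered around step 2000.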

11. Mini Exercises

BF16 vs FP16 decision: a cluster of V100 GPUs, fine-tuning a 7B Llama. Which one?
GradScaler skip: if scaler.step() was skipped N times during a training run, what does that mean?
Memory calculation: 13B BF16 + AdamW FP32 + BF16 grad. Total GB?
FP8 production: a company in Türkiye will fine-tune Llama 4. Would you recommend FP8?
Spike debug: at step 5000, grad_norm goes from 0.5 → 250. In what order do you investigate?

What Did We Learn in This Lesson?

✓ Why mixed precision: 15x faster, 2x memory savings
✓ autocast region semantics: automatic per-op downcast
✓ GradScaler dynamics: FP16 underflow protection
✓ BF16 vs FP16 production decision matrix
✓ FP8 native training: DeepSeek-V3 case study + Transformer Engine
✓ Master weights FP32 pattern + the 8-bit optimizer alternative
✓ Loss spike investigation: 5-step playbook
✓ Gradient norm monitoring: a production must-have
✓ Optimizer state precision: 1.5TB total memory for a 70B model
✓ Production checklist: pre-flight + monitoring + response

Next Lesson

5.3: Memory Profiling: torch.profiler, Nsight Systems, and OOM Debugging. How does a 70B model fit on an 80GB GPU? When does OOM happen? Which tensor uses the most memory? Production profiling tools.
