Debug Arsenal: register_hook, Anomaly Mode, torch.utils.benchmark — Production Debugging Toolkit
Toolkit for when things break in production PyTorch: forward/backward hooks, anomaly detection mode, deterministic training, torch.utils.benchmark precise timing, repro patterns, systematic NaN hunting, gradient inspection, model debugging strategies.
Şükrü Yusuf KAYA
55 min read
Intermediate · 🔧 "It works on my machine" is not professional
In LLM engineering, things break. Loss spikes, NaN gradients, accuracy drops, OOM. What separates a senior engineer from a junior one: systematic debugging. This lesson is a production-grade debug toolkit. 55 minutes from now you will be able to inspect anything with hooks, track down NaN sources with anomaly mode, reproduce bugs with deterministic training, and benchmark precisely — all hands-on.
Lesson Map#
- Forward hooks — intercepting module outputs
- Backward hooks — inspecting gradient flow
- Tensor hooks — per-tensor gradients
- Anomaly detection mode
- Deterministic training — reproducibility
- torch.utils.benchmark — precise timing
- Systematic NaN hunting
- Gradient inspection patterns
- Repro patterns — bug reproduction
- Production debug workflow
1. Forward Hooks — Intercepting Module Outputs#
`Module.register_forward_hook` runs a custom function on every forward pass:

```python
def forward_hook(module, input, output):
    # input: tuple of tensors entering the module
    # output: the module's output
    print(f"{module.__class__.__name__}: out shape={output.shape}, mean={output.mean():.4f}")

# Attach the hook to every Linear layer
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_forward_hook(forward_hook)
```
Use cases#
1. Activation statistics
```python
activations = {}

def save_activation(name):
    def hook(module, input, output):
        activations[name] = {
            "mean": output.mean().item(),
            "std": output.std().item(),
            "max": output.abs().max().item(),
        }
    return hook

for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.LayerNorm)):
        module.register_forward_hook(save_activation(name))

# Run a forward pass
out = model(x)

# Inspect
for name, stats in activations.items():
    print(f"{name}: {stats}")
```
Which layer has an activation explosion? The hook shows you directly.
2. Feature extraction
```python
features = {}

def get_features(name):
    def hook(module, input, output):
        features[name] = output.detach()
    return hook

model.encoder.layer_5.register_forward_hook(get_features("layer_5"))
out = model(x)
intermediate = features["layer_5"]
```
Patterns like ELMo-style feature extraction and attention visualization.
3. NaN/Inf detection
```python
def nan_check_hook(module, input, output):
    if torch.isnan(output).any() or torch.isinf(output).any():
        print(f"⚠️ NaN/Inf in {module.__class__.__name__}!")
        print(f"  Input ranges: {[i.abs().max().item() for i in input if isinstance(i, torch.Tensor)]}")
```
Pinpoints the first layer where NaN appears.
Removing a hook#
```python
handle = module.register_forward_hook(my_hook)
# ... use ...
handle.remove()  # cleanup
```
Forget this and you leak memory.
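One way to make cleanup automatic is a small context manager — a sketch; `temporary_hook` is our own helper, not a PyTorch API:

```python
import torch
from contextlib import contextmanager

@contextmanager
def temporary_hook(module, hook):
    """Register a forward hook and guarantee its removal on exit."""
    handle = module.register_forward_hook(hook)
    try:
        yield handle
    finally:
        handle.remove()

# Usage: the hook only fires inside the with-block
layer = torch.nn.Linear(4, 4)
seen = []
with temporary_hook(layer, lambda m, i, o: seen.append(o.shape)):
    layer(torch.randn(2, 4))
layer(torch.randn(2, 4))  # no hook here
print(len(seen))  # 1
```

The `finally` block means the hook is removed even if the forward pass raises.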
2. Backward Hooks — Inspecting Gradient Flow#
`Module.register_full_backward_hook` fires during the backward pass:

```python
def backward_hook(module, grad_input, grad_output):
    # grad_input: gradients w.r.t. the module's inputs
    # grad_output: gradients w.r.t. the module's outputs
    in_norm = f"{grad_input[0].norm():.4f}" if grad_input[0] is not None else "None"
    print(f"{module.__class__.__name__}: "
          f"grad_out_norm={grad_output[0].norm():.4f}, "
          f"grad_in_norm={in_norm}")

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_full_backward_hook(backward_hook)
```
Use cases#
1. Vanishing/Exploding gradient detection
```python
grad_stats = {}

def check_gradient(name):
    def hook(module, grad_input, grad_output):
        grad_stats[name] = {
            "out_norm": grad_output[0].norm().item(),
            "in_norm": grad_input[0].norm().item() if grad_input[0] is not None else None,
        }
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_full_backward_hook(check_gradient(name))

loss.backward()

# Layer-by-layer gradient norms
for name, stats in grad_stats.items():
    print(f"{name}: out_norm={stats['out_norm']:.6f}")
```
Very small gradients (~1e-10) → vanishing. Very large (~1e10) → exploding.
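Building on the `grad_stats` dict collected above, a quick classifier — the thresholds are illustrative assumptions, tune them for your model:

```python
# Illustrative thresholds — assumptions, not universal constants
VANISH_THRESHOLD = 1e-7
EXPLODE_THRESHOLD = 1e2

def flag_gradient_anomalies(grad_stats):
    """Classify each layer's output-gradient norm as vanishing/exploding/ok."""
    flags = {}
    for name, stats in grad_stats.items():
        norm = stats["out_norm"]
        if norm < VANISH_THRESHOLD:
            flags[name] = "vanishing"
        elif norm > EXPLODE_THRESHOLD:
            flags[name] = "exploding"
        else:
            flags[name] = "ok"
    return flags

flags = flag_gradient_anomalies({
    "layer1": {"out_norm": 3e-9},
    "layer2": {"out_norm": 0.02},
})
print(flags)  # {'layer1': 'vanishing', 'layer2': 'ok'}
```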
2. Gradient sanity check
```python
def assert_finite_grad(module, grad_input, grad_output):
    for g in grad_input + grad_output:
        if g is not None and not torch.isfinite(g).all():
            raise RuntimeError(f"NaN/Inf gradient in {module.__class__.__name__}")
```
Catches NaN early in production training.
Old API (deprecated)#
`register_backward_hook` is deprecated — it could report incorrect gradients for some module graphs. Use `register_full_backward_hook` instead.
3. Tensor Hooks — Per-Tensor Gradients#
Gradient inspection at the tensor level rather than the module level.
```python
x = torch.randn(10, requires_grad=True)
y = x * 2
z = y.sum()

# Modify x's gradient as it is computed
def x_grad_hook(grad):
    print(f"x.grad: {grad}")
    return grad.clamp(-1, 1)  # per-tensor gradient clipping

x.register_hook(x_grad_hook)
z.backward()  # prints x.grad: tensor([2., 2., ..., 2.])
# After the hook, x.grad is the clamped tensor([1., 1., ..., 1.])
```
Practical uses#
1. Specific parameter monitoring
```python
# Track the gradient of one specific weight inside a large layer
attention_qkv_weight = model.transformer.h[0].attn.c_attn.weight
attention_qkv_weight.register_hook(lambda g: print(f"QKV grad norm: {g.norm():.4f}"))
```
2. Gradient debugging
```python
def trace_gradient(name):
    def hook(grad):
        if torch.isnan(grad).any():
            print(f"⚠️ NaN in {name}.grad")
            # Optionally: zero out the bad gradient
            return torch.zeros_like(grad)
        return grad
    return hook

for name, param in model.named_parameters():
    param.register_hook(trace_gradient(name))
```
Hook return values#
- `None`: the original gradient is used
- a `Tensor`: replaces the gradient

Return a value when you want to modify the gradient.
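A tiny demonstration of both return conventions:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = torch.ones(3, requires_grad=True)

# Returning None: the gradient passes through unchanged
x.register_hook(lambda g: None)
# Returning a tensor: replaces the gradient
y.register_hook(lambda g: g * 10)

(x.sum() + y.sum()).backward()
print(x.grad)  # tensor([1., 1., 1.])
print(y.grad)  # tensor([10., 10., 10.])
```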
4. Anomaly Detection Mode#
PyTorch's built-in NaN debugger:
```python
import torch.autograd

torch.autograd.set_detect_anomaly(True)

try:
    out = model(x)
    loss = criterion(out, target)
    loss.backward()
except RuntimeError as e:
    print(f"Anomaly detected: {e}")
    # A detailed stack trace of the offending operation is shown
```
What does it detect?#
- NaN gradients — and the operation that produced them
- Inf gradients
- Modified leaf tensor (autograd issue)
- In-place operation problems
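A minimal trigger for anomaly mode — the forward pass is finite, but the backward produces NaN (the exact operation name in the message depends on your PyTorch version):

```python
import torch

err = None
with torch.autograd.detect_anomaly():
    x = torch.tensor([0.0], requires_grad=True)
    y = x * torch.sqrt(x)  # forward is finite: 0 * 0 = 0
    try:
        # sqrt backward computes grad / (2*sqrt(x)) = 0/0 → NaN; anomaly mode raises
        y.backward()
    except RuntimeError as e:
        err = e
        print(f"Anomaly caught: {e}")
```

Without anomaly mode the NaN would silently land in `x.grad`; with it, the error names the backward function that produced it.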
Performance overhead#
~5-10x slower — use it for debugging; keep it off in production.
Context manager#
```python
with torch.autograd.detect_anomaly():
    loss = compute_loss(x, model)
    loss.backward()
```
Debugging scoped to this region only, minimizing the performance impact.
Typical workflow#
- Production training: you observe a NaN
- Build a small repro (deterministic seed)
- Enable anomaly mode + run the repro
- Detailed stack trace → which operation
- Fix the root cause, turn anomaly mode off
- If NaN still occurs in production: investigate dataset-specific edge cases
5. Deterministic Training — Reproducibility#
Answering "why did this bug happen?" in LLM training requires bit-exact reproducibility.
Setting random seeds#
```python
import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```
Deterministic algorithms#
```python
# Force the deterministic versions of CUDA ops where available
torch.use_deterministic_algorithms(True)
```
Some kernels are non-deterministic by default (for performance). Forcing determinism costs roughly a 10-30% slowdown.
CUDA-specific#
```bash
export CUBLAS_WORKSPACE_CONFIG=:4096:8  # required for cuBLAS determinism
```
PyTorch raises an error if this variable is not set while deterministic algorithms are enabled.
Determinism trade-off#
| Mode | Speed | Reproducibility |
|---|---|---|
| Default | Fast | Same seed → similar but not bit-exact |
| Deterministic | 10-30% slower | Bit-exact |
Practical usage#
- Development: deterministic mode (for debugging)
- Production: default mode (for speed)
- Reproducing a bug: deterministic + same seed
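The pieces above combine into one helper — a sketch; `warn_only=True` is our assumption so that ops without a deterministic kernel warn instead of crashing:

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 42):
    """Best-effort full-determinism setup: seeds + algorithms + CUDA config."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Must be set before cuBLAS is first used
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # warn_only=True: log instead of raising for ops with no deterministic kernel
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False

make_deterministic(123)
a = torch.randn(3)
make_deterministic(123)
b = torch.randn(3)
print(torch.equal(a, b))  # True — same seed, same draw
```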
Distributed determinism#
On multi-GPU setups the gradient-reduce order is non-deterministic, so bit-exact runs are hard. The recipe: `NCCL_BLOCKING_WAIT=1` + deterministic mode + a pinned comm config. Frontier labs do exactly this for critical debugging.
6. torch.utils.benchmark — Precise Timing#
`time.perf_counter()` alone is misleading for GPU code; `torch.utils.benchmark` handles warmup, repetitions, and CUDA sync:

```python
from torch.utils.benchmark import Timer

t = Timer(
    stmt="model(x)",
    globals={"model": model, "x": x},
    num_threads=1,
).timeit(100)

print(f"Mean: {t.mean*1000:.3f} ms")
print(f"Median: {t.median*1000:.3f} ms")
print(f"IQR: {t.iqr*1000:.3f} ms")  # Measurement.iqr is the interquartile range in seconds
```
Compare benchmarks#
```python
from torch.utils.benchmark import Compare, Timer

results = []
for size in [128, 256, 512, 1024]:
    for impl in ["eager", "compiled"]:
        x = torch.randn(32, size, device="cuda")
        m = compiled_model if impl == "compiled" else model
        results.append(
            Timer(
                stmt="m(x)",
                globals={"m": m, "x": x},
                label="forward",
                description=impl,
                sub_label=f"size={size}",
            ).blocked_autorange()
        )

compare = Compare(results)
compare.print()
```
A side-by-side comparison table — eager vs compiled, across sizes.
Auto-thread tuning#
```python
t = Timer(stmt="...", num_threads=torch.get_num_threads()).blocked_autorange()
```
`blocked_autorange()` automatically chooses how many runs to measure.
CUDA-aware#
torch.utils.benchmark handles CUDA synchronization automatically — accurate GPU timing.
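Why manual timing misleads: on GPU, `time.perf_counter()` around an async kernel launch measures the launch, not the execution, unless you synchronize. A sketch that falls back to CPU when no GPU is present:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

# Naive: on GPU this mostly measures the kernel *launch*
t0 = time.perf_counter()
model(x)
naive_ms = (time.perf_counter() - t0) * 1000

# Correct manual timing: synchronize before reading the clock
if device == "cuda":
    torch.cuda.synchronize()
t0 = time.perf_counter()
model(x)
if device == "cuda":
    torch.cuda.synchronize()
sync_ms = (time.perf_counter() - t0) * 1000

print(f"naive={naive_ms:.3f} ms, synchronized={sync_ms:.3f} ms")
```

`torch.utils.benchmark` does this synchronization (plus warmup and repetition) for you.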
Practical patterns#
- Optimization regressions: benchmark before/after a PR
- Hardware comparison: H100 vs A100 same model
- Implementation comparison: PyTorch native vs custom Triton
- Mixed precision: BF16 vs FP32 vs FP8
7. Systematic NaN Hunting#
Everything above, combined into one reusable tool:

```python
# Production NaN hunter — comprehensive
import torch

class NaNHunter:
    def __init__(self, model):
        self.model = model
        self.hooks = []
        self.first_nan_module = None

    def install(self):
        def make_hook(name):
            def hook(module, input, output):
                if self.first_nan_module is not None:
                    return  # already found
                if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                    self.first_nan_module = name
                    print(f"⚠️ NaN/Inf FIRST appears at: {name}")
                    print(f"  Module: {module.__class__.__name__}")
                    for i, inp in enumerate(input):
                        if isinstance(inp, torch.Tensor):
                            print(f"  Input {i}: shape={inp.shape}, "
                                  f"range=[{inp.abs().min():.2e}, {inp.abs().max():.2e}], "
                                  f"any_nan={torch.isnan(inp).any()}")
                    print(f"  Output range: [{output.abs().min():.2e}, {output.abs().max():.2e}]")
            return hook

        for name, module in self.model.named_modules():
            if len(list(module.children())) == 0:  # leaf modules only
                handle = module.register_forward_hook(make_hook(name))
                self.hooks.append(handle)

    def cleanup(self):
        for h in self.hooks:
            h.remove()
        self.hooks = []

# Usage
hunter = NaNHunter(model)
hunter.install()
try:
    out = model(x)
    loss = criterion(out, target)
    loss.backward()
finally:
    hunter.cleanup()
```

A production-grade NaN hunting toolkit.
8. Gradient Inspection Patterns#
What you need to know about gradients in production training:
Per-layer gradient norm#
```python
def log_gradient_norms(model, step):
    total_norm = 0.0
    layer_norms = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            n = p.grad.norm().item()
            layer_norms[f"grad/{name}"] = n
            total_norm += n ** 2
    layer_norms["grad/total"] = total_norm ** 0.5
    # wandb or tensorboard
    wandb.log({**layer_norms, "step": step})
```
Healthy ranges (Llama-style)#
- First layer (embedding): 0.01 - 0.5
- Middle layers (transformer): 0.001 - 0.1
- Last layer (head): 0.01 - 1.0
- Total: 0.1 - 5.0
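The ranges above can be turned into an automated check — a sketch; the bounds are the illustrative Llama-style numbers from the list, not universal constants:

```python
import torch

# Illustrative bound from the list above — adjust for your architecture
TOTAL_NORM_RANGE = (0.1, 5.0)

def check_gradient_health(model):
    """Return (total_norm, warnings) after a backward pass."""
    warnings = []
    total_sq = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        n = p.grad.norm().item()
        total_sq += n ** 2
        if n == 0.0:
            warnings.append(f"{name}: zero gradient (frozen or vanishing?)")
    total = total_sq ** 0.5
    lo, hi = TOTAL_NORM_RANGE
    if not (lo <= total <= hi):
        warnings.append(f"total grad norm {total:.4f} outside [{lo}, {hi}]")
    return total, warnings

model = torch.nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()
total, warns = check_gradient_health(model)
print(f"total={total:.4f}, warnings={warns}")
```

Call it right after `loss.backward()` and alert on any warnings.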
Anomaly patterns#
| Pattern | Likely cause |
|---|---|
| Total norm > 100 | Exploding gradient — clip more |
| Embedding norm >> others | Tokenization issue, OOV tokens |
| Late layers >> early | Vanishing gradient (rare modern arch) |
| One layer abnormally high | Specific issue in that layer |
| All ~0 | Frozen model (intended?) or vanishing |
Gradient histogram#
```python
for name, p in model.named_parameters():
    if p.grad is not None:
        grads = p.grad.detach().flatten().cpu()
        counts, edges = torch.histogram(grads, bins=20)  # CPU-only op
        wandb.log({f"grad_hist/{name}": wandb.Histogram(
            np_histogram=(counts.numpy(), edges.numpy())
        )})
```
The distribution's shape matters: a heavy tail signals a potential issue.
Update-to-weight ratio#
```python
for name, p in model.named_parameters():
    if p.grad is not None:
        update_norm = (lr * p.grad).norm().item()
        weight_norm = p.norm().item()
        ratio = update_norm / (weight_norm + 1e-8)
        wandb.log({f"ratio/{name}": ratio})
```
Healthy: 1e-3 to 1e-2. Too high (>1) → lr too large. Too low (<1e-5) → model has converged, or gradients are vanishing.
9. Repro Patterns — Bug Reproduction#
Being able to reproduce a production bug locally is the foundation of debugging.
Minimum repro recipe#
```python
# repro.py
import torch

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.use_deterministic_algorithms(True)

# Simplified model (not the full model)
class MiniModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(100, 100)

    def forward(self, x):
        # Same operations as production
        return torch.softmax(self.linear(x), dim=-1)

model = MiniModel().cuda()
x = torch.randn(4, 100, device="cuda", dtype=torch.bfloat16)

# Run the same ops as production
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
print(out)  # expected: the NaN / bug reproduces here
```
Repro best practices#
- Smallest input: 16×16, not 4096×4096
- Single GPU: eliminate distributed first
- No fancy stuff: torch.compile and FSDP off — keep it plain
- Deterministic: seed + algorithms
- Save a snapshot: pickle the state just before the bug
- Git commit hash: record which version
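The "save a snapshot" step as code — a minimal sketch; the field names and file path are our own choice:

```python
import torch

def save_debug_snapshot(model, optimizer, batch, step, path="debug_snapshot.pt"):
    """Freeze everything needed to replay a suspicious step locally."""
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "batch": batch,                      # the exact offending input
        "rng_state": torch.get_rng_state(),  # replay the same randomness
        "torch_version": torch.__version__,
    }, path)

# Usage with a toy model
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
save_debug_snapshot(model, opt, torch.randn(3, 4), step=10000, path="/tmp/debug_snapshot.pt")
snap = torch.load("/tmp/debug_snapshot.pt", weights_only=False)
print(snap["step"])  # 10000
```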
Issue reporting#
If you are opening a GitHub issue, include:
- PyTorch version
- CUDA version
- GPU model
- OS
- Minimal repro script
- Expected vs actual
- Stack trace
PyTorch maintainers appreciate this format.
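PyTorch ships a helper that gathers most of this environment information for you:

```python
# torch.utils.collect_env produces the environment report the issue template asks for
from torch.utils.collect_env import get_pretty_env_info

report = get_pretty_env_info()
print(report)  # PyTorch/CUDA versions, GPU model, OS, relevant pip packages
```

The same report is available from the shell via `python -m torch.utils.collect_env`.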
10. Production Debug Workflow#
When things break in production, follow a systematic flow:
Step 1: Verify the problem#
- Is the failure real, or a flaky test?
- Is it one specific batch, or systematic?
- At which step did it start?
Step 2: Snapshot#
- When was the last good checkpoint?
- Identify the bad batch (log it)
- Memory snapshot (Module 5.3)
- Gradient stats up to that point
Step 3: Reproduce locally#
- Build a minimum example (the recipe above)
- Deterministic mode
- Single GPU
Step 4: Anomaly mode#
- Enable `torch.autograd.set_detect_anomaly(True)`
- Install the NaN hunter
- Identify the first failing module
Step 5: Root cause#
- Activation explosion?
- Gradient overflow?
- Mixed precision underflow?
- Bad input data?
- Code bug (recent change)?
Step 6: Fix + verify#
- Apply the fix
- Does the repro now pass?
- Rerun in production
Step 7: Prevention#
- Add a test (the specific repro case)
- Improve monitoring (catch anomalies earlier)
- Documentation (post-mortem)
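The "add a test" step can look like this — a pytest-style sketch; the all-zero batch is a hypothetical edge case standing in for whatever input triggered your bug:

```python
import torch

def test_no_nan_on_edge_case_batch():
    """Regression test: the batch that once produced NaN must stay finite."""
    torch.manual_seed(42)
    model = torch.nn.Linear(100, 100)
    x = torch.zeros(4, 100)  # hypothetical: the batch that triggered the bug
    out = torch.softmax(model(x), dim=-1)
    out.sum().backward()
    assert torch.isfinite(out).all(), "forward produced NaN/Inf"
    assert all(torch.isfinite(p.grad).all() for p in model.parameters())

test_no_nan_on_edge_case_batch()
print("regression test passed")
```

Run it in CI so the same failure can never ship silently again.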
Modern tooling#
- wandb/tensorboard: gradient stats real-time
- Langfuse: LLM-specific tracing (Module 48)
- Sentry: error tracking + alerting
- NCCL_DEBUG: distributed comm debugging
The LLM training engineer's daily arsenal.
11. Mini Exercises#
1. Hook practice: print the activation magnitudes of a 12-layer transformer. Which layer is largest?
2. NaN hunting: your model is producing NaN. What are the sequential debug steps?
3. Reproducibility: two runs with the same seed give different results. What are the possible causes?
4. Benchmarking: the model measures 5 ms with `time.perf_counter()` but 8 ms with `torch.utils.benchmark`. Why the difference?
5. Production scenario: loss goes NaN at step 10000. What do you do in the first 5 minutes?
What Did We Learn in This Lesson?#
✓ Forward hooks — intercept module outputs, activation stats
✓ Backward hooks — gradient flow inspection, vanishing/exploding detection
✓ Tensor hooks — modify individual parameter gradients
✓ Anomaly detection mode — built-in NaN debugger
✓ Deterministic training — reproducibility (seeds + algorithms)
✓ torch.utils.benchmark — precise CUDA-aware timing
✓ Systematic NaN hunting (the NaNHunter class pattern)
✓ Gradient inspection — per-layer norm, ratio, histogram
✓ Repro patterns — minimum example recipe
✓ Production debug workflow — a 7-step systematic process
Next Lesson#
5.8 — Production Engineering: Reproducibility, Determinism, CI/CD for ML
The final lesson of the PyTorch engineering module — production workflow patterns: ML CI/CD pipelines, integrating an eval harness into CI, model versioning (DVC, MLflow), canary deployment, rollback strategies, and versioning prompts + models + data.
Frequently Asked Questions
**Performance**: a 5-10x slowdown. Every operation gets a gradient-flow check and stack-trace caching. A 10-day production training run becomes 50-100 days — unacceptable. In practice: run production in normal mode and enable anomaly mode only **for debugging** (on a small repro). Modern PyTorch teams use the `detect_anomaly` context manager only around suspect regions.