
Debug Arsenal: register_hook, Anomaly Mode, torch.utils.benchmark — Production Debugging Toolkit

A toolkit for when things break in production PyTorch: forward/backward hooks, anomaly detection mode, deterministic training, precise timing with torch.utils.benchmark, repro patterns, systematic NaN hunting, gradient inspection, and model debugging strategies.

Şükrü Yusuf KAYA
55 min read
Intermediate
🔧 "It works on my machine" is not professional
In LLM engineering, things break. Loss spikes, NaN gradients, accuracy drops, OOM. What separates a senior engineer from a junior one is systematic debugging. This lesson is a production-grade debug toolkit. After 55 minutes you will be able to inspect anything with hooks, trace NaN sources with anomaly mode, reproduce bugs with deterministic training, and benchmark precisely — all hands-on.

Lesson Map#

  1. Forward hooks — intercepting module outputs
  2. Backward hooks — inspecting gradient flow
  3. Tensor hooks — individual tensor gradients
  4. Anomaly detection mode
  5. Deterministic training — reproducibility
  6. torch.utils.benchmark — precise timing
  7. Systematic NaN hunting
  8. Gradient inspection patterns
  9. Repro patterns — bug reproduction
  10. Production debug workflow

1. Forward Hooks — Intercepting Module Outputs#

PyTorch's Module.register_forward_hook runs a custom function on every forward pass.
```python
def forward_hook(module, input, output):
    # input: tuple of tensors entering the module
    # output: the module's output
    print(f"{module.__class__.__name__}: out shape={output.shape}, mean={output.mean():.4f}")

# Attach the hook to every Linear layer
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_forward_hook(forward_hook)
```

Use cases#

1. Activation statistics

```python
activations = {}

def save_activation(name):
    def hook(module, input, output):
        activations[name] = {
            "mean": output.mean().item(),
            "std": output.std().item(),
            "max": output.abs().max().item(),
        }
    return hook

for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.LayerNorm)):
        module.register_forward_hook(save_activation(name))

# Run a forward pass
out = model(x)

# Inspect
for name, stats in activations.items():
    print(f"{name}: {stats}")
```
Which layer has an activation explosion? With hooks you see it directly.

2. Feature extraction

```python
features = {}

def get_features(name):
    def hook(module, input, output):
        features[name] = output.detach()
    return hook

model.encoder.layer_5.register_forward_hook(get_features("layer_5"))
out = model(x)
intermediate = features["layer_5"]
```
Patterns like ELMo-style embeddings and attention visualization.

3. NaN/Inf detection

```python
def nan_check_hook(module, input, output):
    if torch.isnan(output).any() or torch.isinf(output).any():
        print(f"⚠️ NaN/Inf in {module.__class__.__name__}!")
        print(f"   Input ranges: {[i.abs().max().item() for i in input if isinstance(i, torch.Tensor)]}")
```
Pinpoints the first layer where NaN appears.

Removing hooks#

```python
handle = module.register_forward_hook(my_hook)
# ... use ...
handle.remove()  # cleanup
```
Forget to remove it and you have a memory leak (and stale side effects). A helper like the sketch below keeps cleanup automatic.
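A minimal sketch of a temporary-hook helper (not a PyTorch API; the name `temporary_hooks` and its arguments are illustrative) that guarantees handles are removed even if the forward pass throws:

```python
import torch
from contextlib import contextmanager

@contextmanager
def temporary_hooks(model, hook_fn, module_types=(torch.nn.Linear,)):
    # Install hook_fn on every matching module, remove all handles on exit
    handles = [
        m.register_forward_hook(hook_fn)
        for m in model.modules()
        if isinstance(m, module_types)
    ]
    try:
        yield handles
    finally:
        for h in handles:
            h.remove()

# Usage:
# with temporary_hooks(model, forward_hook):
#     model(x)   # hooks are active only inside the block
```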

2. Backward Hooks — Inspecting Gradient Flow#

Module.register_full_backward_hook — inspect gradient flow through a module.
```python
def backward_hook(module, grad_input, grad_output):
    # grad_input: gradients flowing into the module's inputs
    # grad_output: gradients of the module's outputs
    in_norm = grad_input[0].norm().item() if grad_input[0] is not None else None
    print(f"{module.__class__.__name__}: "
          f"grad_out_norm={grad_output[0].norm():.4f}, "
          f"grad_in_norm={in_norm}")

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_full_backward_hook(backward_hook)
```

Use cases#

1. Vanishing/Exploding gradient detection

```python
grad_stats = {}

def check_gradient(name):
    def hook(module, grad_input, grad_output):
        grad_stats[name] = {
            "out_norm": grad_output[0].norm().item(),
            "in_norm": grad_input[0].norm().item() if grad_input[0] is not None else None,
        }
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_full_backward_hook(check_gradient(name))

loss.backward()

# Layer-by-layer gradient norms
for name, stats in grad_stats.items():
    print(f"{name}: out_norm={stats['out_norm']:.6f}")
```
Very small gradients (~1e-10) → vanishing. Very large (~1e10) → exploding.

2. Gradient sanity check

```python
def assert_finite_grad(module, grad_input, grad_output):
    for g in grad_input + grad_output:
        if g is not None and not torch.isfinite(g).all():
            raise RuntimeError(f"NaN/Inf gradient in {module.__class__.__name__}")
```
Early NaN detection in production training.

Old API (deprecated)#

register_backward_hook (the non-full variant) is error-prone. On PyTorch 1.8+ use register_full_backward_hook instead.

3. Tensor Hooks — Individual Tensor Gradients#

Gradient inspection at the tensor level rather than the module level.
```python
x = torch.randn(10, requires_grad=True)
y = x * 2
z = y.sum()

# Modify x's gradient via a hook
def x_grad_hook(grad):
    print(f"x.grad: {grad}")      # prints the incoming gradient: all 2s
    return grad.clamp(-1, 1)      # gradient clipping (clamp returns a new tensor)

x.register_hook(x_grad_hook)
z.backward()
# x.grad: tensor([1., 1., ..., 1.]) — the 2s were clipped to 1
```

Practical usage#

1. Specific parameter monitoring

```python
# Track the gradient of a specific weight inside a very large layer
attention_qkv_weight = model.transformer.h[0].attn.c_attn.weight
attention_qkv_weight.register_hook(lambda g: print(f"QKV grad norm: {g.norm():.4f}"))
```

2. Gradient debugging

```python
def trace_gradient(name):
    def hook(grad):
        if torch.isnan(grad).any():
            print(f"⚠️ NaN in {name}.grad")
            # Optionally: set to zero
            return torch.zeros_like(grad)
        return grad
    return hook

for name, param in model.named_parameters():
    param.register_hook(trace_gradient(name))
```

Hook return value#

  • None: the original gradient is used
  • Tensor: the new gradient replaces the original
Returning a value is how you modify the gradient.
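A minimal sketch of both behaviors (the values in the comments assume this exact toy graph):

```python
import torch

w = torch.randn(3, requires_grad=True)
w.register_hook(lambda g: print(f"grad norm: {g.norm():.4f}"))  # print returns None -> gradient kept
w.register_hook(lambda g: g * 0.5)                              # returns a tensor -> gradient replaced

(w * 2).sum().backward()
print(w.grad)  # tensor([1., 1., 1.]) instead of the unhooked [2., 2., 2.]
```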

4. Anomaly Detection Mode#

PyTorch's built-in NaN debugger:
```python
import torch.autograd

torch.autograd.set_detect_anomaly(True)

try:
    out = model(x)
    loss = criterion(out, target)
    loss.backward()
except RuntimeError as e:
    print(f"Anomaly detected: {e}")
    # A detailed stack trace is shown
```

What does it detect?#

  • NaN gradients — and which operation produced them
  • Inf gradients
  • Modified leaf tensors (autograd issues)
  • In-place operation problems
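A minimal repro of the kind of failure it surfaces (a sketch; the exact error text can vary across PyTorch versions):

```python
import torch

torch.autograd.set_detect_anomaly(True)

x = torch.tensor([1.0, 0.0], requires_grad=True)
y = x / x            # 0/0 produces NaN in the forward pass
loss = y.sum()
loss.backward()      # RuntimeError: 'DivBackward0' returned nan values (with anomaly mode on)
# The accompanying warning prints the forward-pass stack trace of the offending division.
```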

Performance overhead#

~5-10x slower — it is a debugging tool; keep it off in production.

Context manager#

```python
with torch.autograd.detect_anomaly():
    loss = compute_loss(x, model)
    loss.backward()
```
Debugging only for this region; the performance impact stays contained.

Typical workflow#

  1. Production training: you observe a NaN
  2. Reproduce with a small example (deterministic seed)
  3. Turn on anomaly mode + run the repro
  4. Detailed stack trace → which operation
  5. Fix the root cause, turn anomaly mode off
  6. If production still hits NaN: investigate dataset-specific edge cases

5. Deterministic Training — Reproducibility#

Answering "why did this bug happen?" in LLM training requires bit-exact reproducibility.

Setting random seeds#

```python
import random

import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```

Deterministic algorithms#

```python
# Force deterministic versions of certain CUDA ops
torch.use_deterministic_algorithms(True)
```
Some kernels are non-deterministic by default (for performance). Forcing determinism costs roughly 10-30% in speed.

CUDA-specific#

```bash
export CUBLAS_WORKSPACE_CONFIG=:4096:8  # required for determinism
```
PyTorch raises an error if deterministic algorithms are enabled and this variable is not set.

Determinism trade-off#

| Mode | Speed | Reproducibility |
| --- | --- | --- |
| Default | Fast | Same seed → similar, but not bit-exact |
| Deterministic | 10-30% slower | Bit-exact |

Practical usage#

  • Development: deterministic mode (for debugging)
  • Production: default mode (for speed)
  • Reproducing a bug: deterministic mode + the same seed

Distributed determinism#

On multi-GPU runs the gradient reduce order is non-deterministic, so bit-exact reproduction is hard. The fix: NCCL_BLOCKING_WAIT=1 plus deterministic mode plus a specific configuration (see the sketch below). Frontier labs do this for critical debugging.
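A minimal sketch of how these pieces are typically combined for a deterministic repro run (env vars must be exported before the processes start, e.g. in the launch script; `set_seed` is the helper defined above; `make_run_deterministic` is an illustrative name, not a PyTorch API):

```python
import torch

def make_run_deterministic(seed: int = 42):
    # Assumes the launcher already exported:
    #   NCCL_BLOCKING_WAIT=1
    #   CUBLAS_WORKSPACE_CONFIG=:4096:8
    set_seed(seed)                            # helper from the snippet above
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False    # disable the autotuner's nondeterministic kernel picks
```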

6. torch.utils.benchmark — Precise Timing#

time.perf_counter() is unaware of GPU asynchrony, so it gives wrong measurements. torch.utils.benchmark gets it right.
```python
from torch.utils.benchmark import Timer

t = Timer(
    stmt="model(x)",
    globals={"model": model, "x": x},
    num_threads=1,
).timeit(100)

print(f"Mean: {t.mean*1000:.3f} ms")
print(f"Median: {t.median*1000:.3f} ms")
print(f"IQR: {t.iqr*1000:.3f} ms")
```

Compare benchmarks#

```python
from torch.utils.benchmark import Compare, Timer

results = []
for size in [128, 256, 512, 1024]:
    for impl in ["eager", "compiled"]:
        x = torch.randn(32, size, device="cuda")
        m = compiled_model if impl == "compiled" else model
        results.append(
            Timer(
                stmt="m(x)",
                globals={"m": m, "x": x},
                label="forward",
                description=impl,
                sub_label=f"size={size}",
            ).blocked_autorange()
        )

compare = Compare(results)
compare.print()
```
A tabular comparison — eager vs compiled, across different sizes.

Auto-thread tuning#

```python
t = Timer(stmt="...", num_threads=torch.get_num_threads()).blocked_autorange()
```
blocked_autorange() automatically picks a statistically meaningful number of samples.

CUDA-aware#

torch.utils.benchmark handles CUDA synchronization automatically — accurate GPU timing. The sketch below shows why naive timing goes wrong.
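A minimal sketch of the pitfall it protects you from (assumes `model` and `x` live on the GPU):

```python
import time
import torch

# Naive: the CPU regains control as soon as kernels are *launched*,
# so this mostly measures launch overhead, not GPU execution time.
start = time.perf_counter()
out = model(x)
naive_ms = (time.perf_counter() - start) * 1000

# Correct manual timing needs explicit synchronization around the region.
torch.cuda.synchronize()
start = time.perf_counter()
out = model(x)
torch.cuda.synchronize()   # wait until the GPU actually finishes
synced_ms = (time.perf_counter() - start) * 1000

print(f"naive: {naive_ms:.3f} ms  vs  synchronized: {synced_ms:.3f} ms")
```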

Practical patterns#

  1. Optimization regressions: benchmark before and after a PR
  2. Hardware comparison: H100 vs A100 on the same model
  3. Implementation comparison: PyTorch native vs custom Triton
  4. Mixed precision: BF16 vs FP32 vs FP8
7. Systematic NaN Hunting#

```python
# Production NaN hunter — comprehensive
import torch

class NaNHunter:
    def __init__(self, model):
        self.model = model
        self.hooks = []
        self.first_nan_module = None

    def install(self):
        def make_hook(name):
            def hook(module, input, output):
                if self.first_nan_module is not None:
                    return  # already found
                if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                    self.first_nan_module = name
                    print(f"⚠️ NaN/Inf FIRST appears at: {name}")
                    print(f"   Module: {module.__class__.__name__}")
                    for i, inp in enumerate(input):
                        if isinstance(inp, torch.Tensor):
                            print(f"   Input {i}: shape={inp.shape}, "
                                  f"range=[{inp.abs().min():.2e}, {inp.abs().max():.2e}], "
                                  f"any_nan={torch.isnan(inp).any()}")
                    print(f"   Output range: [{output.abs().min():.2e}, {output.abs().max():.2e}]")
            return hook

        for name, module in self.model.named_modules():
            if len(list(module.children())) == 0:  # leaf modules
                handle = module.register_forward_hook(make_hook(name))
                self.hooks.append(handle)

    def cleanup(self):
        for h in self.hooks:
            h.remove()
        self.hooks = []


# Usage
hunter = NaNHunter(model)
hunter.install()
try:
    out = model(x)
    loss = criterion(out, target)
    loss.backward()
finally:
    hunter.cleanup()
```
Production-grade NaN hunting toolkit.

8. Gradient Inspection Patterns#

What you need to know about gradients in production training:

Per-layer gradient norm#

```python
def log_gradient_norms(model, step):
    total_norm = 0.0
    layer_norms = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            n = p.grad.norm().item()
            layer_norms[f"grad/{name}"] = n
            total_norm += n ** 2
    layer_norms["grad/total"] = total_norm ** 0.5
    # wandb or tensorboard
    wandb.log({**layer_norms, "step": step})
```

Healthy ranges (Llama-style)#

  • First layer (embedding): 0.01 - 0.5
  • Middle layers (transformer): 0.001 - 0.1
  • Last layer (head): 0.01 - 1.0
  • Total: 0.1 - 5.0

Anomaly patterns#

| Pattern | Likely cause |
| --- | --- |
| Total norm > 100 | Exploding gradients — clip harder |
| Embedding norm >> others | Tokenization issue, OOV tokens |
| Late layers >> early | Vanishing gradient (rare in modern architectures) |
| One layer abnormally high | Specific issue in that layer |
| All ~0 | Frozen model (intended?) or vanishing |
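A minimal sketch of turning a couple of these rules into automated alerts (the function name and thresholds are illustrative and should be tuned per model):

```python
def check_grad_anomalies(layer_norms: dict, total_norm: float) -> list:
    # layer_norms: {"grad/<param name>": norm}, as produced by log_gradient_norms above
    alerts = []
    if total_norm > 100:
        alerts.append(f"exploding gradients: total norm {total_norm:.1f} > 100 — clip harder")
    if total_norm < 1e-8:
        alerts.append("all gradients ~0 — frozen model (intended?) or vanishing")
    mean_norm = sum(layer_norms.values()) / max(len(layer_norms), 1)
    for name, n in layer_norms.items():
        if n > 10 * mean_norm:
            alerts.append(f"{name} is abnormally high ({n:.2e} vs mean {mean_norm:.2e})")
    return alerts
```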

Gradient histogram#

```python
for name, p in model.named_parameters():
    if p.grad is not None:
        g = p.grad.flatten()
        counts = torch.histc(g, bins=20)
        # bin edges matching histc's default data-range binning
        edges = torch.linspace(g.min().item(), g.max().item(), steps=21)
        wandb.log({f"grad_hist/{name}": wandb.Histogram(
            np_histogram=(counts.cpu().numpy(), edges.cpu().numpy())
        )})
```
Distribution shape: heavy-tail → potential issue.

Update-to-weight ratio#

```python
for name, p in model.named_parameters():
    if p.grad is not None:
        update_norm = (lr * p.grad).norm().item()
        weight_norm = p.norm().item()
        ratio = update_norm / (weight_norm + 1e-8)
        wandb.log({f"ratio/{name}": ratio})
```
Healthy: 1e-3 to 1e-2. Too high (>1) → learning rate too large. Too low (<1e-5) → model converged or gradients vanishing.

9. Repro Patterns — Bug Reproduction#

Being able to reproduce a production bug locally is the foundation of debugging.

Minimum repro recipe#

```python
# repro.py
import torch

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.use_deterministic_algorithms(True)

# Simplified model (not the full model)
class MiniModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(100, 100)

    def forward(self, x):
        # Same operations as production
        return torch.softmax(self.linear(x), dim=-1)

model = MiniModel().cuda()
x = torch.randn(4, 100, device="cuda", dtype=torch.bfloat16)

# Run the same ops
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)

print(out)  # Expected NaN or bug
```

Repro best practices#

  1. Smallest input: 16×16, not 4096×4096
  2. Single GPU: eliminate distributed first
  3. No fancy stuff: torch.compile and FSDP off — keep it plain
  4. Deterministic: seed + algorithms
  5. Save a snapshot: dump the state right before the bug (see the sketch after this list)
  6. Git commit hash: which version
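A minimal sketch of such a snapshot dump (the function name and fields are illustrative, not a standard API):

```python
import torch

def dump_repro_bundle(model, optimizer, batch, step, path="repro_bundle.pt"):
    # Everything needed to replay the failing step offline on a single GPU
    torch.save(
        {
            "step": step,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "batch": batch,                          # the exact inputs that triggered the bug
            "rng_state": torch.get_rng_state(),
            "cuda_rng_state": torch.cuda.get_rng_state_all(),
        },
        path,
    )
```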

Issue reporting#

If you are opening an issue on GitHub:
  • PyTorch version
  • CUDA version
  • GPU model
  • OS
  • Minimal repro script
  • Expected vs actual
  • Stack trace
PyTorch maintainers appreciate this format.

10. Production Debug Workflow#

When things break in production, the systematic flow:

Step 1: Verify the problem#

  • Is the failure real, or a flaky test?
  • Is it a specific batch, or systematic?
  • At which step did it start?

Step 2: Snapshot#

  • When was the last good checkpoint?
  • Identify the bad batch (log it)
  • Memory snapshot (Module 5.3) — see the sketch below
  • Gradient stats up to that point
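A minimal sketch of capturing a CUDA memory snapshot around the suspect step (`run_suspect_step` and `batch` are placeholders; the underscore-prefixed recorder API is available in recent PyTorch 2.x releases):

```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)
try:
    run_suspect_step(batch)  # placeholder for the training step that misbehaves
finally:
    torch.cuda.memory._dump_snapshot("bad_step_memory.pickle")  # inspect with the memory_viz tool
    torch.cuda.memory._record_memory_history(enabled=None)      # stop recording
```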

Step 3: Reproduce locally#

  • Create a minimum example (the recipe above)
  • Deterministic mode
  • Single GPU

Step 4: Anomaly mode#

  • torch.autograd.set_detect_anomaly(True)
  • Install the NaN hunter
  • Identify the first failing module

Step 5: Root cause#

  • Activation explosion?
  • Gradient overflow?
  • Mixed precision underflow?
  • Bad input data?
  • Code bug (recent change)?

Step 6: Fix + verify#

  • Apply the fix
  • Verify against the repro — does it pass now?
  • Rerun in production

Step 7: Prevention#

  • Add a test (the specific repro case)
  • Improve monitoring (detect the anomaly earlier)
  • Documentation (post-mortem)

Modern tooling#

  • wandb/tensorboard: real-time gradient stats
  • Langfuse: LLM-specific tracing (Module 48)
  • Sentry: error tracking + alerting
  • NCCL_DEBUG: distributed communication debugging
The LLM training engineer's everyday arsenal.

11. Mini Exercises#

  1. Hook practice: print activation magnitudes in a 12-layer transformer. Which layer is the largest?
  2. NaN hunting: your model has NaNs. What are your sequential debug steps?
  3. Reproducibility: two runs with the same seed give different results. What are the possible causes?
  4. Benchmarking: the model measures 5 ms with time.perf_counter() but 8 ms with torch.utils.benchmark. Why the difference?
  5. Production scenario: at step 10000 the loss goes NaN. What do you do in the first 5 minutes?

What Did We Learn in This Lesson?#

Forward hooks — intercept module outputs, activation stats ✓
Backward hooks — gradient flow inspection, vanishing/exploding detection ✓
Tensor hooks — modify individual parameter gradients ✓
Anomaly detection mode — built-in NaN debugger ✓
Deterministic training — reproducibility (seeds + algorithms) ✓
torch.utils.benchmark — precise CUDA-aware timing ✓
Systematic NaN hunting (the NaNHunter class pattern) ✓
Gradient inspection — per-layer norms, ratios, histograms ✓
Repro patterns — the minimum example recipe ✓
Production debug workflow — a systematic 7-step process ✓

Next Lesson#

5.8 — Production Engineering: Reproducibility, Determinism, CI/CD for ML. The final lesson of the PyTorch engineering track — production workflow patterns: ML CI/CD pipelines, integrating the eval harness into CI, model versioning (DVC, MLflow), canary deployments, rollback strategies, prompt + model + data versioning.

Frequently Asked Questions

**Why is anomaly detection mode so slow, and can it stay on in production?**

**Performance**: 5-10x slowdown. Every operation gets a gradient flow check and stack trace caching. A 10-day production training run becomes 50-100 days — unacceptable. In practice: production runs in normal mode, and anomaly mode is turned on **for debugging** (on a small repro). Modern PyTorch teams use the `detect_anomaly` context manager only for suspect regions.
