Debug Arsenal: register_hook, Anomaly Mode, torch.utils.benchmark — Production Debugging Toolkit
Toolkit for when things break in production PyTorch: forward/backward hooks, anomaly detection mode, deterministic training, torch.utils.benchmark precise timing, repro patterns, systematic NaN hunting, gradient inspection, model debugging strategies.
Şükrü Yusuf KAYA
55 min read
Intermediate · 🔧 "It works on my machine" is not professional
In LLM engineering, things break. Loss spikes, NaN gradients, accuracy drops, OOM. What separates a senior engineer from a junior one: systematic debugging. This lesson is a production-grade debug toolkit. 55 minutes from now you will be able to inspect anything with hooks, track down NaN sources with anomaly mode, reproduce bugs with deterministic training, and benchmark precisely — all hands-on.
Lesson Map#
- Forward hooks — intercepting module outputs
- Backward hooks — inspecting gradient flow
- Tensor hooks — per-tensor gradients
- Anomaly detection mode
- Deterministic training — reproducibility
- torch.utils.benchmark — precise timing
- Systematic NaN hunting
- Gradient inspection patterns
- Repro patterns — bug reproduction
- Production debug workflow
1. Forward Hooks — Intercepting Module Outputs#
`Module.register_forward_hook` runs a custom function on every forward pass:

```python
def forward_hook(module, input, output):
    # input: tuple of tensors entering the module
    # output: the module's output
    print(f"{module.__class__.__name__}: out shape={output.shape}, mean={output.mean():.4f}")

# Attach the hook to every Linear layer
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_forward_hook(forward_hook)
```
Use cases#
1. Activation statistics
```python
activations = {}

def save_activation(name):
    def hook(module, input, output):
        activations[name] = {
            "mean": output.mean().item(),
            "std": output.std().item(),
            "max": output.abs().max().item(),
        }
    return hook

for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.LayerNorm)):
        module.register_forward_hook(save_activation(name))

# Run a forward pass
out = model(x)

# Inspect
for name, stats in activations.items():
    print(f"{name}: {stats}")
```
Which layer has an activation explosion? The hook shows you directly.
2. Feature extraction
```python
features = {}

def get_features(name):
    def hook(module, input, output):
        features[name] = output.detach()
    return hook

model.encoder.layer_5.register_forward_hook(get_features("layer_5"))
out = model(x)
intermediate = features["layer_5"]
```
Patterns like ELMo-style feature extraction and attention visualization.
3. NaN/Inf detection
```python
def nan_check_hook(module, input, output):
    if torch.isnan(output).any() or torch.isinf(output).any():
        print(f"⚠️ NaN/Inf in {module.__class__.__name__}!")
        print(f"  Input ranges: {[i.abs().max().item() for i in input if isinstance(i, torch.Tensor)]}")
```
Pinpoints the first layer where NaN appears.
Removing a hook#
```python
handle = module.register_forward_hook(my_hook)
# ... use ...
handle.remove()  # cleanup
```
Forget this and you leak memory.
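One way to make cleanup automatic is a small context manager — a sketch; `temporary_hook` is our own helper, not a PyTorch API:

```python
import torch
from contextlib import contextmanager

@contextmanager
def temporary_hook(module, hook):
    """Register a forward hook and guarantee its removal on exit."""
    handle = module.register_forward_hook(hook)
    try:
        yield handle
    finally:
        handle.remove()

# Usage: the hook only fires inside the with-block
layer = torch.nn.Linear(4, 4)
seen = []
with temporary_hook(layer, lambda m, i, o: seen.append(o.shape)):
    layer(torch.randn(2, 4))
layer(torch.randn(2, 4))  # no hook here
print(len(seen))  # 1
```

The `finally` block means the hook is removed even if the forward pass raises.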
2. Backward Hooks — Inspecting Gradient Flow#
`Module.register_full_backward_hook` fires during the backward pass:

```python
def backward_hook(module, grad_input, grad_output):
    # grad_input: gradients w.r.t. the module's inputs
    # grad_output: gradients w.r.t. the module's outputs
    in_norm = f"{grad_input[0].norm():.4f}" if grad_input[0] is not None else "None"
    print(f"{module.__class__.__name__}: "
          f"grad_out_norm={grad_output[0].norm():.4f}, "
          f"grad_in_norm={in_norm}")

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_full_backward_hook(backward_hook)
```
Use cases#
1. Vanishing/Exploding gradient detection
```python
grad_stats = {}

def check_gradient(name):
    def hook(module, grad_input, grad_output):
        grad_stats[name] = {
            "out_norm": grad_output[0].norm().item(),
            "in_norm": grad_input[0].norm().item() if grad_input[0] is not None else None,
        }
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        module.register_full_backward_hook(check_gradient(name))

loss.backward()

# Layer-by-layer gradient norms
for name, stats in grad_stats.items():
    print(f"{name}: out_norm={stats['out_norm']:.6f}")
```
Very small gradients (~1e-10) → vanishing. Very large (~1e10) → exploding.
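Building on the `grad_stats` dict collected above, a quick classifier — the thresholds are illustrative assumptions, tune them for your model:

```python
# Illustrative thresholds — assumptions, not universal constants
VANISH_THRESHOLD = 1e-7
EXPLODE_THRESHOLD = 1e2

def flag_gradient_anomalies(grad_stats):
    """Classify each layer's output-gradient norm as vanishing/exploding/ok."""
    flags = {}
    for name, stats in grad_stats.items():
        norm = stats["out_norm"]
        if norm < VANISH_THRESHOLD:
            flags[name] = "vanishing"
        elif norm > EXPLODE_THRESHOLD:
            flags[name] = "exploding"
        else:
            flags[name] = "ok"
    return flags

flags = flag_gradient_anomalies({
    "layer1": {"out_norm": 3e-9},
    "layer2": {"out_norm": 0.02},
})
print(flags)  # {'layer1': 'vanishing', 'layer2': 'ok'}
```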
2. Gradient sanity check
```python
def assert_finite_grad(module, grad_input, grad_output):
    for g in grad_input + grad_output:
        if g is not None and not torch.isfinite(g).all():
            raise RuntimeError(f"NaN/Inf gradient in {module.__class__.__name__}")
```
Catches NaN early in production training.
Old API (deprecated)#
`register_backward_hook` is deprecated — it could report incorrect gradients for some module graphs. Use `register_full_backward_hook` instead.
3. Tensor Hooks — Per-Tensor Gradients#
Gradient inspection at the tensor level rather than the module level.
```python
x = torch.randn(10, requires_grad=True)
y = x * 2
z = y.sum()

# Modify x's gradient as it is computed
def x_grad_hook(grad):
    print(f"x.grad: {grad}")
    return grad.clamp(-1, 1)  # per-tensor gradient clipping

x.register_hook(x_grad_hook)
z.backward()  # prints x.grad: tensor([2., 2., ..., 2.])
# After the hook, x.grad is the clamped tensor([1., 1., ..., 1.])
```
Practical uses#
1. Specific parameter monitoring
```python
# Track the gradient of one specific weight inside a large layer
attention_qkv_weight = model.transformer.h[0].attn.c_attn.weight
attention_qkv_weight.register_hook(lambda g: print(f"QKV grad norm: {g.norm():.4f}"))
```
2. Gradient debugging
```python
def trace_gradient(name):
    def hook(grad):
        if torch.isnan(grad).any():
            print(f"⚠️ NaN in {name}.grad")
            # Optionally: zero out the bad gradient
            return torch.zeros_like(grad)
        return grad
    return hook

for name, param in model.named_parameters():
    param.register_hook(trace_gradient(name))
```
Hook return values#
- `None`: the original gradient is used
- a `Tensor`: replaces the gradient

Return a value when you want to modify the gradient.
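A tiny demonstration of both return conventions:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = torch.ones(3, requires_grad=True)

# Returning None: the gradient passes through unchanged
x.register_hook(lambda g: None)
# Returning a tensor: replaces the gradient
y.register_hook(lambda g: g * 10)

(x.sum() + y.sum()).backward()
print(x.grad)  # tensor([1., 1., 1.])
print(y.grad)  # tensor([10., 10., 10.])
```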
4. Anomaly Detection Mode#
PyTorch's built-in NaN debugger:
```python
import torch.autograd

torch.autograd.set_detect_anomaly(True)

try:
    out = model(x)
    loss = criterion(out, target)
    loss.backward()
except RuntimeError as e:
    print(f"Anomaly detected: {e}")
    # A detailed stack trace of the offending operation is shown
```
What does it detect?#
- NaN gradients — and the operation that produced them
- Inf gradients
- Modified leaf tensor (autograd issue)
- In-place operation problems
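A minimal trigger for anomaly mode — the forward pass is finite, but the backward produces NaN (the exact operation name in the message depends on your PyTorch version):

```python
import torch

err = None
with torch.autograd.detect_anomaly():
    x = torch.tensor([0.0], requires_grad=True)
    y = x * torch.sqrt(x)  # forward is finite: 0 * 0 = 0
    try:
        # sqrt backward computes grad / (2*sqrt(x)) = 0/0 → NaN; anomaly mode raises
        y.backward()
    except RuntimeError as e:
        err = e
        print(f"Anomaly caught: {e}")
```

Without anomaly mode the NaN would silently land in `x.grad`; with it, the error names the backward function that produced it.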
Performance overhead#
~5-10x slower — use it for debugging; keep it off in production.
Context manager#
```python
with torch.autograd.detect_anomaly():
    loss = compute_loss(x, model)
    loss.backward()
```
Debugging scoped to this region only, minimizing the performance impact.
Typical workflow#
- Production training: you observe a NaN
- Build a small repro (deterministic seed)
- Enable anomaly mode + run the repro
- Detailed stack trace → which operation
- Fix the root cause, turn anomaly mode off
- If NaN still occurs in production: investigate dataset-specific edge cases
5. Deterministic Training — Reproducibility#
Answering "why did this bug happen?" in LLM training requires bit-exact reproducibility.
Setting random seeds#
```python
import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```
Deterministic algorithms#
```python
# Force the deterministic versions of CUDA ops where available
torch.use_deterministic_algorithms(True)
```
Some kernels are non-deterministic by default (for performance). Forcing determinism costs roughly a 10-30% slowdown.
CUDA-specific#
```bash
export CUBLAS_WORKSPACE_CONFIG=:4096:8  # required for cuBLAS determinism
```
PyTorch raises an error if this variable is not set while deterministic algorithms are enabled.
Determinism trade-off#
| Mode | Speed | Reproducibility |
|---|---|---|
| Default | Fast | Same seed → similar but not bit-exact |
| Deterministic | 10-30% slower | Bit-exact |
Practical usage#
- Development: deterministic mode (for debugging)
- Production: default mode (for speed)
- Reproducing a bug: deterministic + same seed
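The pieces above combine into one helper — a sketch; `warn_only=True` is our assumption so that ops without a deterministic kernel warn instead of crashing:

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 42):
    """Best-effort full-determinism setup: seeds + algorithms + CUDA config."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Must be set before cuBLAS is first used
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # warn_only=True: log instead of raising for ops with no deterministic kernel
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False

make_deterministic(123)
a = torch.randn(3)
make_deterministic(123)
b = torch.randn(3)
print(torch.equal(a, b))  # True — same seed, same draw
```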
Distributed determinism#
On multi-GPU setups the gradient-reduce order is non-deterministic, so bit-exact runs are hard. The recipe: `NCCL_BLOCKING_WAIT=1` + deterministic mode + a pinned comm config. Frontier labs do exactly this for critical debugging.
6. torch.utils.benchmark — Precise Timing#
`time.perf_counter()` alone is misleading for GPU code; `torch.utils.benchmark` handles warmup, repetitions, and CUDA sync:

```python
from torch.utils.benchmark import Timer

t = Timer(
    stmt="model(x)",
    globals={"model": model, "x": x},
    num_threads=1,
).timeit(100)

print(f"Mean: {t.mean*1000:.3f} ms")
print(f"Median: {t.median*1000:.3f} ms")
print(f"IQR: {t.iqr*1000:.3f} ms")  # Measurement.iqr is the interquartile range in seconds
```
Compare benchmarks#
```python
from torch.utils.benchmark import Compare, Timer

results = []
for size in [128, 256, 512, 1024]:
    for impl in ["eager", "compiled"]:
        x = torch.randn(32, size, device="cuda")
        m = compiled_model if impl == "compiled" else model
        results.append(
            Timer(
                stmt="m(x)",
                globals={"m": m, "x": x},
                label="forward",
                description=impl,
                sub_label=f"size={size}",
            ).blocked_autorange()
        )

compare = Compare(results)
compare.print()
```
A side-by-side comparison table — eager vs compiled, across sizes.
Auto-thread tuning#
```python
t = Timer(stmt="...", num_threads=torch.get_num_threads()).blocked_autorange()
```
`blocked_autorange()` automatically chooses how many runs to measure.
CUDA-aware#
torch.utils.benchmark handles CUDA synchronization automatically — accurate GPU timing.
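Why manual timing misleads: on GPU, `time.perf_counter()` around an async kernel launch measures the launch, not the execution, unless you synchronize. A sketch that falls back to CPU when no GPU is present:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
x = torch.randn(64, 512, device=device)

# Naive: on GPU this mostly measures the kernel *launch*
t0 = time.perf_counter()
model(x)
naive_ms = (time.perf_counter() - t0) * 1000

# Correct manual timing: synchronize before reading the clock
if device == "cuda":
    torch.cuda.synchronize()
t0 = time.perf_counter()
model(x)
if device == "cuda":
    torch.cuda.synchronize()
sync_ms = (time.perf_counter() - t0) * 1000

print(f"naive={naive_ms:.3f} ms, synchronized={sync_ms:.3f} ms")
```

`torch.utils.benchmark` does this synchronization (plus warmup and repetition) for you.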
Practical patterns#
- Optimization regressions: benchmark before/after a PR
- Hardware comparison: H100 vs A100 same model
- Implementation comparison: PyTorch native vs custom Triton
- Mixed precision: BF16 vs FP32 vs FP8
7. Systematic NaN Hunting#
Everything above, combined into one reusable tool:

```python
# Production NaN hunter — comprehensive
import torch

class NaNHunter:
    def __init__(self, model):
        self.model = model
        self.hooks = []
        self.first_nan_module = None

    def install(self):
        def make_hook(name):
            def hook(module, input, output):
                if self.first_nan_module is not None:
                    return  # already found
                if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                    self.first_nan_module = name
                    print(f"⚠️ NaN/Inf FIRST appears at: {name}")
                    print(f"  Module: {module.__class__.__name__}")
                    for i, inp in enumerate(input):
                        if isinstance(inp, torch.Tensor):
                            print(f"  Input {i}: shape={inp.shape}, "
                                  f"range=[{inp.abs().min():.2e}, {inp.abs().max():.2e}], "
                                  f"any_nan={torch.isnan(inp).any()}")
                    print(f"  Output range: [{output.abs().min():.2e}, {output.abs().max():.2e}]")
            return hook

        for name, module in self.model.named_modules():
            if len(list(module.children())) == 0:  # leaf modules only
                handle = module.register_forward_hook(make_hook(name))
                self.hooks.append(handle)

    def cleanup(self):
        for h in self.hooks:
            h.remove()
        self.hooks = []

# Usage
hunter = NaNHunter(model)
hunter.install()
try:
    out = model(x)
    loss = criterion(out, target)
    loss.backward()
finally:
    hunter.cleanup()
```

A production-grade NaN hunting toolkit.
8. Gradient Inspection Patterns#
What you need to know about gradients in production training:
Per-layer gradient norm#
```python
def log_gradient_norms(model, step):
    total_norm = 0.0
    layer_norms = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            n = p.grad.norm().item()
            layer_norms[f"grad/{name}"] = n
            total_norm += n ** 2
    layer_norms["grad/total"] = total_norm ** 0.5
    # wandb or tensorboard
    wandb.log({**layer_norms, "step": step})
```
Healthy ranges (Llama-style)#
- First layer (embedding): 0.01 - 0.5
- Middle layers (transformer): 0.001 - 0.1
- Last layer (head): 0.01 - 1.0
- Total: 0.1 - 5.0
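The ranges above can be turned into an automated check — a sketch; the bounds are the illustrative Llama-style numbers from the list, not universal constants:

```python
import torch

# Illustrative bound from the list above — adjust for your architecture
TOTAL_NORM_RANGE = (0.1, 5.0)

def check_gradient_health(model):
    """Return (total_norm, warnings) after a backward pass."""
    warnings = []
    total_sq = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        n = p.grad.norm().item()
        total_sq += n ** 2
        if n == 0.0:
            warnings.append(f"{name}: zero gradient (frozen or vanishing?)")
    total = total_sq ** 0.5
    lo, hi = TOTAL_NORM_RANGE
    if not (lo <= total <= hi):
        warnings.append(f"total grad norm {total:.4f} outside [{lo}, {hi}]")
    return total, warnings

model = torch.nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()
total, warns = check_gradient_health(model)
print(f"total={total:.4f}, warnings={warns}")
```

Call it right after `loss.backward()` and alert on any warnings.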
Anomaly patterns#
| Pattern | Likely cause |
|---|---|
| Total norm > 100 | Exploding gradient — clip more |
| Embedding norm >> others | Tokenization issue, OOV tokens |
| Late layers >> early | Vanishing gradient (rare modern arch) |
| One layer abnormally high | Specific issue in that layer |
| All ~0 | Frozen model (intended?) or vanishing |
Gradient histogram#
```python
for name, p in model.named_parameters():
    if p.grad is not None:
        grads = p.grad.detach().flatten().cpu()
        counts, edges = torch.histogram(grads, bins=20)  # CPU-only op
        wandb.log({f"grad_hist/{name}": wandb.Histogram(
            np_histogram=(counts.numpy(), edges.numpy())
        )})
```
The distribution's shape matters: a heavy tail signals a potential issue.
Update-to-weight ratio#
```python
for name, p in model.named_parameters():
    if p.grad is not None:
        update_norm = (lr * p.grad).norm().item()
        weight_norm = p.norm().item()
        ratio = update_norm / (weight_norm + 1e-8)
        wandb.log({f"ratio/{name}": ratio})
```
Healthy: 1e-3 to 1e-2. Too high (>1) → lr too large. Too low (<1e-5) → model has converged, or gradients are vanishing.
9. Repro Patterns — Bug Reproduction#
Being able to reproduce a production bug locally is the foundation of debugging.
Minimum repro recipe#
```python
# repro.py
import torch

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
torch.use_deterministic_algorithms(True)

# Simplified model (not the full model)
class MiniModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(100, 100)

    def forward(self, x):
        # Same operations as production
        return torch.softmax(self.linear(x), dim=-1)

model = MiniModel().cuda()
x = torch.randn(4, 100, device="cuda", dtype=torch.bfloat16)

# Run the same ops as production
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
print(out)  # expected: the NaN / bug reproduces here
```
Repro best practices#
- Smallest input: 16×16, not 4096×4096
- Single GPU: eliminate distributed first
- No fancy stuff: torch.compile and FSDP off — keep it plain
- Deterministic: seed + algorithms
- Save a snapshot: pickle the state just before the bug
- Git commit hash: record which version
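The "save a snapshot" step as code — a minimal sketch; the field names and file path are our own choice:

```python
import torch

def save_debug_snapshot(model, optimizer, batch, step, path="debug_snapshot.pt"):
    """Freeze everything needed to replay a suspicious step locally."""
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "batch": batch,                      # the exact offending input
        "rng_state": torch.get_rng_state(),  # replay the same randomness
        "torch_version": torch.__version__,
    }, path)

# Usage with a toy model
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
save_debug_snapshot(model, opt, torch.randn(3, 4), step=10000, path="/tmp/debug_snapshot.pt")
snap = torch.load("/tmp/debug_snapshot.pt", weights_only=False)
print(snap["step"])  # 10000
```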
Issue reporting#
If you are opening a GitHub issue, include:
- PyTorch version
- CUDA version
- GPU model
- OS
- Minimal repro script
- Expected vs actual
- Stack trace
PyTorch maintainers appreciate this format.
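PyTorch ships a helper that gathers most of this environment information for you:

```python
# torch.utils.collect_env produces the environment report the issue template asks for
from torch.utils.collect_env import get_pretty_env_info

report = get_pretty_env_info()
print(report)  # PyTorch/CUDA versions, GPU model, OS, relevant pip packages
```

The same report is available from the shell via `python -m torch.utils.collect_env`.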
10. Production Debug Workflow#
When things break in production, follow a systematic flow:
Step 1: Verify the problem#
- Is the failure real, or a flaky test?
- Is it one specific batch, or systematic?
- At which step did it start?
Step 2: Snapshot#
- When was the last good checkpoint?
- Identify the bad batch (log it)
- Memory snapshot (Module 5.3)
- Gradient stats up to that point
Step 3: Reproduce locally#
- Build a minimum example (the recipe above)
- Deterministic mode
- Single GPU
Step 4: Anomaly mode#
- Enable `torch.autograd.set_detect_anomaly(True)`
- Install the NaN hunter
- Identify the first failing module
Step 5: Root cause#
- Activation explosion?
- Gradient overflow?
- Mixed precision underflow?
- Bad input data?
- Code bug (recent change)?
Step 6: Fix + verify#
- Apply the fix
- Does the repro now pass?
- Rerun in production
Step 7: Prevention#
- Add a test (the specific repro case)
- Improve monitoring (catch anomalies earlier)
- Documentation (post-mortem)
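The "add a test" step can look like this — a pytest-style sketch; the all-zero batch is a hypothetical edge case standing in for whatever input triggered your bug:

```python
import torch

def test_no_nan_on_edge_case_batch():
    """Regression test: the batch that once produced NaN must stay finite."""
    torch.manual_seed(42)
    model = torch.nn.Linear(100, 100)
    x = torch.zeros(4, 100)  # hypothetical: the batch that triggered the bug
    out = torch.softmax(model(x), dim=-1)
    out.sum().backward()
    assert torch.isfinite(out).all(), "forward produced NaN/Inf"
    assert all(torch.isfinite(p.grad).all() for p in model.parameters())

test_no_nan_on_edge_case_batch()
print("regression test passed")
```

Run it in CI so the same failure can never ship silently again.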
Modern tooling#
- wandb/tensorboard: gradient stats real-time
- Langfuse: LLM-specific tracing (Module 48)
- Sentry: error tracking + alerting
- NCCL_DEBUG: distributed comm debugging
The LLM training engineer's daily arsenal.
11. Mini Exercises#
1. Hook practice: print the activation magnitudes of a 12-layer transformer. Which layer is largest?
2. NaN hunting: your model is producing NaN. What are the sequential debug steps?
3. Reproducibility: two runs with the same seed give different results. What are the possible causes?
4. Benchmarking: the model measures 5 ms with `time.perf_counter()` but 8 ms with `torch.utils.benchmark`. Why the difference?
5. Production scenario: loss goes NaN at step 10000. What do you do in the first 5 minutes?
What Did We Learn in This Lesson?#
✓ Forward hooks — intercept module outputs, activation stats
✓ Backward hooks — gradient flow inspection, vanishing/exploding detection
✓ Tensor hooks — modify individual parameter gradients
✓ Anomaly detection mode — built-in NaN debugger
✓ Deterministic training — reproducibility (seeds + algorithms)
✓ torch.utils.benchmark — precise CUDA-aware timing
✓ Systematic NaN hunting (the NaNHunter class pattern)
✓ Gradient inspection — per-layer norm, ratio, histogram
✓ Repro patterns — minimum example recipe
✓ Production debug workflow — a 7-step systematic process
Next Lesson#
5.8 — Production Engineering: Reproducibility, Determinism, CI/CD for ML
The final lesson of the PyTorch engineering module — production workflow patterns: ML CI/CD pipelines, integrating an eval harness into CI, model versioning (DVC, MLflow), canary deployment, rollback strategies, and versioning prompts + models + data.
Frequently Asked Questions
**Performance**: a 5-10x slowdown. Every operation gets a gradient-flow check and stack-trace caching. A 10-day production training run becomes 50-100 days — unacceptable. In practice: run production in normal mode and enable anomaly mode only **for debugging** (on a small repro). Modern PyTorch teams use the `detect_anomaly` context manager only around suspect regions.