torch.profiler vs. nsys: which one is enough?

torch.profiler is fast, Python-friendly, and operator-level. nsys is detailed, kernel-level, and visualizes CUDA streams. In practice, use torch.profiler for quick bottleneck detection and nsys for deep analysis. The cookbook produces a torch.profiler trace at the end of every lesson; nsys appears only in Part XIII.

Does nsys work on WSL2?

With limitations. GPU profiling requires NVIDIA driver ≥ 555 and WSL2 kernel ≥ 5.15, and some CUDA stream metrics are missing. The cookbook recommends running the profiling Labs on native Linux. On native Windows, nsys with Visual Studio integration also works.

Profiling Stack: torch.profiler + Nsight Systems + Nsight Compute + MFU Calculation

Optimization without profiling is hot air. This lesson covers Python-level timing with torch.profiler, the kernel-level timeline with Nsight Systems (nsys), kernel-internal metrics with Nsight Compute (ncu), and the MFU (Model FLOPs Utilization) calculation. Cookbook certification requires MFU > 35% in each Lab.

Şükrü Yusuf KAYA

38 min read

5/14/2026

Advanced

🎯 This lesson

Against the RTX 4090's theoretical 165 TFLOP/s of bf16 compute, what percentage does your actual Lab reach? The answer is MFU. This lesson shows how to calculate MFU, find bottlenecks with a profiler, and drill down to individual kernels.

1. What Is MFU (Model FLOPs Utilization)?

MFU = (achieved model FLOPs per second) / (hardware peak FLOPs per second)

Model FLOPs estimate (for a transformer):

F = 6 × N × T

N = number of parameters
T = total tokens in the batch (summed over all sequences)
6 = forward (2N FLOPs/token) + backward (4N FLOPs/token)
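As a quick check on the formula above, here is a minimal sketch that evaluates F = 6 × N × T. The function name `training_flops` is illustrative and not part of the cookbook, and the 6N factor is the approximation given above (it does not count the attention-score FLOPs).

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for one step: 6 * N * T."""
    forward = 2 * n_params * n_tokens    # 2N FLOPs per token
    backward = 4 * n_params * n_tokens   # 4N FLOPs per token
    return forward + backward

# 8B parameters, one packed batch of 8192 tokens
flops = training_flops(8e9, 8192)
print(f"{flops / 1e12:.0f} TFLOP per step")  # prints "393 TFLOP per step"
```

Note that this is a count of operations per step (TFLOP), not a rate (TFLOP/s); dividing by the measured step time gives the achieved rate.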

Llama 3.1 8B + RTX 4090 (165 TFLOP/s peak):

Compute for 1 step (8B, 8K tokens packed): F = 6 × 8e9 × 8192 ≈ 393 TFLOP
From the bench: 1.78 step/s → 393 × 1.78 ≈ 699 TFLOP/s achieved
Wait: the peak is 165 TFLOP/s, so how can the achieved figure be 699?

It cannot. 165 TFLOP/s is the per-GPU bf16 tensor core peak, and 6 × N FLOPs/token counts the FLOPs that training requires, so achieved throughput always equals 165 × MFU and stays below the peak. A result above the peak means one of the inputs (step rate or tokens per step) is wrong.

The correct calculation:

MFU = achieved tokens/s × 6N / theoretical peak FLOPs per second
Llama 3.1 8B + 4090 + the cookbook config: ~1,000 tokens/s × 6 × 8e9 / 165e12 ≈ 29% MFU

Cookbook target: MFU > 35% (with Unsloth). Unsloth's fused kernels push throughput closer to the hardware limit.

Model + config	MFU	Notes
Naive HF Trainer	~20-25%	Python overhead, no FA
Cookbook default	29-35%	FA2 + grad-ckpt + packing
Unsloth	40-50%	fused Triton kernels
Theoretical	100% = 165 TFLOP/s	unreachable
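The conversion between throughput and MFU can be sketched in a few lines. The helper names below are illustrative, the 6N approximation is the one from this lesson, and the 165 TFLOP/s peak is the RTX 4090 bf16 figure quoted above.

```python
def mfu_percent(tokens_per_sec: float, n_params: float, peak_tflops: float) -> float:
    """MFU in percent: achieved training FLOP/s over hardware peak FLOP/s."""
    achieved_flops = tokens_per_sec * 6 * n_params
    return achieved_flops / (peak_tflops * 1e12) * 100

def tokens_per_sec_for(mfu_target_percent: float, n_params: float, peak_tflops: float) -> float:
    """Throughput needed to reach a target MFU (inverse of mfu_percent)."""
    return mfu_target_percent / 100 * peak_tflops * 1e12 / (6 * n_params)

# 8B-parameter model on a 165 TFLOP/s GPU
print(f"{mfu_percent(1000, 8e9, 165):.1f}% MFU at 1000 tokens/s")
print(f"{tokens_per_sec_for(35, 8e9, 165):.0f} tokens/s needed for 35% MFU")
```

If `mfu_percent` ever returns more than 100, the throughput or the parameter count fed into it is wrong.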

python

# === MFU calculation: cookbook helper ===
import torch

# Dense bf16 tensor core peak per GPU, in TFLOP/s
GPU_PEAK_TFLOPS = {
    "NVIDIA GeForce RTX 4090": 165,
    "NVIDIA A100-SXM4-80GB": 312,
    "NVIDIA H100 80GB HBM3": 989,
    "NVIDIA H100 PCIe": 756,
}

def measure_mfu(model, tokens_per_sec: float) -> float:
    """Return MFU in percent for the measured training throughput."""
    # Note: 4-bit quantized layers may store packed weights, so numel() can
    # undercount; check n_params against the model's nominal size.
    n_params = sum(p.numel() for p in model.parameters())
    flops_per_token = 6 * n_params           # forward + backward
    achieved_tflops = tokens_per_sec * flops_per_token / 1e12
    gpu = torch.cuda.get_device_name(0)
    peak = GPU_PEAK_TFLOPS.get(gpu)
    if peak is None:
        print(f"⚠️ unknown GPU {gpu}; using A100 peak as fallback")
        peak = 312
    mfu = achieved_tflops / peak * 100
    print(f"params={n_params/1e9:.1f}B  achieved={achieved_tflops:.1f} TFLOPs/s  peak={peak} TFLOPs/s  →  MFU={mfu:.1f}%")
    return mfu

# Usage example, at the end of the bench loop:
# tokens_per_sec = total_tokens / elapsed_time
# measure_mfu(model, tokens_per_sec)

cookbook MFU helper

2. torch.profiler: Python-Level Timing

First step: which Python functions and operators take the most time?

python

# === torch.profiler example ===
# Assumes model, optimizer, and loader are already defined.
import torch
from torch.profiler import profile, ProfilerActivity, record_function

def train_step(model, optimizer, batch):
    with record_function("forward"):
        out = model(**batch)
    with record_function("backward"):
        out.loss.backward()
    with record_function("optimizer_step"):
        optimizer.step()
        optimizer.zero_grad()
    return out.loss.item()

# Profile 10 steps: 2 skipped, 2 warmup, 6 recorded
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
    schedule=torch.profiler.schedule(wait=2, warmup=2, active=6, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./tb_profiler"),
) as prof:
    for step, batch in enumerate(loader):
        if step >= 10: break
        loss = train_step(model, optimizer, batch)
        prof.step()

# Top 10 ops by total CUDA time
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10
))

# Then run: tensorboard --logdir tb_profiler
# In the browser: operator breakdown, stack traces, memory timeline

Timeline + memory profiling with torch.profiler
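The `schedule(wait=2, warmup=2, active=6, repeat=1)` call in the example decides which of the 10 steps end up in the trace. The sketch below is a simplified pure-Python model of that behavior for a single cycle, not PyTorch's implementation; the function name `profiler_phase` is made up for illustration.

```python
def profiler_phase(step: int, wait: int, warmup: int, active: int) -> str:
    """Simplified model of one profiler cycle: skip, warm up, then record."""
    if step < wait:
        return "wait"      # profiler is idle
    if step < wait + warmup:
        return "warmup"    # profiler runs, results are discarded
    if step < wait + warmup + active:
        return "active"    # events are recorded into the trace
    return "done"          # the single cycle (repeat=1) has finished

phases = [profiler_phase(s, wait=2, warmup=2, active=6) for s in range(10)]
print(phases)
```

Only the six active steps feed the `key_averages()` table, so the first four steps do not skew the reported times.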

3. Nsight Systems (nsys): Kernel-Level Timeline

torch.profiler shows the operator level; nsys shows the kernel level. Why a kernel ran long, which memcpys sit between kernels, whether CUDA streams overlap: nsys shows all of it on a visual timeline.

bash

# === Capture a profile with nsys ===
# RTX 4090 + the cookbook's standard command
# Note: --capture-range=cudaProfilerApi records only between cudaProfilerStart/Stop,
# so train.py must call torch.cuda.profiler.start() and torch.cuda.profiler.stop().
nsys profile \
    --output ftc_llama8b_qlora \
    --trace=cuda,nvtx,osrt,cudnn,cublas \
    --cuda-memory-usage true \
    --capture-range=cudaProfilerApi \
    --cudabacktrace=true \
    --gpu-metrics-device=0 \
    uv run python train.py --max_steps 20

# To open the trace: nsys-ui ftc_llama8b_qlora.nsys-rep
# Or the web UI: https://nsight-systems-web.nvidia.com (upload the report)

nsys profile command

What to look for in Nsight Systems

GPU utilization timeline: do you see gaps? Gap = idle = bottleneck.
CUDA kernel breakdown: which kernel takes what share of total time? Typical for Llama 8B:
- bf16 matmuls (ampere_bf16_* or cutlass): 60-70%
- FlashAttention (fwd_kernel / bwd_kernel): 10-20%
- LayerNorm/RMSNorm: 3-5%
- Embedding lookup: 1-2%
memcpy_HtoD / DtoH: minimize any host-device transfers
NCCL all-reduce (multi-GPU): can it be overlapped with compute?
CPU threads: is the DataLoader's Python thread making the compute thread wait?

Typical findings (cookbook iterations):

Embedding gradient scatter_add kernel is unexpectedly long → is it CPU offload, and is the gradient really needed?
Optimizer step (adamw_8bit) takes more than 5% → batch size may be too small; switch to paged_adamw
DataLoader gap is 15% of every step → increase num_workers
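The "gap = idle" idea can be quantified once kernel start and end times are exported from a trace. The following sketch assumes a hypothetical list of `(start_ms, end_ms)` kernel intervals that all lie inside the measured window; the data and the function name are illustrative and do not come from real nsys output.

```python
def idle_fraction(intervals, window_start, window_end):
    """Fraction of [window_start, window_end] in which no kernel was running."""
    if window_end <= window_start:
        raise ValueError("window must have positive length")
    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:   # a new busy span begins
            if cur_end is not None:
                busy += cur_end - cur_start      # close the previous span
            cur_start, cur_end = start, end
        else:                                    # overlapping or adjacent kernel
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return 1.0 - busy / (window_end - window_start)

# Hypothetical trace: two 100 ms steps, each waiting 15 ms on the DataLoader
kernels = [(15.0, 60.0), (60.0, 100.0), (115.0, 200.0)]
print(f"idle: {idle_fraction(kernels, 0.0, 200.0):.0%}")  # prints "idle: 15%"
```

Overlapping intervals are merged before summing, so kernels running concurrently on several streams are not double-counted as busy time.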

4. Nsight Compute (ncu): Kernel-Internal Metrics

Why is a kernel slow? Is it memory-bound, compute-bound, or suffering from low occupancy?

bash

# Focus on a single kernel: pick one of the hot ones
# (adjust the regex to the kernel name shown in your nsys trace)
ncu --kernel-name regex:flash_attn_bwd \
    --launch-skip 5 --launch-count 3 \
    --section MemoryWorkloadAnalysis \
    --section ComputeWorkloadAnalysis \
    --section Occupancy \
    -o flash_bwd_report \
    uv run python train.py

Open the report with `ncu-ui flash_bwd_report.ncu-rep`. Metrics:

SM Utilization (Compute): 80%+ is ideal
Memory Throughput: peak is 1008 GB/s (4090)
Achieved Occupancy: 70%+ is ideal
Bank Conflicts: 0 is ideal
Roofline analysis: compute-bound or memory-bound?
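The roofline question can be estimated by hand before opening ncu. The sketch below compares a kernel's arithmetic intensity (FLOPs per byte moved) with the GPU's ridge point, using the 165 TFLOP/s and 1008 GB/s figures quoted in this lesson. The function name and the example shapes are illustrative assumptions, and the byte counts are the ideal case where every tensor is touched exactly once.

```python
def roofline_bound(flops, bytes_moved, peak_flops=165e12, peak_bw=1008e9):
    """Classify a kernel as compute- or memory-bound on a simple roofline."""
    intensity = flops / bytes_moved              # FLOPs per byte
    ridge = peak_flops / peak_bw                 # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bw)
    kind = "compute-bound" if intensity >= ridge else "memory-bound"
    return kind, intensity, attainable

# bf16 matmul (M x K) @ (K x N): 2*M*N*K FLOPs, 2 bytes per element
M, N, K = 8192, 4096, 4096
matmul = roofline_bound(2 * M * N * K, 2 * (M * K + K * N + M * N))
print(matmul[0], f"{matmul[1]:.0f} FLOP/byte")

# Elementwise op on n bf16 values: ~4 FLOPs per element, read + write 2 bytes each
n = 8192 * 4096
elementwise = roofline_bound(4 * n, 4 * n)
print(elementwise[0], f"{elementwise[1]:.0f} FLOP/byte")
```

Real kernels usually move more bytes than this ideal count, so the computed intensity is an upper bound; ncu's measured values are the ones to trust.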

Cookbook usage rule: we drop down to ncu only in Part XIII (custom Triton kernels). For most Labs, torch.profiler + nsys is enough.

python

# === MFU benchmark: mandatory at the end of every cookbook Lab ===
# Requires measure_mfu() from the MFU helper above.
import time
import torch
from torch.utils.data import DataLoader

def bench_mfu(model, loader: DataLoader, n_steps: int = 50) -> dict:
    """MFU bench over n_steps steps. Cookbook certification requirement."""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

    # Warmup: 5 untimed steps
    for i, batch in enumerate(loader):
        if i >= 5: break
        batch = {k: v.to("cuda") for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize()

    # Timed bench
    total_tokens = 0
    t0 = time.perf_counter()
    for i, batch in enumerate(loader):
        if i >= n_steps: break
        batch = {k: v.to("cuda") for k, v in batch.items()}
        total_tokens += batch["input_ids"].numel()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - t0

    tps = total_tokens / elapsed
    mfu = measure_mfu(model, tps)
    return {"tokens_per_sec": tps, "mfu_percent": mfu, "elapsed_sec": elapsed}

# Cookbook certification check
result = bench_mfu(model, loader, n_steps=50)
assert result["mfu_percent"] >= 35, (
    f"MFU {result['mfu_percent']:.1f}% is below the cookbook minimum of 35%. "
    f"Find the bottleneck: capture a timeline with torch.profiler."
)
print(f"✅ MFU {result['mfu_percent']:.1f}%: meets the cookbook standard")

cookbook MFU certification bench script
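One caveat with the bench script: `batch["input_ids"].numel()` counts padding positions as useful tokens, which inflates tokens/s and MFU when packing is off. A minimal sketch of the difference, using plain Python lists in place of tensors and assuming the usual convention that `attention_mask` is 1 for real tokens and 0 for padding (the function name is illustrative):

```python
def count_tokens(attention_mask):
    """Return (real_tokens, total_slots) for one batch."""
    total_slots = sum(len(row) for row in attention_mask)
    real_tokens = sum(sum(row) for row in attention_mask)
    return real_tokens, total_slots

# Hypothetical batch of 3 sequences padded to length 8
mask = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
]
real, total = count_tokens(mask)
print(f"{real} real tokens in {total} slots")  # prints "16 real tokens in 24 slots"
```

With tensors, the equivalent of `real_tokens` would be `batch["attention_mask"].sum().item()`, assuming the batch carries an attention mask.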

🐛 FMD: 'MFU is 18%, the cookbook says 35%. Where is the bug?'

Hypotheses:
(a) FlashAttention 2 is not active → attention matmuls fall back to a slower fp32 path → ~50% throughput loss. Fix: force `attn_implementation='flash_attention_2'`.
(b) Sequence packing is off → effective tokens per step are low. Fix: TRL `SFTConfig(packing=True)`.
(c) DataLoader bottleneck → GPU idle 30%+ of the time. Fix: profile it, then tune num_workers and prefetch.
(d) Gradient checkpointing is wrapped twice (HF + Unsloth) → forward runs 3x. Fix: enable it only once.
(e) AdamW in fp32 (not 8-bit) → long optimizer step. Fix: paged_adamw_8bit.

Drill: capture a torch.profiler trace, look at the top 5 ops, and find the root cause.
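The packing hypothesis is easy to quantify offline. The sketch below packs a made-up list of sequence lengths into fixed-size rows with a first-fit-decreasing strategy and compares the fill ratio against padding one sequence per row. It illustrates the idea only and is not TRL's packing implementation; it also assumes every length fits within `max_len`.

```python
def padded_fill(lengths, max_len):
    """Fill ratio when every sequence gets its own row padded to max_len."""
    return sum(lengths) / (len(lengths) * max_len)

def packed_fill(lengths, max_len):
    """Fill ratio after first-fit-decreasing packing into rows of max_len."""
    free_space = []                              # remaining room in each open row
    for n in sorted(lengths, reverse=True):
        for i, free in enumerate(free_space):
            if n <= free:                        # fits into an existing row
                free_space[i] = free - n
                break
        else:
            free_space.append(max_len - n)       # open a new row
    return sum(lengths) / (len(free_space) * max_len)

lengths = [900, 300, 700, 100, 500, 1200, 400]   # hypothetical token counts
print(f"padded: {padded_fill(lengths, 2048):.0%}")  # prints "padded: 29%"
print(f"packed: {packed_fill(lengths, 2048):.0%}")  # prints "packed: 67%"
```

The fill ratio is the share of each step's token slots that carry real data, so a low value here translates directly into lower effective tokens per step.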

5. Bench: MFU Across Stages

Llama 3.1 8B QLoRA, RTX 4090, packing on:

Config	tokens/s	MFU
HF Trainer naive (no FA, no pack)	1900	11%
+ FA2	3800	22%
+ grad-ckpt	4500	26%
+ packing	7290	29%
Unsloth (fused)	12700	45%

The cookbook certification threshold is 35%. Unsloth clears it comfortably; with HF Trainer it takes careful tuning.

✅ Deliverable

1) Run the `bench_mfu` code above and measure the MFU of your own Lab.
2) If it is below 35%, capture a torch.profiler trace.
3) Identify the top 3 bottlenecks, work on them, and raise MFU to 35%.
4) Next lesson: 1.8 Cost Engineering: H100 Hourly Price vs. Spot, Breakeven Analysis.

Frequently Asked Questions

Is it possible to push MFU above 50%?

Yes. On an RTX 4090, 55-60% has been reported with Unsloth + FlashAttention + custom Triton kernels. On an H100, FP8 + TE (Transformer Engine) reaches 70%+. 100% is physically impossible because of kernel launch overhead, memory access latency, and similar costs. The cookbook target is 35% (29% without Unsloth).
