torch.compile + Inductor: Reduce-Overhead + Dynamic Shapes + Recompile Watcher
PyTorch 2.x's flagship feature: torch.compile. The Inductor backend (Triton kernel generation), the 3 modes (default, reduce-overhead, max-autotune), dynamic shapes (with a recompile watcher), CUDA graphs, and integration into a fine-tuning training pipeline. On an RTX 4090 with a Llama 3.1 8B fine-tune: +15% throughput.
Şükrü Yusuf KAYA
26 minute read
Advanced · Python
```python
# === torch.compile training integration ===
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
).cuda()

# Compile the model. Mode options:
# "default"         — safe, balanced
# "reduce-overhead" — CUDA graphs, fastest for stable shapes
# "max-autotune"    — aggressive autotuning, long first warmup
model = torch.compile(
    model,
    mode="reduce-overhead",
    fullgraph=False,  # False — partial graphs allowed (safer)
    dynamic=True,     # variable shapes (seq length)
)

# Training loop is unchanged
for batch in loader:
    loss = model(**batch).loss
    loss.backward()
    ...

# Bench (RTX 4090, Llama 3.1 8B fine-tune):
# Vanilla:                  1.78 step/s
# + torch.compile default:  1.92 step/s (+8%)
# + reduce-overhead:        2.04 step/s (+15%)
```
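The step/s figures above come from steady-state timing, i.e. after compile warmup. A minimal sketch of how to measure this yourself — the `benchmark_steps` helper and the warmup count are illustrative, not from the original post:

```python
import time
from itertools import cycle

import torch

def benchmark_steps(model, loader, n_steps=50, warmup=10):
    """Steady-state training steps/sec. The warmup steps absorb
    torch.compile's graph-capture / autotuning cost, which would
    otherwise dominate the measurement."""
    it = cycle(loader)  # reuse batches so short loaders still work
    for _ in range(warmup):
        model(**next(it)).loss.backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # CUDA kernels are async; flush before timing
    start = time.perf_counter()
    for _ in range(n_steps):
        model(**next(it)).loss.backward()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_steps / (time.perf_counter() - start)
```

Run it once on the eager model and once after wrapping the model in `torch.compile` to reproduce the comparison on your own hardware.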
1. Recompile Watcher
```python
import torch._dynamo

torch._dynamo.config.suppress_errors = True         # fall back to eager on compile errors
torch._dynamo.config.capture_scalar_outputs = True  # capture .item() etc. instead of breaking the graph
torch._dynamo.config.cache_size_limit = 64          # default 8 — not enough for fine-tuning

# Log recompile events
torch._dynamo.config.verbose = True
Common causes of recompiles during fine-tuning:
- Sequence length changes (packing on/off)
- Dynamic batch size
- Gradient checkpointing toggled on/off mid-training
Fix: fixed shapes (`max_seq_length=4096` and `batch_size=2`, both fixed) → 0 recompiles.

✅ Deliverables
1) Compare fine-tuning throughput on Llama 8B with torch.compile on vs. off. 2) Analyze the recompile logs. 3) Next lesson: 13.7 — CUDA Graph Capture.
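One way to enforce the fixed shapes recommended above is a collate function that pads every batch to the same length. A sketch assuming token-ID lists as input — `fixed_shape_collate` is an illustrative name, not from the lesson:

```python
import torch

MAX_SEQ_LENGTH = 4096  # matches the fixed shape recommended above

def fixed_shape_collate(features, pad_token_id=0):
    """Pad/truncate every sample to MAX_SEQ_LENGTH so each batch has an
    identical (batch, seq) shape and Dynamo's shape guards never fail."""
    bsz = len(features)
    input_ids = torch.full((bsz, MAX_SEQ_LENGTH), pad_token_id, dtype=torch.long)
    attention_mask = torch.zeros((bsz, MAX_SEQ_LENGTH), dtype=torch.long)
    labels = torch.full((bsz, MAX_SEQ_LENGTH), -100, dtype=torch.long)  # -100 is ignored by CE loss
    for i, feat in enumerate(features):
        ids = torch.as_tensor(feat["input_ids"][:MAX_SEQ_LENGTH])
        n = ids.numel()
        input_ids[i, :n] = ids
        attention_mask[i, :n] = 1
        labels[i, :n] = ids  # causal LM: labels mirror inputs
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

Pass it as `collate_fn` to the `DataLoader`; with every batch at a fixed 4096 tokens, `dynamic=True` in `torch.compile` becomes unnecessary.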