torch.distributed In Depth: DDP, FSDP, ZeRO Stages — Production Distributed Training
We covered NCCL fundamentals in 5.4. Now the production distributed training stack: DDP gradient bucketing + overlap, FSDP sharding strategies (FULL_SHARD, SHARD_GRAD_OP, HYBRID_SHARD), a DeepSpeed ZeRO Stage 1/2/3 comparison, and hybrid 3D parallelism. The final bridge to Module 17.
Şükrü Yusuf KAYA
55 min read
Advanced · How do you train a 70B model on an 80GB GPU?
Modern LLM training is impossible without a distributed framework. DDP, FSDP, ZeRO: each is a different distributed training strategy. In 5.4 we learned the communication primitives with NCCL. Now the production stack: which strategy to use when, the memory/communication trade-offs, and fine-tuning Llama 3 70B on an 8× H100 cluster. 55 minutes from now: everything a production LLM engineer needs to know, ready to move on to Module 17.
Lesson Map
- Why is distributed training necessary?
- Data parallelism (DP, DDP)
- DDP gradient bucketing + overlap deep dive
- Model parallelism types: tensor, pipeline, sequence
- ZeRO (Zero Redundancy Optimizer): Stages 1, 2, 3
- FSDP (PyTorch native): FULL_SHARD vs SHARD_GRAD_OP vs HYBRID_SHARD
- DeepSpeed vs FSDP comparison
- 3D parallelism: DP × TP × PP
- Production decision matrix
- Bridge to Module 17
1. Why Is Distributed Training Necessary?
A 70B Llama 3 fine-tuning scenario:
- Weights (BF16): 140 GB
- Master weights (FP32): 280 GB
- Adam states (m + v, FP32): 560 GB
- Gradients (BF16): 140 GB
- Activations: ~50 GB
- Total: ~1.17 TB
On a single 80GB H100: OOM.
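A quick sanity check on that total, as a minimal sketch (the byte counts assume BF16 weights/gradients plus an FP32 Adam master copy):

```python
# Where ~1.17 TB comes from: per-parameter bytes for BF16 training with FP32 Adam.
params = 70e9
bytes_per_param = 2 + 4 + 4 + 4 + 2  # BF16 weight + FP32 master + Adam m + Adam v + BF16 grad
total_tb = params * bytes_per_param / 1e12
print(f"{total_tb:.2f} TB")  # 1.12 TB; add ~50 GB of activations to reach ~1.17 TB
```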
Solutions
- Activation checkpointing: cuts activation memory sharply (here ~50 GB → ~14 GB). Total ~1.13 TB. Still too much.
- 8-bit optimizer: optimizer states in 8-bit → Adam's 560 GB becomes ~140 GB. Still ~720 GB.
- Distributed training: shard the model + states across multiple GPUs.
Which axis do we parallelize along?
- Data parallel: every GPU holds the full model and processes a different batch shard. Communication: gradient sync.
- Model parallel (tensor): each GPU holds a slice of the model. Communication: intermediate activations.
- Pipeline parallel: each GPU holds a set of layers. Communication: activations between pipeline stages.
- Sequence parallel: same layers, but different sequence positions. For long context.
Modern LLM training uses a hybrid combination (3D parallelism). Module 17 goes deep on it.
2. Data Parallelism (DP, DDP)
The simplest and most common approach. Every GPU holds a full model copy and processes a different mini-batch shard.
Legacy: DataParallel (deprecated)
`torch.nn.DataParallel`: single-process, multi-GPU; bottlenecked by the Python GIL and per-step model replication. Use DDP instead.
Modern: DistributedDataParallel (DDP)
`torch.nn.parallel.DistributedDataParallel`:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda()  # your nn.Module
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

sampler = DistributedSampler(dataset)  # shards the dataset across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for batch, target in loader:
        out = model(batch.cuda())
        loss = criterion(out, target.cuda())
        loss.backward()  # gradient sync is automatic (allreduce)
        optimizer.step()
        optimizer.zero_grad()
```
Memory balance
Each GPU holds the full model + full optimizer state. For a 70B model on 8 GPUs:
- every GPU holds ~1.17 TB → still OOM.
DDP is for small-to-medium models. Large LLMs need FSDP/ZeRO.
Launch

```bash
torchrun --nproc_per_node=8 train.py
```
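For multi-node runs the same script launches with rendezvous flags. A minimal sketch, assuming 2 nodes × 8 GPUs; `MASTER_ADDR` is a placeholder for your head node:

```bash
# Hypothetical 2-node launch (16 GPUs total)
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py
```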
3. DDP Gradient Bucketing + Overlap — Deep Dive
Modern DDP allreduces gradients in buckets during the backward pass.
Naive (without bucketing)

```
after backward() finishes:
    allreduce gradient_1
    allreduce gradient_2
    ...
    allreduce gradient_N
```

Each allreduce requires a GPU sync. Sequential allreduce → communication doesn't start until backward finishes.
DDP bucketing

```
during backward:
    Bucket 1 (25 MB of gradients ready) → allreduce START (in the background)
    Bucket 2 ready → allreduce START
    Bucket 3 ready → ...
```

Communication and computation overlap. ~15-25% speedup.
Bucket size
PyTorch default: 25 MB. Tuning:

```python
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=50)
```

- Small buckets (5 MB): more overlap, but more NCCL calls → kernel launch overhead
- Large buckets (100 MB): fewer NCCL calls, but less overlap
25 MB is the sweet spot for most workloads.
gradient_as_bucket_view

```python
model = DDP(model, device_ids=[local_rank], gradient_as_bucket_view=True)
```

Memory savings: gradients become views into the communication buckets, so there are no extra tensors. Recommended in modern PyTorch.
Static graph (advanced)

```python
model = DDP(model, device_ids=[local_rank], static_graph=True)
```

Fixes the DDP gradient-sync order → enables more aggressive optimization (the older private API for this was `model._set_static_graph()`). Use it when the model has no conditional control flow.
Mixed precision interaction

```python
from torch.amp import autocast

with autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
    loss = criterion(out, target)
loss.backward()
```

BF16 gradients are half the size of FP32 → ~2x faster communication. NCCL supports BF16 natively in recent PyTorch/NCCL versions. One caveat: under autocast alone the parameters (and therefore the gradients being synced) stay FP32; BF16 gradient communication requires either BF16 parameters or a DDP communication hook.
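A sketch of the comm-hook route, using PyTorch's built-in BF16 compression hook (it casts gradients to BF16 for the allreduce only, then casts back):

```python
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Halve allreduce volume: compress gradients to BF16 for communication.
# state=None uses the default process group.
model.register_comm_hook(state=None, hook=default_hooks.bf16_compress_hook)
```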
4. Model Parallelism Types
When DDP isn't enough, parallelize the model itself. Three main types:
Tensor Parallelism (TP)
Split a tensor (e.g., a weight matrix W) column-wise or row-wise across GPUs. Each GPU does a partial computation; combine with all_gather.

```
W: (4096, 4096)
GPU 0: W[:, 0:2048]
GPU 1: W[:, 2048:4096]
Compute: y = x @ W → each GPU holds a partial y, then all_gather
```

Megatron-LM is the classic implementation. Communication-intensive (activation sync at every layer).
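To make the column split concrete, a minimal illustrative sketch (not Megatron-LM's actual code; assumes an initialized NCCL process group with one GPU per rank):

```python
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """y = x @ W with W split column-wise across ranks."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.world_size = dist.get_world_size()
        # Each rank owns a (in_features, out_features / world_size) column slice of W.
        self.weight = torch.nn.Parameter(
            torch.randn(in_features, out_features // self.world_size) * 0.02
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_local = x @ self.weight  # this rank's slice of the output columns
        parts = [torch.empty_like(y_local) for _ in range(self.world_size)]
        dist.all_gather(parts, y_local)  # collect every rank's partial output
        # Note: plain all_gather does not backprop through remote parts; real TP
        # uses autograd-aware collectives (e.g. torch.distributed.nn.functional).
        return torch.cat(parts, dim=-1)  # full (batch, out_features) output
```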
Pipeline Parallelism (PP)
Split the model layer-wise across GPUs. Pipeline stages.

```
GPU 0: layers 0-10
GPU 1: layers 11-20
GPU 2: layers 21-30
GPU 3: layers 31-40
Forward:  GPU 0 → 1 → 2 → 3
Backward: the reverse direction
```

GPipe, PipeDream, 1F1B: pipeline scheduling algorithms.
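The core mechanic in its most naive form: a two-stage split with no micro-batching, so it shows the idea but not a real schedule like 1F1B (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Naive model split across two GPUs; only one GPU works at a time."""

    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).to("cuda:0")
        self.stage1 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))  # activation crosses the GPU boundary here

# Real pipeline schedules (GPipe, 1F1B) split each batch into micro-batches so
# both stages can work concurrently instead of idling.
```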
Sequence Parallelism (SP)
For long context. Shard the same layer along sequence positions. Typically combined with FlashAttention.
ZeRO / FSDP
A hybrid: shard optimizer states + gradients + weights. Communication mostly resembles DP-style allreduce, but the optimizer/gradient/weight shards are gathered at different points in the step.
Modern LLM training mostly uses FSDP or DeepSpeed ZeRO. Pure TP/PP lives in frontier labs with custom infrastructure.
5. ZeRO Stages — Zero Redundancy Optimizer
DeepSpeed ZeRO (Rajbhandari et al., 2020): shard the optimizer states, gradients, and weights across GPUs. Reduces memory drastically.
The memory-breakdown problem
In standard DDP every GPU holds the full model. For a 70B model:
- Weights (BF16): 140 GB
- Master weights (FP32): 280 GB
- Adam m+v: 560 GB
- Gradients (BF16): 140 GB
- Total per GPU: ~1.12 TB (plus activations)
8 GPUs = 8x redundancy (pre-ZeRO).
ZeRO Stage 1: Optimizer States Sharded
The optimizer states (FP32 master weights + Adam m and v) are sharded → each GPU keeps only its own slice:
- Optimizer states: (280 + 560) GB / 8 = 105 GB per GPU
- Weights + gradients: full (each GPU holds 140 + 140 = 280 GB)
- Total per GPU: ~385 GB
Still large, but ~65% savings.
ZeRO Stage 2: + Gradients Sharded
Stage 1 + gradient sharding:
- Optimizer states: 105 GB
- Gradients: 140 GB / 8 = 17.5 GB
- Weights: 140 GB (full)
- Total per GPU: ~260 GB
~77% savings.
ZeRO Stage 3: + Weights Sharded (= FSDP)
Stage 2 + weight sharding:
- Optimizer states: 105 GB
- Gradients: 17.5 GB
- Weights: 140 GB / 8 = 17.5 GB
- Total per GPU: ~140 GB
~87% savings; everything now scales as 1/N. Note that ~140 GB still exceeds a single 80GB card, so in practice a full 70B Adam fine-tune on 8× 80GB also needs activation checkpointing plus CPU offload or an 8-bit optimizer (or simply more GPUs).
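The stage arithmetic generalizes to any model size and GPU count. A rough sketch (it ignores activations, buffers, and framework overhead):

```python
def zero_per_gpu_gb(params_billions: float, n_gpus: int, stage: int) -> float:
    """Rough per-GPU memory (GB) for BF16 weights/grads + FP32 Adam states."""
    weights = 2.0 * params_billions   # BF16: 2 bytes/param
    grads = 2.0 * params_billions     # BF16: 2 bytes/param
    optim = 12.0 * params_billions    # FP32 master + Adam m + v: 12 bytes/param
    if stage >= 1: optim /= n_gpus    # ZeRO-1: shard optimizer states
    if stage >= 2: grads /= n_gpus    # ZeRO-2: + shard gradients
    if stage >= 3: weights /= n_gpus  # ZeRO-3: + shard weights
    return weights + grads + optim

for s in range(4):
    print(f"Stage {s}: {zero_per_gpu_gb(70, 8, s):7.1f} GB/GPU")
# Stage 0: 1120.0, Stage 1: 385.0, Stage 2: 262.5, Stage 3: 140.0
```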
Trade-off: communication
In Stage 3 the weights are sharded → forward needs a weight all_gather; backward does a gradient reduce_scatter (plus another weight all_gather if parameters aren't kept resident).
Per-step communication volume (M = model size):
- Stage 1: gradient allreduce, same as DDP (~2× M counting reduce_scatter + all_gather)
- Stage 2: gradient reduce_scatter + sharded update (~2× M, still DDP-equivalent)
- Stage 3: weight all_gathers (forward + backward) + gradient reduce_scatter (~3× M, i.e. ~1.5× DDP)
Stage 3: the most memory savings, the most communication.
Practical recommendation
| Memory situation | Recommended stage |
|---|---|
| Plenty of GPU memory | Stage 1 (least communication) |
| Memory tight, communication budget OK | Stage 2 |
| Memory critical | Stage 3 |
The modern recommendation for LLM training (70B+): Stage 3 (= FSDP FULL_SHARD).
6. FSDP (PyTorch Native) — Sharding Strategies
FSDP (Fully Sharded Data Parallel, PyTorch 1.11+) is the native PyTorch implementation of ZeRO Stage 3, with no DeepSpeed dependency.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = MyModel().cuda()
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3 equivalent
    auto_wrap_policy=...,   # nested FSDP wrapping (see below)
    mixed_precision=...,    # BF16 mixed precision (see below)
)
```
Sharding strategies
FULL_SHARD (ZeRO-3)
Weights, gradients, optimizer state: everything is sharded. Max memory savings, max communication.
SHARD_GRAD_OP (ZeRO-2 equivalent)
Gradients + optimizer state are sharded. Weights stay full. Less communication.
NO_SHARD (DDP equivalent)
Nothing is sharded. Pure DDP behavior. No memory savings.
HYBRID_SHARD (newer, 2023+)
A very powerful pattern: FULL_SHARD within each node (fast NVLink), plain data parallelism across nodes.

```
Node 1: GPUs 0-7   → FULL_SHARD (intra-node)
Node 2: GPUs 8-15  → FULL_SHARD (intra-node)
Node 1 ↔ Node 2    → DDP-style gradient sync (inter-node)
```

Much faster for multi-node training: intra-node NVLink (~600 GB/s) far outpaces inter-node InfiniBand (~300 GB/s or less per node). Running FULL_SHARD across nodes is slow; HYBRID_SHARD solves exactly that.
_HYBRID_SHARD_ZERO2 (Stage-2 hybrid)
SHARD_GRAD_OP intra-node, DDP inter-node. Smaller memory savings than HYBRID_SHARD, but still substantial.
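A minimal HYBRID_SHARD usage sketch; by default FSDP builds the intra-node shard group and the inter-node replicate group itself (assuming the standard one-process-per-GPU launch):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard intra-node, replicate inter-node
    device_id=local_rank,
)
```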
auto_wrap_policy
FSDP wants nested wrapping: each transformer block becomes its own FSDP unit. Wrapping the whole model in a single FSDP unit is inefficient.

```python
import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

my_auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},  # e.g., the Llama decoder block class
)
model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)
```

Each TransformerBlock becomes a separate FSDP unit → a separate all_gather/reduce_scatter per layer, overlapped with computation.
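If the block class isn't convenient to name, PyTorch also ships a size-based policy that wraps any submodule above a parameter threshold. A sketch (the 10M threshold here is an arbitrary choice):

```python
import functools

from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Wrap every submodule with >= 10M parameters as its own FSDP unit.
policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(1e7))
model = FSDP(model, auto_wrap_policy=policy)
```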
Mixed precision in FSDP

```python
from torch.distributed.fsdp import MixedPrecision

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=mp_policy)
```
7. DeepSpeed vs FSDP — Comparison
Both implement ZeRO. The differences:
DeepSpeed
- Microsoft Research (2020)
- Owns ZeRO (the original paper)
- Feature-rich: pipeline parallelism, MoE, an inference engine, training optimizations
- Complex configuration: 100+ options in `ds_config.json`
- Strong in production
FSDP (PyTorch native)
- Meta (PyTorch 1.11+, mature as of 2.0+)
- Simpler API
- PyTorch native: `torch.distributed.fsdp`
- Meta used FSDP for modern Llama 3 training
Which one?
| Use case | Recommended |
|---|---|
| LLM training (up to ~70B) | FSDP (simple, sufficient) |
| LLM training (>100B) | DeepSpeed (combine pipeline parallelism + 3D) |
| Research, iterative work | FSDP (PyTorch ecosystem) |
| HuggingFace transformers + Accelerate | Either (Accelerate supports both) |
| Production-grade big models | DeepSpeed (battle-tested) |
Migration
The modern trend: FSDP is sufficient for most workloads. Llama 3, Mistral, and most open-source training follow Meta's patterns. DeepSpeed stays strong in enterprise and Microsoft-ecosystem settings.
HuggingFace Accelerate
Wraps both:

```bash
accelerate config   # interactive setup
accelerate launch train.py
```

Beginner-friendly; it abstracts the details away. In production you sometimes lose low-level access.
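For reference, the Accelerate training loop looks like this. A minimal sketch; the same script then runs under DDP, FSDP, or DeepSpeed depending on what `accelerate config` selected:

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```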
8. 3D Parallelism — DP × TP × PP
For frontier-scale (1T+ parameter) training, no single strategy is enough. 3D = combining three axes:
Axis 1: Data Parallelism (DP)
Each DP rank gets a different batch shard. Communication: gradient allreduce.
Axis 2: Tensor Parallelism (TP)
Same layer, but the TP ranks shard its matrices. Communication: all_gather at every layer.
Axis 3: Pipeline Parallelism (PP)
Different layers live on different PP ranks. Communication: pipeline activations.
Hybrid example

```
GPT-4 hypothesis (rumored): ~25,000 GPUs
DP=16 × TP=8 × PP=200 = 25,600 positions
DP=16:  16 different batch shards
TP=8:   8 GPUs share each layer's matrices
PP=200: model split into 200 pipeline stages
```
Why 3D?
- TP alone: too much per-layer communication → bandwidth limit
- PP alone: pipeline bubbles (idle GPUs) → throughput drops
- DP alone: memory limit (full model on every GPU)
- Combined 3D: each axis's weaknesses are compensated by the others
Practical sweet spot
Llama 3 70B training: DP=128 × TP=8 (TP intra-node, DP across nodes), with no pipeline parallelism (it wasn't a ~25K-GPU run).
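Such a DP × TP layout can be expressed directly with PyTorch's device-mesh API. A sketch assuming PyTorch 2.2+ and 16 GPUs arranged as 2 (DP) × 8 (TP):

```python
from torch.distributed.device_mesh import init_device_mesh

# 2-D mesh: rows are data-parallel replicas, columns are tensor-parallel shards.
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))
dp_group = mesh.get_group("dp")  # gradient allreduce runs over this group
tp_group = mesh.get_group("tp")  # per-layer activation collectives run here
```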
Sequence Parallelism (SP) — the 4th axis
For long context (128K+): shard the sequence dimension with SP. SP × TP × DP is effectively 4D. Used in Llama 3 long-context training.
Module 17 (Distributed Training) covers all of this in full detail.
9. Production Decision Matrix
Model size + hardware + budget → choosing the distributed strategy.
Scenario 1: 7B model, 4× A100 80GB
- Memory: comfortable
- Strategy: DDP or FSDP NO_SHARD
- Reason: communication-efficient, simple
Scenario 2: 13B model, 8× A100 80GB
- Memory tight (13B × ~20 bytes = 260 GB)
- Strategy: FSDP SHARD_GRAD_OP
- Reason: shard optimizer + gradients, keep weights full
Scenario 3: 70B model, 8× H100 80GB
- Memory critical
- Strategy: FSDP FULL_SHARD
- Reason: full sharding is required
Scenario 4: 70B model, 64× H100 (8 nodes × 8 GPUs)
- Multi-node, mixed bandwidth
- Strategy: FSDP HYBRID_SHARD (FULL_SHARD intra-node, DP inter-node)
- Reason: NVLink is fast intra-node, InfiniBand slower inter-node
Scenario 5: 405B Llama 3.1, 1024× H100
- Frontier scale
- Strategy: 3D parallelism (FSDP + TP + sequence parallelism)
- Reason: a single axis isn't enough
Scenario 6: fine-tuning 7B, 1× A100 40GB
- Single GPU
- Strategy: LoRA + QLoRA (Module 21)
- Reason: no distribution needed; parameter-efficient
A Turkish perspective
95% of Turkish companies sit between Scenarios 1-3: renting 1-8 GPUs on Modal, Runpod, and the like. DDP or FSDP is enough. Frontier scale (Scenario 5) shows up only in academic research or TÜBİTAK-scale projects.
```python
import os
from functools import partial

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
    BackwardPrefetch,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Init
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# FSDP wrap policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
auto_wrap_policy = partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

# Mixed precision
mp = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Wrap with FSDP
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=mp,
    device_id=local_rank,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # overlap optimization
    use_orig_params=True,
)

# Optimizer (FSDP handles the sharded state automatically)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Training loop (dataloader yields batches with .input_ids / .labels)
for batch in dataloader:
    outputs = model(batch.input_ids, labels=batch.labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Production-grade FSDP setup — Llama 3 fine-tuning.
10. Bridge to Module 17
This lesson laid the foundations of distributed training. Module 17 (Pre-training Compute Engineering) goes into detail:
- DP, FSDP/ZeRO deep mathematics
- Megatron-LM-style tensor parallelism
- Pipeline parallelism: GPipe + 1F1B + interleaved schedules
- Sequence parallelism + Ring Attention
- 3D parallelism configuration optimization
- Communication pattern analysis (NVLink, IB, RDMA)
- Failure recovery + checkpointing
- Elastic training (dynamic resizing)
- Performance modeling
Pre-work checklist
Before moving on to Module 17:
- Train a simple MLP with DDP (2 GPUs)
- Try fine-tuning a 7B model with FSDP FULL_SHARD
- Profile memory (Module 5.3) and observe the effect of FSDP sharding
- Read a communication trace with NCCL_DEBUG=INFO
- Configure HuggingFace Accelerate
Practical advice
Before going to production:
- Prototype on a single GPU
- Validate with DDP on 2-4 GPUs
- Scale up with FSDP
- Optimize with the Module 17 patterns
11. Mini Exercises
1. Memory savings: a 13B model on 8× A100 40GB. Compare memory for DDP vs FSDP FULL_SHARD.
2. Communication cost: a 70B model on 8 GPUs with FSDP FULL_SHARD. What is the per-step communication volume?
3. HYBRID_SHARD decision: 4 nodes × 8 GPUs = 32 GPUs, NVLink intra-node, InfiniBand inter-node. Which strategy?
4. DDP bucket size: fine-tuning Llama 3 8B on an A100 cluster. What is the optimal bucket_cap_mb?
5. DeepSpeed vs FSDP: training a 400B custom model. Which one?
What Did We Learn in This Lesson?
✓ Why distributed training — a 70B model needs ~1.17 TB of memory
✓ Data parallelism (DDP) — full replication + gradient sync
✓ DDP gradient bucketing + overlap — the production pattern
✓ Model parallelism types: TP (Megatron), PP (GPipe), SP (Ring Attention)
✓ ZeRO Stages 1/2/3 — optimizer/gradient/weight sharding
✓ FSDP strategies: FULL_SHARD, SHARD_GRAD_OP, HYBRID_SHARD, NO_SHARD
✓ DeepSpeed vs FSDP comparison
✓ 3D parallelism — DP × TP × PP at frontier scale
✓ Production decision matrix — model size × hardware × strategy
✓ Bridge to Module 17 — ready for deep distributed training
Next Lesson
5.7 — Debug Arsenal: register_hook, anomaly mode, torch.utils.benchmark
The toolkit for when things break in production. Intercepting forward/backward with hooks, anomaly detection, reproducibility patterns, deterministic mode. The LLM engineer's daily debugging arsenal.
Frequently Asked Questions
Q: How much slower is FSDP than DDP?
A: Generally a small overhead (~5-15%), but the memory savings are dramatic. FSDP communication is a forward all_gather + backward reduce_scatter versus DDP's single allreduce. On an NVLink-rich cluster (8× H100) the difference isn't critical; inter-node InfiniBand widens the gap (mitigated by HYBRID_SHARD). Practical decision: if memory is comfortable → DDP (fastest). If memory is tight → FSDP is necessary; tolerate the ~10% overhead.