
Şükrü Yusuf KAYA
55 min read
Advanced
torch.distributed In Depth: DDP, FSDP, ZeRO Stages — Production Distributed Training
🌐 How do you train a 70B model on an 80 GB GPU?
Modern LLM training is impossible without a distributed framework. DDP, FSDP, ZeRO — each is a different strategy for distributed training. In 5.4 we learned the communication primitives with NCCL. Now the production stack: which strategy to use when, the memory/communication trade-offs, fine-tuning Llama 3 70B on an 8× H100 cluster. 55 minutes from now: everything a production LLM engineer needs to know, ready to move on to Module 17.

Lesson Map#

  1. Why is distributed training necessary?
  2. Data parallelism (DP, DDP)
  3. DDP gradient bucketing + overlap deep dive
  4. Model parallelism types: tensor, pipeline, sequence
  5. ZeRO (Zero Redundancy Optimizer): Stages 1, 2, 3
  6. FSDP (PyTorch native): FULL_SHARD vs SHARD_GRAD_OP vs HYBRID_SHARD
  7. DeepSpeed vs FSDP comparison
  8. 3D Parallelism: DP × TP × PP
  9. Production decision matrix
  10. Bridge to Module 17

1. Why Is Distributed Training Necessary?#

A 70B Llama 3 fine-tune scenario:
  • Weights BF16: 140 GB
  • Master FP32: 280 GB
  • Adam states (m+v): 560 GB
  • Gradients BF16: 140 GB
  • Activations: ~50 GB
  • Total: ~1.17 TB
On a single 80 GB H100: OOM.
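This arithmetic generalizes to any model size. A minimal sketch of the per-parameter accounting (2 B BF16 weights + 2 B BF16 gradients + 4 B FP32 master + 8 B Adam m+v = 16 bytes/param), assuming Adam with BF16 mixed precision; the activation term is workload-dependent, so it is passed in separately:

```python
def training_memory_gb(params_billion: float, activations_gb: float = 50.0) -> dict:
    """Rough training-memory estimate for BF16 mixed precision + Adam."""
    p = params_billion  # 1 byte/param over a billion params ~= 1 GB
    return {
        "weights_bf16": 2 * p,
        "grads_bf16": 2 * p,
        "master_fp32": 4 * p,
        "adam_m_v": 8 * p,
        "activations": activations_gb,
        "total": 16 * p + activations_gb,
    }

print(training_memory_gb(70)["total"])  # -> 1170.0 GB ~= 1.17 TB
```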

Solutions#

  1. Activation checkpointing: ~90% activation savings → ~5 GB. Total ~1.13 TB. Still too much.
  2. 8-bit optimizer: optimizer state in 8 bits → Adam states drop to ~140 GB. Still well over 700 GB.
  3. Distributed training: shard the model + states across multiple GPUs.

Which axis do we parallelize on?#

  1. Data parallel: every GPU holds the full model, works on a different batch shard. Communication: gradient sync.
  2. Model parallel (tensor): every GPU holds a slice of the model. Communication: intermediate activations.
  3. Pipeline parallel: every GPU holds a set of layers. Communication: pipeline activations.
  4. Sequence parallel: same layer, but different sequence positions. For long context.
Modern LLM training: a hybrid combination (3D parallelism). Module 17 goes deep on this.

2. Data Parallelism (DP, DDP)#

The simplest and most common. Every GPU holds a full model copy and works on a different mini-batch shard.

Legacy: DataParallel (deprecated)#

torch.nn.DataParallel
— single process, multi-threaded. Bottlenecked by the Python GIL. Don't use it.

Modern: DistributedDataParallel (DDP)#

torch.nn.parallel.DistributedDataParallel
— multi-process. Each GPU gets its own Python process. NCCL communication.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for batch in loader:
    out = model(batch.cuda())
    loss = criterion(out, target.cuda())
    loss.backward()  # gradient sync is automatic (allreduce)
    optimizer.step()
    optimizer.zero_grad()
```

Memory balance#

Every GPU holds the full model + full optimizer state. A 70B model on 8 GPUs:
  • each GPU holds ~1.17 TB → still OOM.
DDP is for small-to-medium models. Large LLMs need FSDP/ZeRO.

Launch#

```
torchrun --nproc_per_node=8 train.py
```
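The same script scales to multiple nodes via torchrun's rendezvous flags. A sketch assuming two 8-GPU nodes, where NODE_RANK and HEAD_ADDR (the head node's address) are placeholders you set per node:

```
torchrun --nnodes=2 --nproc_per_node=8 \
         --node_rank=$NODE_RANK \
         --rdzv_backend=c10d --rdzv_endpoint=$HEAD_ADDR:29500 \
         train.py
```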

3. DDP Gradient Bucketing + Overlap — Deep Dive#

Modern DDP allreduces gradients in buckets during the backward pass.

Naive (without bucketing)#

```
after backward() finishes:
    allreduce gradient_1
    allreduce gradient_2
    ...
    allreduce gradient_N    # each allreduce requires a GPU sync
```
Sequential allreduce → communication cannot start until backward finishes.

DDP bucketing#

```
during backward:
    Bucket 1 ready (25 MB of gradients) → allreduce START (in the background)
    Bucket 2 ready → allreduce START
    Bucket 3 ready → ...
```
Communication and computation overlap → ~15-25% speedup.

Bucket size#

PyTorch default: 25 MB. Tuning:

```python
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=50)
```

  • Small buckets (5 MB): more overlap, but more NCCL calls → kernel-launch overhead
  • Large buckets (100 MB): fewer NCCL calls, but less overlap
25 MB is the sweet spot for most workloads.
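The honest way to pick a value is to measure. A rough timing sketch, assuming `raw_model` and a `run_steps` training-step helper exist in your script (a real benchmark would rebuild the model fresh per trial rather than re-wrapping it):

```python
import time

for cap_mb in (5, 25, 50, 100):
    ddp_model = DDP(raw_model, device_ids=[local_rank], bucket_cap_mb=cap_mb)
    run_steps(ddp_model, n=5)   # warmup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    run_steps(ddp_model, n=20)  # measured steps
    torch.cuda.synchronize()
    per_step_ms = (time.perf_counter() - t0) / 20 * 1e3
    if dist.get_rank() == 0:
        print(f"bucket_cap_mb={cap_mb}: {per_step_ms:.1f} ms/step")
    del ddp_model               # drop the DDP wrapper before the next trial
```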

gradient_as_bucket_view#

```python
model = DDP(model, device_ids=[local_rank], gradient_as_bucket_view=True)
```
Memory savings: gradients become views into the bucket — no extra tensors. Recommended on modern PyTorch.

Static graph (advanced)#

```python
model._set_static_graph()
```
Fixes DDP's gradient-sync order → enables more aggressive optimization. Use it only when the model has no conditional control flow.

Mixed precision interaction#

```python
from torch.amp import autocast

with autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
    loss = criterion(out, target)
loss.backward()  # BF16 gradient sync (efficient)
```
BF16 gradients are half the size → communication is ~2× faster. Recent PyTorch/NCCL versions support BF16 natively.

4. Model Parallelism Types#

When DDP isn't enough, parallelize the model itself. Three main types:

Tensor Parallelism (TP)#

Split a tensor (e.g. a weight matrix W) column-wise or row-wise across GPUs. Each GPU does a partial computation; results are combined with all_gather.

```
W: (4096, 4096)
GPU 0: W[:, 0:2048]
GPU 1: W[:, 2048:4096]
Compute: y = x @ W → each GPU holds a partial y, then all_gather
```

The classic Megatron-LM implementation. Communication-intensive (activation sync at every layer).
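A forward-only sketch of the column-parallel idea with raw collectives (training would also need the collective's backward handled, which Megatron does internally); it assumes the process group is already initialized and the output dimension divides evenly across ranks:

```python
import torch
import torch.distributed as dist

def column_parallel_matmul(x: torch.Tensor, w_full: torch.Tensor) -> torch.Tensor:
    """Each rank multiplies by its column slice of W, then all ranks gather."""
    rank, world = dist.get_rank(), dist.get_world_size()
    cols = w_full.shape[1] // world
    w_shard = w_full[:, rank * cols:(rank + 1) * cols]  # local column block
    y_local = x @ w_shard                               # partial output
    parts = [torch.empty_like(y_local) for _ in range(world)]
    dist.all_gather(parts, y_local)                     # collect every block
    return torch.cat(parts, dim=-1)                     # reassemble full y
```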

Pipeline Parallelism (PP)#

Split the model layer-wise across GPUs, into pipeline stages.

```
GPU 0: layers 0-10
GPU 1: layers 11-20
GPU 2: layers 21-30
GPU 3: layers 31-40
Forward:  GPU 0 → 1 → 2 → 3
Backward: reverse direction
```

GPipe, PipeDream, 1F1B: pipeline scheduling algorithms.
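A forward-only toy of two pipeline stages using point-to-point communication (real schedulers like GPipe/1F1B stream many microbatches through to shrink the idle bubble); `stage0`, `stage1`, and the activation shape are assumptions of the sketch:

```python
import torch
import torch.distributed as dist

def two_stage_forward(x, stage0, stage1, act_shape):
    """Rank 0 runs the first half of the model, rank 1 the second half."""
    if dist.get_rank() == 0:
        h = stage0(x)                        # layers 0..k
        dist.send(h.contiguous(), dst=1)     # ship activations downstream
        return None
    h = torch.empty(act_shape, device="cuda")
    dist.recv(h, src=0)                      # receive stage-0 activations
    return stage1(h)                         # layers k+1..N
```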

Sequence Parallelism (SP)#

For long context. The same layer is sharded by sequence position. Combined with FlashAttention.

ZeRO / FSDP#

Hybrid: shard optimizer states + gradients + weights. Communication mostly resembles DP-style allreduce, but the optimizer/gradient/weight shards are gathered at different points in the step.
Modern LLM training mostly uses FSDP or DeepSpeed ZeRO. Pure TP/PP lives in frontier labs with custom infrastructure.

5. ZeRO Stages — Zero Redundancy Optimizer#

DeepSpeed ZeRO (Rajbhandari et al., 2020): shard the optimizer states, gradients, and weights across GPUs. Reduces memory drastically.

Memory breakdown problem#

In standard DDP every GPU holds the full model. For 70B:
  • Weights BF16: 140 GB
  • Master FP32: 280 GB
  • Adam m+v: 560 GB
  • Gradients BF16: 140 GB
  • Total per GPU: ~1.1 TB
8 GPUs = 8× redundancy (pre-ZeRO).

ZeRO Stage 1: Optimizer States Sharded#

The optimizer states — FP32 master copy plus Adam m+v (280 + 560 = 840 GB) — are sharded, so each GPU keeps only its slice:
  • Optimizer states: 840 GB / 8 = 105 GB per GPU
  • Weights + gradients: full (140 + 140 = 280 GB per GPU)
  • Total per GPU: ~435 GB (including ~50 GB activations)
Still big, but ~60% savings.

ZeRO Stage 2: + Gradients Sharded#

Stage 1 + gradient sharding:
  • Optimizer states: 105 GB
  • Gradients: 140 GB / 8 = 17.5 GB
  • Weights: 140 GB (full)
  • Total per GPU: ~310 GB
~73% savings.

ZeRO Stage 3: + Weights Sharded (= FSDP)#

Stage 2 + weight sharding:
  • Optimizer states: 105 GB
  • Gradients: 17.5 GB
  • Weights: 140 GB / 8 = 17.5 GB
  • Total per GPU: ~190 GB (~140 GB of states plus activations)
~85% savings. Note that this still exceeds a single 80 GB H100: in practice, 70B on 8× 80 GB needs Stage 3 plus activation checkpointing and CPU offload — or more GPUs (at 16 GPUs the sharded states alone drop to ~70 GB per GPU).

Trade-off: communication#

In Stage 3 the weights themselves are sharded → all_gather weights in the forward pass, reduce_scatter gradients in the backward pass.
Per-step communication volume (M = model size in bytes):
  • Stage 1: gradient reduce_scatter + all_gather of updated weights (~2M, same volume as a DDP ring allreduce)
  • Stage 2: same pattern, with gradients kept sharded (~2M)
  • Stage 3: weight all_gather in forward and backward + gradient reduce_scatter (~3M)
Stage 3: maximum memory savings, maximum communication — roughly 1.5× DDP, per the ZeRO paper.
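To make the trade-off concrete, the same accounting as a back-of-the-envelope sketch, with M taken as the BF16 gradient/weight volume of a 70B model:

```python
M = 140  # GB: 70B parameters in BF16

per_step_comm_gb = {
    # Stages 1/2: reduce_scatter grads + all_gather updated weights ~ 2M
    "DDP / ZeRO-1 / ZeRO-2": 2 * M,
    # Stage 3: weight all_gather (fwd) + weight all_gather (bwd)
    #          + gradient reduce_scatter ~ 3M
    "ZeRO-3 / FSDP FULL_SHARD": 3 * M,
}
for strategy, gb in per_step_comm_gb.items():
    print(f"{strategy}: ~{gb} GB moved per step")  # 280 vs 420 GB
```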

Practical recommendation#

| Memory situation | Recommended stage |
| --- | --- |
| Enough GPU memory | Stage 1 (least communication) |
| Memory tight, communication budget OK | Stage 2 |
| Memory critical | Stage 3 |

The modern recommendation for LLM training (70B+): Stage 3 (= FSDP FULL_SHARD).
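On the DeepSpeed side, the stage is a single config field. A minimal sketch that passes the config as a dict to deepspeed.initialize — the batch size, model, and surrounding script are placeholders:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # pick 1, 2, or 3 per the table above
        "overlap_comm": True,  # overlap communication with backward
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```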

6. FSDP (PyTorch Native) — Shard Strategies#

FSDP (Fully Sharded Data Parallel, PyTorch 1.11+) — the native PyTorch implementation of ZeRO Stage 3, with no DeepSpeed dependency.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = MyModel().cuda()
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3 equivalent
    auto_wrap_policy=...,   # nested FSDP wrapping
    mixed_precision=...,    # BF16 mixed precision
)
```

Sharding strategies#

FULL_SHARD (ZeRO-3)

Weights, gradients, optimizer state — all sharded. Max memory savings, max communication.

SHARD_GRAD_OP (ZeRO-2 equivalent)

Gradients + optimizer state sharded. Weights stay full. Less communication.

NO_SHARD (DDP equivalent)

Nothing is sharded. Pure DDP. No memory savings.

HYBRID_SHARD (new, 2023+)

A very powerful pattern: FULL_SHARD inside a node (fast NVLink), plain data parallelism across nodes.

```
Node 1: GPUs 0-7    FULL_SHARD (intra-node)
Node 2: GPUs 8-15   FULL_SHARD (intra-node)
Node 1 ↔ Node 2:    DDP (inter-node)
```

Much faster for multi-node training: intra-node NVLink (~600 GB/s on A100, ~900 GB/s on H100) versus inter-node InfiniBand, which is roughly an order of magnitude slower per link. Doing FULL_SHARD across nodes is slow — HYBRID_SHARD solves exactly that.
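Selecting it is one enum value; when no explicit process group is passed, FSDP derives the intra-node shard group and inter-node replication group from the cluster topology itself. A sketch reusing the auto_wrap_policy defined in the next subsection:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard intra-node, replicate inter-node
    auto_wrap_policy=my_auto_wrap_policy,
    device_id=local_rank,
)
```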

_HYBRID_SHARD_ZERO2 (Stage-2 hybrid)

Intra-node SHARD_GRAD_OP, inter-node DDP. Less memory saved, but still meaningful savings.

auto_wrap_policy#

FSDP wants nested wrapping — each transformer block as its own FSDP unit. Wrapping the whole model in a single FSDP unit is inefficient.

```python
import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

my_auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},  # e.g. the Llama decoder-block class
)
model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)
```

Each TransformerBlock becomes its own FSDP unit → a separate all_gather/reduce_scatter per layer, which overlaps with computation.

Mixed precision in FSDP#

```python
from torch.distributed.fsdp import MixedPrecision

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=mp_policy)
```

7. DeepSpeed vs FSDP — A Comparison#

Both implement ZeRO. The differences:

DeepSpeed#

  • Microsoft Research (2020)
  • Owns ZeRO (the original paper)
  • Feature-rich: pipeline parallelism, MoE, an inference engine, training optimizations
  • Complex configuration: a ds_config.json with 100+ options
  • Strong in production

FSDP (PyTorch native)#

  • Meta (PyTorch 1.11+, mature since 2.0)
  • Simpler API
  • PyTorch native — torch.distributed.fsdp
  • Meta used FSDP for Llama 3 training

Which one?#

| Use case | Recommendation |
| --- | --- |
| LLM training (≤70B) | FSDP (simple, sufficient) |
| LLM training (>100B) | DeepSpeed (pipeline parallelism + 3D combinations) |
| Research, fast iteration | FSDP (PyTorch ecosystem) |
| HuggingFace transformers + Accelerate | Either (Accelerate supports both) |
| Production-grade big models | DeepSpeed (battle-tested) |

Migration#

The modern trend: FSDP is enough for most workloads. Llama 3, Mistral, and most of open source follow Meta's patterns. DeepSpeed remains strong in enterprise and the Microsoft ecosystem.

HuggingFace Accelerate#

It wraps both:

```
accelerate config           # interactive setup
accelerate launch train.py
```

Beginner-friendly, abstracts the details away. In production you sometimes lose low-level access.
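The training-loop side stays small. A sketch of the Accelerate API — whether DDP, FSDP, or DeepSpeed runs underneath is decided by the `accelerate config` answers, not by the code:

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```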

8. 3D Parallelism — DP × TP × PP#

For frontier-scale (1T+ parameter) training, no single strategy is enough. 3D = a combination of three axes:

Axis 1: Data Parallelism (DP)#

Each DP rank gets a different batch shard. Communication: gradient allreduce.

Axis 2: Tensor Parallelism (TP)#

Same layer, but the TP ranks shard its matrices. Communication: all_gather at every layer.

Axis 3: Pipeline Parallelism (PP)#

Different layers live on different PP ranks. Communication: pipeline activations.

Hybrid example#

```
GPT-4 hypothesis (rumored): ~25,000 GPUs
DP=16 × TP=8 × PP=200 = 25,600 total positions
DP=16:  16 different batch shards
TP=8:   8 GPUs share each layer's matrices
PP=200: model split into 200 pipeline stages
```

Why 3D?#

  • TP alone: too much per-layer communication → bandwidth-limited
  • PP alone: pipeline bubbles (idle GPUs) → throughput drops
  • DP alone: memory-limited (full model on every GPU)
  • Combined 3D: each axis's weakness is compensated by another

Practical sweet spot#

Llama 3 70B training: DP=128 × TP=8 (TP intra-node, DP across nodes) — no pipeline parallelism was used (it wasn't a ~25K-GPU run).

Sequence Parallel (SP) — the 4th axis#

For long context (128K+): shard the sequence dimension with SP. SP × TP × DP = effectively 4D. Used in Llama 3 long-context training.
Module 17 (Distributed Training) covers all of this in full detail.

9. Production Decision Matrix#

Model size + hardware + budget → the distributed strategy choice.

Scenario 1: 7B model, 4× A100 80GB#

  • Memory: comfortable
  • Strategy: DDP or FSDP NO_SHARD
  • Reason: communication-efficient, simple

Scenario 2: 13B model, 8× A100 80GB#

  • Memory tight (13B × ~16-20 bytes/param ≈ 210-260 GB)
  • Strategy: FSDP SHARD_GRAD_OP
  • Reason: optimizer + gradients sharded, weights full

Scenario 3: 70B model, 8× H100 80GB#

  • Memory critical
  • Strategy: FSDP FULL_SHARD (plus activation checkpointing / CPU offload, per Section 5)
  • Reason: full sharding required

Scenario 4: 70B model, 64× H100 (8 nodes × 8 GPUs)#

  • Multi-node, mixed bandwidth
  • Strategy: FSDP HYBRID_SHARD (intra-node FULL_SHARD, inter-node DP)
  • Reason: NVLink is fast intra-node, InfiniBand is slower inter-node

Scenario 5: 405B Llama 3.1, 1024× H100#

  • Frontier scale
  • Strategy: 3D parallelism (FSDP + TP + sequence parallel)
  • Reason: a single axis isn't enough

Scenario 6: fine-tuning 7B, 1× A100 40GB#

  • Single GPU
  • Strategy: LoRA + QLoRA (Module 21)
  • Reason: no need for distributed training, parameter-efficient

A Turkish perspective#

95% of Turkish companies sit in Scenarios 1-3, renting 1-8 GPUs on the likes of Modal/Runpod. DDP or FSDP is enough. Frontier scale (Scenario 5) only appears in academic research or TÜBİTAK-scale projects.
```python
import os
from functools import partial

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
    BackwardPrefetch,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Init
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# FSDP wrap policy: each decoder layer becomes its own FSDP unit
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
auto_wrap_policy = partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

# Mixed precision
mp = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Wrap with FSDP
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=mp,
    device_id=local_rank,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # overlap optimization
    use_orig_params=True,
)

# Optimizer (FSDP handles the sharded state automatically)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Training loop
for batch in dataloader:
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()
    outputs = model(input_ids, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
Production-grade FSDP setup — Llama 3 fine-tuning.

10. Bridge to Module 17#

This lesson laid the foundations of distributed training. Module 17 (Pre-training Compute Engineering) goes deeper:
  • The deep math of DP and FSDP/ZeRO
  • Megatron-LM-style tensor parallelism
  • Pipeline parallelism: GPipe + 1F1B + interleaved schedules
  • Sequence parallelism + ring attention
  • 3D parallelism configuration optimization
  • Communication pattern analysis (NVLink, IB, RDMA)
  • Failure recovery + checkpointing
  • Elastic training (dynamic resizing)
  • Performance modeling

Preparation checklist#

Before moving on to Module 17:
  • Train a simple MLP with DDP (2 GPUs)
  • Try fine-tuning a 7B model with FSDP FULL_SHARD
  • Observe the effect of FSDP sharding while memory profiling (Module 5.3)
  • Read a communication trace with NCCL_DEBUG=INFO
  • Configure HuggingFace Accelerate

Practical advice#

Before going to production:
  1. Prototype on a single GPU
  2. Validate with DDP on 2-4 GPUs
  3. Scale up with FSDP
  4. Optimize with the Module 17 patterns

11. Mini Exercises#

  1. Memory savings: a 13B model on 8× A100 40GB. How do DDP and FSDP FULL_SHARD compare on memory?
  2. Communication cost: a 70B model with 8-GPU FSDP FULL_SHARD. What is the per-step communication volume?
  3. HYBRID_SHARD decision: 4 nodes × 8 GPUs = 32 GPUs, NVLink intra-node, InfiniBand inter-node. Which strategy?
  4. DDP bucket size: a Llama 3 8B fine-tune on an A100 cluster. What's the optimal bucket_cap_mb?
  5. DeepSpeed vs FSDP: training a custom 400B model. Which one?

What Did We Learn in This Lesson?#

✓ Why distributed training — a 70B model needs ~1.17 TB of memory
✓ Data parallelism (DDP) — full replication + gradient sync
✓ DDP gradient bucketing + overlap — the production pattern
✓ Model parallelism types: TP (Megatron), PP (GPipe), SP (ring attention)
✓ ZeRO Stages 1/2/3 — optimizer/gradient/weight sharding
✓ FSDP strategies: FULL_SHARD, SHARD_GRAD_OP, HYBRID_SHARD, NO_SHARD
✓ DeepSpeed vs FSDP comparison
✓ 3D parallelism — DP × TP × PP at frontier scale
✓ Production decision matrix — model size × hardware × strategy
✓ Bridge to Module 17 — ready for deep distributed training

Next Lesson#

5.7 — Debug Arsenal: register_hook, anomaly mode, torch.utils.benchmark. The toolkit for when things break in production: intercepting forward/backward with hooks, anomaly detection, reproducibility patterns, deterministic mode. The LLM engineer's daily debug arsenal.

Frequently Asked Questions

How much slower is FSDP than DDP?

Generally a small overhead (~5-15%), but the memory savings are dramatic. FSDP communication: forward all_gather + backward reduce_scatter, versus DDP's single allreduce. On an NVLink-rich cluster (8× H100) the difference isn't critical; inter-node InfiniBand widens the gap (mitigated by HYBRID_SHARD). Practical decision: memory comfortable → DDP (fastest). Memory tight → FSDP is necessary, and the ~10% overhead is worth tolerating.
