Data Parallelism (DDP): Foundation of Multi-GPU LLM Training — AllReduce and NCCL Anatomy
Distributed Data Parallel (DDP) anatomy: model replication across GPUs, mini-batch split, forward/backward independent per GPU, gradient AllReduce synchronization. NCCL (NVIDIA Collective Communication Library), ring-allreduce algorithm, bandwidth math. PyTorch DDP API, launch scripts, common pitfalls (uneven batches, batch norm sync).
Şükrü Yusuf KAYA
70 min read
Advanced
DDP — the cornerstone of modern LLM training
You cannot train Llama-3-8B on a single H100 (8B params + activations + optimizer state exceed 80 GB → one GPU is not enough). You need 64+ GPUs. Distributed training has three main approaches: data parallel (DDP), model parallel (TP, PP), and sharded (FSDP, ZeRO). This lesson covers the foundation of data parallelism: the model is replicated on every GPU and the mini-batch is split across them. Gradients are synchronized with AllReduce, and NCCL is the NVIDIA library that optimizes this sync with the ring-allreduce algorithm. After 70 minutes you will have a deep grasp of DDP's mathematical and systems anatomy, the PyTorch DDP API, production launch scripts, and the common pitfalls.
Lesson Map (10 Sections)#
- The single-GPU limit — why distributed training is necessary
- The DDP approach — replicate the model + split the batch
- Independent forward pass — parallel compute
- Backward gradient sync — AllReduce primitives
- NCCL and ring-allreduce — efficient bandwidth
- PyTorch DDP API — the DDP class, init
- Launch scripts — torchrun, Slurm
- Effective batch size — DDP scaling
- Common pitfalls — uneven batches, BN sync
- DDP's limit — why DDP is not enough at 7B+
1-4. DDP Architecture#
1.1 The single-GPU limit#
Llama-3-8B training memory needs:
- Params: 8B × 4 bytes (fp32) = 32 GB
- Gradients: 8B × 4 bytes = 32 GB
- Optimizer state (AdamW: m + v): 2 × 8B × 4 bytes = 64 GB
- Activations: 2048 × 8192 × 32 layers × 2 bytes ≈ 1 GB (bf16, rough per-sequence estimate; assumes memory-efficient attention)
- Total: ~130 GB
A single 80 GB H100 is not enough — you need multiple GPUs.
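As a quick sanity check, here is the same arithmetic as a small sketch (the activation term is the same coarse per-sequence estimate as above, and the variable names are only for illustration):

```python
# Rough training-memory estimate for an 8B-parameter model with fp32 AdamW.
GB = 1e9
params = 8e9
fp32, bf16 = 4, 2

weights     = params * fp32                # model weights
grads       = params * fp32                # gradients
optimizer   = 2 * params * fp32            # AdamW m and v moments
activations = 2048 * 8192 * 32 * bf16      # seq_len x hidden x layers, bf16 (coarse)

total = weights + grads + optimizer + activations
print(f"weights     {weights / GB:6.1f} GB")
print(f"grads       {grads / GB:6.1f} GB")
print(f"optimizer   {optimizer / GB:6.1f} GB")
print(f"activations {activations / GB:6.1f} GB")
print(f"total       {total / GB:6.1f} GB   vs. 80 GB on a single H100")
```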
1.2 The DDP approach#
Each GPU holds a full copy of the model. The mini-batch is split across N GPUs:
Batch = 32 sequences, 4 GPUs → each GPU processes 8 sequences
The forward pass is independent — each GPU computes on its own shard of the mini-batch.
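A minimal sketch of the split with hypothetical tensors — in the full script later in this lesson, DistributedSampler performs this sharding at the dataset level:

```python
import torch

world_size = 4                        # number of GPUs
global_batch = torch.randn(32, 2048)  # 32 sequences (hypothetical data)

# Each rank takes its own contiguous shard: 32 / 4 = 8 sequences per GPU
shards = global_batch.chunk(world_size, dim=0)
for rank, shard in enumerate(shards):
    print(f"rank {rank}: local batch {tuple(shard.shape)}")  # (8, 2048)
```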
1.3 Backward gradient sync#
Forward → backward → gradient computation happens on every GPU.
Key point: the gradients must be synchronized — every GPU must apply the averaged gradient.
# Per-GPU local gradient
g_local_i = backward()
# AllReduce: sum all gradients across GPUs, divide by N
g_global = AllReduce(g_local_i) / N
# Optimizer update with the synced gradient
weights -= lr * g_global
1.4 Mathematical equivalence#
DDP loss:
L = (Σ_i loss_i) / B_total = (Σ_k sum of GPU k's losses) / B_total = (1/N) Σ_k L_k — the mean of the per-GPU mean losses L_k, when each GPU's batch size is equal.
Gradient: ∇L = (1/N) Σ_k ∇L_k — the average of the per-GPU gradients. The AllReduce mean therefore reproduces the correct full-batch gradient.
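A small single-process sketch to check this numerically: averaging per-shard gradients of a mean-reduced loss matches the full-batch gradient (the linear model and shapes are purely illustrative):

```python
import torch

torch.manual_seed(0)
w = torch.randn(16, 1, requires_grad=True)
x, y = torch.randn(32, 16), torch.randn(32, 1)

# Full-batch gradient (what a single GPU would compute)
loss_full = ((x @ w - y) ** 2).mean()
g_full, = torch.autograd.grad(loss_full, w)

# "DDP" gradient: per-shard gradients averaged (simulating AllReduce / N)
grads = []
for xs, ys in zip(x.chunk(4), y.chunk(4)):      # 4 equal shards = 4 "GPUs"
    loss_local = ((xs @ w - ys) ** 2).mean()
    grads.append(torch.autograd.grad(loss_local, w)[0])
g_ddp = torch.stack(grads).mean(dim=0)

print(torch.allclose(g_full, g_ddp, atol=1e-6))  # True
```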
1.5 Memory is not saved#
DDP saves no memory at all — every GPU holds the full model + gradients + optimizer state.
DDP only saves time (parallel compute), but the ~130 GB footprint of the 8B model still exceeds a single GPU's memory.
The solution: FSDP/ZeRO (Lesson 13.2).
5. AllReduce + NCCL#
5.1 AllReduce primitive#
AllReduce: N nodes each hold a local value, and every node wants the sum (or mean) of all of them.
Node 1: x_1
Node 2: x_2
...
Node N: x_N

After AllReduce:
Node 1: x_1 + x_2 + ... + x_N
Node 2: x_1 + x_2 + ... + x_N
...
Node N: x_1 + x_2 + ... + x_N
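In PyTorch this primitive is exposed as torch.distributed.all_reduce. A minimal sketch, assuming an NCCL process group launched with torchrun and one GPU per process (the file name is hypothetical):

```python
# all_reduce_demo.py — launch with: torchrun --nproc_per_node=4 all_reduce_demo.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

# Each rank holds its own value x_rank
x = torch.tensor([float(rank + 1)], device="cuda")

# In-place sum across all ranks; afterwards every rank holds 1 + 2 + ... + N
dist.all_reduce(x, op=dist.ReduceOp.SUM)
x /= dist.get_world_size()            # divide by N to get the mean

print(f"rank {rank}: {x.item()}")
dist.destroy_process_group()
```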
5.2 Naive AllReduce: O(N) bandwidth#
Master-node aggregator pattern:
- Every node sends its x_i to the master: ~N incoming transfers
- The master sums and broadcasts the result back: ~N outgoing transfers
- Each worker only moves 2 messages (2M bytes), but the master moves ~2NM bytes
Worse: the master becomes the bottleneck — its traffic grows linearly with N.
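A short accounting sketch of that asymmetry (the function name and the 8-node / 16 GB figures are just for illustration):

```python
# Bytes moved per node in the naive master-aggregator AllReduce, for message size M
def naive_allreduce_traffic(n_nodes: int, M: float):
    worker = 2 * M                  # send local gradient, receive the summed result
    master = 2 * (n_nodes - 1) * M  # receive from and broadcast to every other node
    return worker, master

worker, master = naive_allreduce_traffic(n_nodes=8, M=16e9)  # 16 GB of gradients
print(f"worker: {worker / 1e9:.0f} GB, master: {master / 1e9:.0f} GB")  # 32 GB vs 224 GB
```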
5.3 Ring AllReduce#
Per-node traffic is O(1) in the number of nodes — roughly 2M, independent of N. Ring topology:
N1 → N2 → N3 → ... → N1 (ring)
Algorithm:
- Reduce-scatter phase: each node passes 1/N of its data around the ring each step, accumulating partial sums
- All-gather phase: the fully reduced chunks are circulated around the ring so every node ends up with all of them
Total: 2 × (N−1) × M/N data per node (M = total message size). For large N this approaches 2M transferred per node, independent of N.
Better than the master pattern: no central bottleneck, and the cost is predictable.
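Below is a toy, pure-Python simulation of the two phases — lists stand in for GPU buffers and the chunk count equals the node count — which also counts the 2 × (N−1) chunk sends per node:

```python
import numpy as np

def ring_allreduce(values):
    """Toy ring AllReduce. values[i] = node i's local vector (all the same length)."""
    n = len(values)
    bufs = [np.array_split(v.astype(float), n) for v in values]  # n chunks per node
    chunk_sends_per_node = 0

    # Phase 1: reduce-scatter. At step s, node i sends chunk (i - s) % n to node i+1,
    # which accumulates it. After n-1 steps node i holds the full sum of chunk (i+1) % n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            bufs[(i + 1) % n][c] = bufs[(i + 1) % n][c] + bufs[i][c]
        chunk_sends_per_node += 1

    # Phase 2: all-gather. At step s, node i sends its finished chunk (i + 1 - s) % n
    # to node i+1, which overwrites its copy. After n-1 steps everyone has every chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            bufs[(i + 1) % n][c] = bufs[i][c]
        chunk_sends_per_node += 1

    # Traffic: 2*(n-1) chunk sends of size M/n per node -> ~2M total for large n
    return [np.concatenate(b) for b in bufs], chunk_sends_per_node

values = [np.arange(8) * (i + 1) for i in range(4)]   # 4 nodes, 8 elements each
out, sends = ring_allreduce(values)
print(np.allclose(out[0], sum(values)), f"{sends} chunk sends per node")  # True, 6
```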
5.4 NCCL (NVIDIA Collective Communication Library)#
NVIDIA-optimized collective ops:
- AllReduce, AllGather, ReduceScatter, Broadcast
- GPU-direct (PCIe, NVLink, NVSwitch)
- IB/RoCE for multi-node
- Hardware-aware selection of custom ring/tree algorithms
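The same collectives are available through torch.distributed on top of the NCCL backend; a minimal sketch, assuming the process group is already initialized and the CUDA device set (tensor shapes are illustrative, and reduce_scatter_tensor needs a reasonably recent PyTorch):

```python
import torch
import torch.distributed as dist

rank, world = dist.get_rank(), dist.get_world_size()

# Broadcast: rank 0's tensor is copied to every rank
x = torch.full((4,), float(rank), device="cuda")
dist.broadcast(x, src=0)

# AllGather: every rank ends up with the list of all ranks' tensors
mine = torch.full((4,), float(rank), device="cuda")
gathered = [torch.empty(4, device="cuda") for _ in range(world)]
dist.all_gather(gathered, mine)

# ReduceScatter: the elementwise sum is split so each rank keeps one chunk
inp = torch.arange(world, dtype=torch.float32, device="cuda")
chunk = torch.empty(1, device="cuda")
dist.reduce_scatter_tensor(chunk, inp, op=dist.ReduceOp.SUM)  # rank r keeps the summed element r
```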
5.5 Bandwidth math#
H100 NVLink/NVSwitch: ~900 GB/s per GPU.
Multi-node InfiniBand: ~400 GB/s aggregate per node (e.g. 8 × 400 Gb/s NDR NICs on a DGX H100-class node).
Llama-3-8B gradient sync: bf16 gradients are 8B × 2 bytes = 16 GB; ring AllReduce moves ~2× that ≈ 32 GB per GPU → ~35 ms over NVLink, ~80 ms over IB per step.
Per training step: forward 100 ms + backward 200 ms + AllReduce 50 ms = 350 ms → AllReduce is ~14% overhead, which is acceptable (and PyTorch DDP hides much of it by overlapping the AllReduce with the backward pass).
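The arithmetic behind those numbers as a sketch — nominal bandwidths, ignoring latency and the overlap of communication with backward:

```python
# Ring AllReduce time estimate: per GPU, ~2*(N-1)/N * grad_bytes moves over the wire.
def allreduce_ms(grad_bytes: float, bandwidth_gbps: float, n_gpus: int) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes   # bytes per GPU
    return traffic / (bandwidth_gbps * 1e9) * 1e3      # milliseconds

grads_bf16 = 8e9 * 2                                   # 8B params in bf16 = 16 GB
print(f"NVLink: {allreduce_ms(grads_bf16, 900, 8):.0f} ms")  # ~31 ms (≈35 ms in the text)
print(f"IB:     {allreduce_ms(grads_bf16, 400, 8):.0f} ms")  # ~70 ms (≈80 ms in the text)

step_ms = 100 + 200 + 50                               # forward + backward + AllReduce
print(f"AllReduce overhead: {50 / step_ms:.0%}")       # ~14%
```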
5.6 NCCL setup#
import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method='env://')
Env vars: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK (plus LOCAL_RANK) — torchrun sets all of these automatically.
```python
# DDP production-grade PyTorch
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup():
    dist.init_process_group(backend='nccl')
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Use the node-local rank for device placement (correct for multi-node too)
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return rank, local_rank, world_size


def cleanup():
    dist.destroy_process_group()


def train(model, train_dataset, num_epochs=10, batch_size=32):
    rank, local_rank, world_size = setup()

    # Wrap model in DDP
    model = model.cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Distributed sampler ensures each GPU sees a different shard of the data
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True,
    )
    loader = DataLoader(
        train_dataset,
        batch_size=batch_size,   # per-GPU batch size
        sampler=sampler,
        num_workers=4,
        pin_memory=True,
    )

    optimizer = optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Important: reshuffles the shards each epoch
        for step, (input_ids, labels) in enumerate(loader):
            input_ids = input_ids.cuda(local_rank)
            labels = labels.cuda(local_rank)

            optimizer.zero_grad()
            logits = model(input_ids)
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
            loss.backward()  # gradient AllReduce happens here (overlapped with backward)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            if step % 100 == 0 and rank == 0:
                print(f"Epoch {epoch} Step {step}: loss={loss.item():.4f}")

    cleanup()


if __name__ == '__main__':
    # Launch with: torchrun --nproc_per_node=4 train_ddp.py
    # MyModel + dataset definitions
    # train(MyModel(), MyDataset())
    pass


# Launch script
"""
# Single node, 4 GPUs:
torchrun --nproc_per_node=4 train_ddp.py

# Multi-node (2 nodes x 4 GPUs):
# Node 0:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train_ddp.py
# Node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr=10.0.0.1 --master_port=29500 train_ddp.py
"""
```
DDP production training script
✅ Lesson 13.1 Summary — DDP
DDP (data parallel): the model is replicated on every GPU and the mini-batch is split. The forward pass is independent per GPU; in backward the gradients are synchronized with AllReduce. Ring-allreduce gives O(1) bandwidth per node. NCCL provides NVIDIA-optimized collective ops. DDP saves no memory — every GPU holds a full model copy — so it is not enough for 8B+ models → FSDP/ZeRO. PyTorch DDP: torch.nn.parallel.DistributedDataParallel. Launch: torchrun --nproc_per_node=N. In Lesson 13.2 we move on to FSDP/ZeRO sharded training.
Next Lesson: FSDP + ZeRO Sharded Training#
Lesson 13.2: Fully Sharded Data Parallel (FSDP), ZeRO stages 1/2/3 (Rajbhandari 2020), parameter/gradient/optimizer state sharding.
Frequently Asked Questions
Does DDP scale linearly with the number of GPUs?
In practice, scaling efficiency is 85-95%: AllReduce overhead makes the speedup sub-linear. With NCCL over NVLink/InfiniBand topologies, small models can reach 95%+ of linear scaling; beyond ~100 GPUs, bandwidth scaling becomes more complex.