
Data Parallelism (DDP): The Foundation of Multi-GPU LLM Training - AllReduce and NCCL Anatomy

Distributed Data Parallel (DDP) anatomy: model replication across GPUs, mini-batch splitting, independent forward/backward per GPU, gradient AllReduce synchronization. NCCL (NVIDIA Collective Communication Library), the ring-allreduce algorithm, bandwidth math. PyTorch DDP API, launch scripts, common pitfalls (uneven batches, batch norm sync).

Şükrü Yusuf KAYA
70-minute read
Advanced
🔗 DDP: the cornerstone of modern LLM training
You cannot train Llama-3-8B on a single H100 (8B params + activations + optimizer state = 80 GB+, so one GPU is not enough); you need 64+ GPUs. The distributed training paradigm has three main approaches: data parallel (DDP), model parallel (TP, PP), and sharded (FSDP, ZeRO). This lesson covers the cornerstone of data parallelism: the model is replicated on every GPU and the mini-batch is split, with gradients synchronized via AllReduce. NCCL is the NVIDIA library that optimizes this sync using the ring-allreduce algorithm. 70 minutes from now you will have a deep grasp of DDP's mathematical and systems anatomy, the PyTorch DDP API, production launch scripts, and the common pitfalls.

Lesson Map (10 Sections)#

  1. The single-GPU limit: why distributed training is needed
  2. The DDP approach: model replication + batch split
  3. Independent forward pass: parallel compute
  4. Backward gradient sync: AllReduce primitives
  5. NCCL and ring-allreduce: efficient bandwidth
  6. PyTorch DDP API: the DDP class, init
  7. Launch scripts: torchrun, Slurm
  8. Effective batch size: DDP scaling
  9. Common pitfalls: uneven batches, BN sync
  10. The DDP limit: why DDP is not enough at 7B+

1-4. DDP Architecture#

1.1 The single-GPU limit#

Llama-3-8B training memory needs:
  • Params: 8B × 4 byte (fp32) = 32 GB
  • Gradients: 8B × 4 byte = 32 GB
  • Optimizer state (AdamW: m + v): 2 × 8B × 4 = 64 GB
  • Activations: 2048 × 8192 × 32 layers × 2 byte = 1 GB (bf16, with attention memory savings)
  • Total: ~130 GB
A single 80 GB H100 is not enough. You need more than one GPU.
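A quick sanity check of this arithmetic (same assumptions as the list above: seq len 2048, hidden 8192, 32 layers, bf16 activations):
# Rough memory estimate for Llama-3-8B full fp32 training (matches the list above)
params = 8e9
bytes_params    = params * 4            # fp32 weights            ≈ 32 GB
bytes_grads     = params * 4            # fp32 gradients          ≈ 32 GB
bytes_optimizer = 2 * params * 4        # AdamW m + v             ≈ 64 GB
bytes_acts      = 2048 * 8192 * 32 * 2  # bf16 activations        ≈ 1 GB
total_gb = (bytes_params + bytes_grads + bytes_optimizer + bytes_acts) / 1e9
print(f"total ≈ {total_gb:.0f} GB")     # ≈ 129 GB → does not fit on one 80 GB H100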

1.2 The DDP approach#

Each GPU holds a full copy of the model. The mini-batch is split across N GPUs:
Batch = 32 sequences, 4 GPUs → each GPU processes 8 sequences
The forward pass is independent: each GPU computes on its own shard of the mini-batch.
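A minimal sketch of the split in index terms, assuming 32 samples and 4 ranks and ignoring shuffling (this mirrors the striding that DistributedSampler applies in the full script below):
# Which sample indices each GPU sees, DistributedSampler-style striding (no shuffle)
world_size = 4                               # number of GPUs (assumption for the example)
global_batch = list(range(32))               # 32 sequences in the global batch
for rank in range(world_size):
    shard = global_batch[rank::world_size]   # rank k takes indices k, k+4, k+8, ... → 8 each
    print(f"rank {rank}: {shard}")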

1.3 Backward gradient sync#

Forward → backward → gradient computation runs on every GPU. The key step: synchronize the gradients so that every GPU applies the same averaged gradient.
# Per-GPU local gradient
g_local_i = backward()
# AllReduce: sum all gradients across GPUs, divide by N
g_global = AllReduce(g_local_i) / N
# Optimizer update with synced gradient
weights -= lr * g_global

1.4 Mathematical equivalence#

DDP loss:
L = (1 / B_total) × Σ_all samples ℓ_i = (1 / N) × Σ_k L_k, where L_k is GPU k's mean loss (assuming equal per-GPU batch sizes)
Gradient: ∇L = (1/N) × Σ_k ∇L_k, the average of the per-GPU gradients, so the AllReduce mean yields exactly the correct full-batch gradient.
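A small numeric check of this equivalence; the toy linear model, the 8-sample batch, and the 2-GPU split below are made up purely for illustration:
import torch

# Toy check: the mean of per-GPU gradients equals the gradient of the full-batch mean loss
torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x = torch.randn(8, 4)                      # "global batch" of 8 samples
y = torch.randn(8)

# Full-batch gradient
loss_full = ((x @ w - y) ** 2).mean()
g_full, = torch.autograd.grad(loss_full, w)

# Simulate 2 "GPUs" with 4 samples each, then average the local gradients (AllReduce mean)
g_locals = []
for shard in (slice(0, 4), slice(4, 8)):
    loss_local = ((x[shard] @ w - y[shard]) ** 2).mean()
    g_locals.append(torch.autograd.grad(loss_local, w)[0])
g_ddp = torch.stack(g_locals).mean(dim=0)

print(torch.allclose(g_full, g_ddp))       # True: DDP averaging reproduces the full-batch gradient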

1.5 Memory not saved#

DDP saves no memory at all: every GPU holds the full model + gradients + optimizer state.
DDP only buys time (parallel compute), but the ~130 GB single-GPU requirement of an 8B model is still a blocker.
The solution: FSDP/ZeRO (Lesson 13.2).

5. AllReduce + NCCL#

5.1 AllReduce primitive#

AllReduce: N nodes each hold a local value, and every node wants the sum (or mean) of all of them.
Before AllReduce:
  Node 1: x_1
  Node 2: x_2
  ...
  Node N: x_N
After AllReduce:
  Node 1: x_1 + x_2 + ... + x_N
  Node 2: x_1 + x_2 + ... + x_N
  ...
  Node N: x_1 + x_2 + ... + x_N
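In PyTorch this primitive is a single call per rank. A minimal sketch, assuming the NCCL process group from section 5.6 is already initialized and each rank has already set its CUDA device:
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(backend='nccl') has already run on every rank
# and torch.cuda.set_device(...) points each rank at its own GPU.
rank = dist.get_rank()
world_size = dist.get_world_size()

x = torch.full((4,), float(rank), device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)   # in-place: every rank now holds the same sums
x /= world_size                            # convert the sum into the mean, as DDP does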

5.2 Naive AllReduce: O(N) bandwidth#

Master node aggregator pattern:
  1. Each node sends x_i to master: N transfers
  2. Master sums, broadcasts result: N transfers
  3. Per worker that is only 2 messages (2M bytes), but the master must move 2 × N messages (O(N·M) bytes)
Worse: the master becomes the bottleneck.
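A back-of-the-envelope comparison of the data each participant must move, assuming 8 GPUs, 32 GB of fp32 gradients, and a master that sends the result to each worker separately:
# Data moved per participant with a master aggregator: N workers, message size M bytes
N, M = 8, 32e9                        # assumption: 8 GPUs, 32 GB of fp32 gradients
worker_bytes = 2 * M                  # each worker sends x_i once and receives the result once
master_bytes = N * M + N * M          # master receives N messages, then sends N copies back
print(worker_bytes / 1e9, master_bytes / 1e9)   # 64.0 vs 512.0 GB → the master is the bottleneck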

5.3 Ring AllReduce#

O(1) bandwidth per node (theoretically). Ring topology:
N1 → N2 → N3 → ... → N1 (ring)
Algorithm:
  1. Reduce-scatter phase: each node sends 1/N of its data around ring, accumulating sums
  2. All-gather phase: nodes broadcast their sum chunks around
Total: 2 × (N-1) × M/N data per node (M = total message size). For large N, this approaches 2M transferred per node, regardless of N.
Better than master: no central bottleneck, predictable.
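A toy simulation of the two phases on plain Python lists (no real communication; the chunk schedule follows the standard ring formulation):
import random

def ring_allreduce(values):
    # Toy ring AllReduce: values[i][c] = chunk c of node i's data,
    # with each node's data pre-split into n chunks (n = number of nodes).
    n = len(values)
    data = [list(v) for v in values]
    # Phase 1 (reduce-scatter): after n-1 steps node i holds the full sum of chunk (i+1) % n.
    # Each step, every node passes one chunk of size M/n to its right neighbour.
    for step in range(n - 1):
        sent = [data[i][(i - step) % n] for i in range(n)]        # snapshot this step's sends
        for i in range(n):
            data[(i + 1) % n][(i - step) % n] += sent[i]
    # Phase 2 (all-gather): circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sent = [data[i][(i + 1 - step) % n] for i in range(n)]
        for i in range(n):
            data[(i + 1) % n][(i + 1 - step) % n] = sent[i]
    # Per node: 2 * (n-1) chunks of size M/n moved, i.e. ~2M total, independent of n.
    return data

# Usage: 4 nodes, 4 chunks each; afterwards every node holds the same elementwise sum
vals = [[random.random() for _ in range(4)] for _ in range(4)]
out = ring_allreduce(vals)
expected = [sum(vals[i][c] for i in range(4)) for c in range(4)]
assert all(abs(out[i][c] - e) < 1e-9 for i in range(4) for c, e in enumerate(expected))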

5.4 NCCL (NVIDIA Collective Communication Library)#

NVIDIA-optimized collective ops:
  • AllReduce, AllGather, ReduceScatter, Broadcast
  • GPU-direct (PCIe, NVLink, NVSwitch)
  • IB/RoCE for multi-node
  • Hardware-aware custom ring/tree algorithms

5.5 Bandwidth math#

H100 NVLink: 900 GB/s per GPU. Multi-node InfiniBand: ~400 GB/s aggregate per node (8 × 400 Gb/s NICs on a typical HGX H100 system). Llama-3-8B gradient sync (8B × 4 bytes = 32 GB of fp32 gradients): ~36 ms over NVLink, ~80 ms over IB per step.
Per training step: forward 100 ms + backward 200 ms + AllReduce 50 ms = 350 ms. An AllReduce overhead of ~14% is acceptable (and in practice DDP overlaps the AllReduce with the backward pass).
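The same arithmetic in a few lines; the 100/200/50 ms step breakdown is the illustrative figure from the text, not a measurement:
# Reproduce the back-of-the-envelope numbers (idealized: time ≈ message size / bandwidth)
grad_bytes = 32e9                          # 8B fp32 gradients
for name, bw in [("NVLink", 900e9), ("IB, node aggregate", 400e9)]:
    print(f"{name}: {grad_bytes / bw * 1e3:.0f} ms")   # ≈ 36 ms and 80 ms
step_ms = 100 + 200 + 50                   # forward + backward + AllReduce (illustrative)
print(f"AllReduce overhead: {50 / step_ms:.0%}")       # ≈ 14%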

5.6 NCCL setup#

import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method='env://')
Env vars: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK.
# DDP production-grade PyTorch
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def setup():
    dist.init_process_group(backend='nccl')
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ['LOCAL_RANK'])  # GPU index on this node (set by torchrun)
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

def cleanup():
    dist.destroy_process_group()

def train(model, train_dataset, num_epochs=10, batch_size=32):
    rank, world_size, local_rank = setup()
    # Wrap model in DDP
    model = model.cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    # Distributed sampler ensures each GPU sees a different shard of the data
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True,
    )
    loader = DataLoader(
        train_dataset,
        batch_size=batch_size,  # per-GPU batch size; effective batch = batch_size * world_size
        sampler=sampler,
        num_workers=4,
        pin_memory=True,
    )
    optimizer = optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Important: reshuffles the shards each epoch
        for step, (input_ids, labels) in enumerate(loader):
            input_ids = input_ids.cuda(local_rank)
            labels = labels.cuda(local_rank)
            optimizer.zero_grad()
            logits = model(input_ids)
            loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
            loss.backward()  # Gradient AllReduce happens here (overlapped with backward via buckets)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            if step % 100 == 0 and rank == 0:
                print(f"Epoch {epoch} Step {step}: loss={loss.item():.4f}")
    cleanup()

if __name__ == '__main__':
    # Launch with: torchrun --nproc_per_node=4 train_ddp.py
    # MyModel + dataset definitions
    # train(MyModel(), MyDataset())
    pass

# Launch script
"""
# Single node, 4 GPUs:
torchrun --nproc_per_node=4 train_ddp.py

# Multi-node (2 nodes × 4 GPUs):
# Node 0:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train_ddp.py
# Node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr=10.0.0.1 --master_port=29500 train_ddp.py
"""
DDP production training script
✅ Lesson 13.1 Summary: DDP
DDP (data parallel): the model is replicated on every GPU and the mini-batch is split. The forward pass is independent per GPU; in the backward pass gradients are synchronized with AllReduce. Ring-allreduce achieves O(1) bandwidth per node. NCCL provides NVIDIA-optimized collective ops. DDP saves no memory: every GPU holds a full model copy, so it is not enough for 8B+ models → FSDP/ZeRO. PyTorch DDP: torch.nn.parallel.DistributedDataParallel. Launch: torchrun --nproc_per_node=N. In Lesson 13.2 we move on to FSDP/ZeRO sharded training.

Next Lesson: FSDP + ZeRO Sharded Training#

Lesson 13.2: Fully Sharded Data Parallel (FSDP), ZeRO stages 1/2/3 (Rajbhandari 2020), parameter/gradient/optimizer state sharding.

Frequently Asked Questions

In practice expect 85-95% scaling efficiency; AllReduce overhead causes a non-linear slow-down. Use NCCL with InfiniBand and a good NVLink topology: small models reach 95%+ of linear scaling, while bandwidth scaling beyond 100+ GPUs is more complicated.
