Data Parallelism (DDP): Foundation of Multi-GPU LLM Training — AllReduce and NCCL Anatomy
Distributed Data Parallel (DDP) anatomy: model replication across GPUs, mini-batch split, forward/backward independent per GPU, gradient AllReduce synchronization. NCCL (NVIDIA Collective Communication Library), ring-allreduce algorithm, bandwidth math. PyTorch DDP API, launch scripts, common pitfalls (uneven batches, batch norm sync).
Şükrü Yusuf KAYA
70 min read
Advanced
DDP — the cornerstone of modern LLM training
You cannot train Llama-3-8B on a single H100 (8B params + activations + optimizer state exceed 80 GB → one GPU is not enough). You need 64+ GPUs. Distributed training has three main approaches: data parallel (DDP), model parallel (TP, PP), and sharded (FSDP, ZeRO). This lesson covers the foundation of data parallelism: the model is replicated on every GPU and the mini-batch is split across them. Gradients are synchronized with AllReduce, and NCCL is the NVIDIA library that optimizes this sync with the ring-allreduce algorithm. After 70 minutes you will have a deep grasp of DDP's mathematical and systems anatomy, the PyTorch DDP API, production launch scripts, and the common pitfalls.
Lesson Map (10 Sections)#
- The single-GPU limit — why distributed training is necessary
- The DDP approach — replicate the model + split the batch
- Independent forward pass — parallel compute
- Backward gradient sync — AllReduce primitives
- NCCL and ring-allreduce — efficient bandwidth
- PyTorch DDP API — the DDP class, init
- Launch scripts — torchrun, Slurm
- Effective batch size — DDP scaling
- Common pitfalls — uneven batches, BN sync
- DDP's limit — why DDP is not enough at 7B+
1-4. DDP Architecture#
1.1 The single-GPU limit#
Llama-3-8B training memory needs:
- Params: 8B × 4 bytes (fp32) = 32 GB
- Gradients: 8B × 4 bytes = 32 GB
- Optimizer state (AdamW: m + v): 2 × 8B × 4 bytes = 64 GB
- Activations: 2048 × 8192 × 32 layers × 2 bytes ≈ 1 GB (bf16, rough per-sequence estimate; assumes memory-efficient attention)
- Total: ~130 GB
A single 80 GB H100 is not enough — you need multiple GPUs.
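As a quick sanity check, here is the same arithmetic as a small sketch (the activation term is the same coarse per-sequence estimate as above, and the variable names are only for illustration):

```python
# Rough training-memory estimate for an 8B-parameter model with fp32 AdamW.
GB = 1e9
params = 8e9
fp32, bf16 = 4, 2

weights     = params * fp32                # model weights
grads       = params * fp32                # gradients
optimizer   = 2 * params * fp32            # AdamW m and v moments
activations = 2048 * 8192 * 32 * bf16      # seq_len x hidden x layers, bf16 (coarse)

total = weights + grads + optimizer + activations
print(f"weights     {weights / GB:6.1f} GB")
print(f"grads       {grads / GB:6.1f} GB")
print(f"optimizer   {optimizer / GB:6.1f} GB")
print(f"activations {activations / GB:6.1f} GB")
print(f"total       {total / GB:6.1f} GB   vs. 80 GB on a single H100")
```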
1.2 The DDP approach#
Each GPU holds a full copy of the model. The mini-batch is split across N GPUs:
Batch = 32 sequences, 4 GPUs → each GPU processes 8 sequences
The forward pass is independent — each GPU computes on its own shard of the mini-batch.
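A minimal sketch of the split with hypothetical tensors — in the full script later in this lesson, DistributedSampler performs this sharding at the dataset level:

```python
import torch

world_size = 4                        # number of GPUs
global_batch = torch.randn(32, 2048)  # 32 sequences (hypothetical data)

# Each rank takes its own contiguous shard: 32 / 4 = 8 sequences per GPU
shards = global_batch.chunk(world_size, dim=0)
for rank, shard in enumerate(shards):
    print(f"rank {rank}: local batch {tuple(shard.shape)}")  # (8, 2048)
```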
1.3 Backward gradient sync#
Forward → backward → gradient computation happens on every GPU.
Key point: the gradients must be synchronized — every GPU must apply the averaged gradient.
# Per-GPU local gradient
g_local_i = backward()
# AllReduce: sum all gradients across GPUs, divide by N
g_global = AllReduce(g_local_i) / N
# Optimizer update with the synced gradient
weights -= lr * g_global
1.4 Mathematical equivalence#
DDP loss:
L = (Σ_i loss_i) / B_total = (Σ_k sum of GPU k's losses) / B_total = (1/N) Σ_k L_k — the mean of the per-GPU mean losses L_k, when each GPU's batch size is equal.
Gradient: ∇L = (1/N) Σ_k ∇L_k — the average of the per-GPU gradients. The AllReduce mean therefore reproduces the correct full-batch gradient.
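A small single-process sketch to check this numerically: averaging per-shard gradients of a mean-reduced loss matches the full-batch gradient (the linear model and shapes are purely illustrative):

```python
import torch

torch.manual_seed(0)
w = torch.randn(16, 1, requires_grad=True)
x, y = torch.randn(32, 16), torch.randn(32, 1)

# Full-batch gradient (what a single GPU would compute)
loss_full = ((x @ w - y) ** 2).mean()
g_full, = torch.autograd.grad(loss_full, w)

# "DDP" gradient: per-shard gradients averaged (simulating AllReduce / N)
grads = []
for xs, ys in zip(x.chunk(4), y.chunk(4)):      # 4 equal shards = 4 "GPUs"
    loss_local = ((xs @ w - ys) ** 2).mean()
    grads.append(torch.autograd.grad(loss_local, w)[0])
g_ddp = torch.stack(grads).mean(dim=0)

print(torch.allclose(g_full, g_ddp, atol=1e-6))  # True
```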
1.5 Memory is not saved#
DDP saves no memory at all — every GPU holds the full model + gradients + optimizer state.
DDP only saves time (parallel compute), but the ~130 GB footprint of the 8B model still exceeds a single GPU's memory.
The solution: FSDP/ZeRO (Lesson 13.2).
5. AllReduce + NCCL#
5.1 AllReduce primitive#
AllReduce: N nodes each hold a local value, and every node wants the sum (or mean) of all of them.
Node 1: x_1
Node 2: x_2
...
Node N: x_N

After AllReduce:
Node 1: x_1 + x_2 + ... + x_N
Node 2: x_1 + x_2 + ... + x_N
...
Node N: x_1 + x_2 + ... + x_N
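In PyTorch this primitive is exposed as torch.distributed.all_reduce. A minimal sketch, assuming an NCCL process group launched with torchrun and one GPU per process (the file name is hypothetical):

```python
# all_reduce_demo.py — launch with: torchrun --nproc_per_node=4 all_reduce_demo.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

# Each rank holds its own value x_rank
x = torch.tensor([float(rank + 1)], device="cuda")

# In-place sum across all ranks; afterwards every rank holds 1 + 2 + ... + N
dist.all_reduce(x, op=dist.ReduceOp.SUM)
x /= dist.get_world_size()            # divide by N to get the mean

print(f"rank {rank}: {x.item()}")
dist.destroy_process_group()
```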
5.2 Naive AllReduce: O(N) bandwidth#
Master-node aggregator pattern:
- Every node sends its x_i to the master: ~N incoming transfers
- The master sums and broadcasts the result back: ~N outgoing transfers
- Each worker only moves 2 messages (2M bytes), but the master moves ~2NM bytes
Worse: the master becomes the bottleneck — its traffic grows linearly with N.
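A short accounting sketch of that asymmetry (the function name and the 8-node / 16 GB figures are just for illustration):

```python
# Bytes moved per node in the naive master-aggregator AllReduce, for message size M
def naive_allreduce_traffic(n_nodes: int, M: float):
    worker = 2 * M                  # send local gradient, receive the summed result
    master = 2 * (n_nodes - 1) * M  # receive from and broadcast to every other node
    return worker, master

worker, master = naive_allreduce_traffic(n_nodes=8, M=16e9)  # 16 GB of gradients
print(f"worker: {worker / 1e9:.0f} GB, master: {master / 1e9:.0f} GB")  # 32 GB vs 224 GB
```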
5.3 Ring AllReduce#
Per-node traffic is O(1) in the number of nodes — roughly 2M, independent of N. Ring topology:
N1 → N2 → N3 → ... → N1 (ring)
Algorithm:
- Reduce-scatter phase: each node passes 1/N of its data around the ring each step, accumulating partial sums
- All-gather phase: the fully reduced chunks are circulated around the ring so every node ends up with all of them
Total: 2 × (N−1) × M/N data per node (M = total message size). For large N this approaches 2M transferred per node, independent of N.
Better than the master pattern: no central bottleneck, and the cost is predictable.
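Below is a toy, pure-Python simulation of the two phases — lists stand in for GPU buffers and the chunk count equals the node count — which also counts the 2 × (N−1) chunk sends per node:

```python
import numpy as np

def ring_allreduce(values):
    """Toy ring AllReduce. values[i] = node i's local vector (all the same length)."""
    n = len(values)
    bufs = [np.array_split(v.astype(float), n) for v in values]  # n chunks per node
    chunk_sends_per_node = 0

    # Phase 1: reduce-scatter. At step s, node i sends chunk (i - s) % n to node i+1,
    # which accumulates it. After n-1 steps node i holds the full sum of chunk (i+1) % n.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            bufs[(i + 1) % n][c] = bufs[(i + 1) % n][c] + bufs[i][c]
        chunk_sends_per_node += 1

    # Phase 2: all-gather. At step s, node i sends its finished chunk (i + 1 - s) % n
    # to node i+1, which overwrites its copy. After n-1 steps everyone has every chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            bufs[(i + 1) % n][c] = bufs[i][c]
        chunk_sends_per_node += 1

    # Traffic: 2*(n-1) chunk sends of size M/n per node -> ~2M total for large n
    return [np.concatenate(b) for b in bufs], chunk_sends_per_node

values = [np.arange(8) * (i + 1) for i in range(4)]   # 4 nodes, 8 elements each
out, sends = ring_allreduce(values)
print(np.allclose(out[0], sum(values)), f"{sends} chunk sends per node")  # True, 6
```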
5.4 NCCL (NVIDIA Collective Communication Library)#
NVIDIA-optimized collective ops:
- AllReduce, AllGather, ReduceScatter, Broadcast
- GPU-direct (PCIe, NVLink, NVSwitch)
- IB/RoCE for multi-node
- Hardware-aware selection of custom ring/tree algorithms
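The same collectives are available through torch.distributed on top of the NCCL backend; a minimal sketch, assuming the process group is already initialized and the CUDA device set (tensor shapes are illustrative, and reduce_scatter_tensor needs a reasonably recent PyTorch):

```python
import torch
import torch.distributed as dist

rank, world = dist.get_rank(), dist.get_world_size()

# Broadcast: rank 0's tensor is copied to every rank
x = torch.full((4,), float(rank), device="cuda")
dist.broadcast(x, src=0)

# AllGather: every rank ends up with the list of all ranks' tensors
mine = torch.full((4,), float(rank), device="cuda")
gathered = [torch.empty(4, device="cuda") for _ in range(world)]
dist.all_gather(gathered, mine)

# ReduceScatter: the elementwise sum is split so each rank keeps one chunk
inp = torch.arange(world, dtype=torch.float32, device="cuda")
chunk = torch.empty(1, device="cuda")
dist.reduce_scatter_tensor(chunk, inp, op=dist.ReduceOp.SUM)  # rank r keeps the summed element r
```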
5.5 Bandwidth math#
H100 NVLink/NVSwitch: ~900 GB/s per GPU.
Multi-node InfiniBand: ~400 GB/s aggregate per node (e.g. 8 × 400 Gb/s NDR NICs on a DGX H100-class node).
Llama-3-8B gradient sync: bf16 gradients are 8B × 2 bytes = 16 GB; ring AllReduce moves ~2× that ≈ 32 GB per GPU → ~35 ms over NVLink, ~80 ms over IB per step.
Per training step: forward 100 ms + backward 200 ms + AllReduce 50 ms = 350 ms → AllReduce is ~14% overhead, which is acceptable (and PyTorch DDP hides much of it by overlapping the AllReduce with the backward pass).
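The arithmetic behind those numbers as a sketch — nominal bandwidths, ignoring latency and the overlap of communication with backward:

```python
# Ring AllReduce time estimate: per GPU, ~2*(N-1)/N * grad_bytes moves over the wire.
def allreduce_ms(grad_bytes: float, bandwidth_gbps: float, n_gpus: int) -> float:
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes   # bytes per GPU
    return traffic / (bandwidth_gbps * 1e9) * 1e3      # milliseconds

grads_bf16 = 8e9 * 2                                   # 8B params in bf16 = 16 GB
print(f"NVLink: {allreduce_ms(grads_bf16, 900, 8):.0f} ms")  # ~31 ms (≈35 ms in the text)
print(f"IB:     {allreduce_ms(grads_bf16, 400, 8):.0f} ms")  # ~70 ms (≈80 ms in the text)

step_ms = 100 + 200 + 50                               # forward + backward + AllReduce
print(f"AllReduce overhead: {50 / step_ms:.0%}")       # ~14%
```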
5.6 NCCL setup#
import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method='env://')
Env vars: MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK (plus LOCAL_RANK) — torchrun sets all of these automatically.
```python
# DDP production-grade PyTorch
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup():
    dist.init_process_group(backend='nccl')
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # Use the node-local rank for device placement (correct for multi-node too)
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return rank, local_rank, world_size


def cleanup():
    dist.destroy_process_group()


def train(model, train_dataset, num_epochs=10, batch_size=32):
    rank, local_rank, world_size = setup()

    # Wrap model in DDP
    model = model.cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Distributed sampler ensures each GPU sees a different shard of the data
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=world_size,
        rank=rank,
        shuffle=True,
    )
    loader = DataLoader(
        train_dataset,
        batch_size=batch_size,   # per-GPU batch size
        sampler=sampler,
        num_workers=4,
        pin_memory=True,
    )

    optimizer = optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Important: reshuffles the shards each epoch
        for step, (input_ids, labels) in enumerate(loader):
            input_ids = input_ids.cuda(local_rank)
            labels = labels.cuda(local_rank)

            optimizer.zero_grad()
            logits = model(input_ids)
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
            loss.backward()  # gradient AllReduce happens here (overlapped with backward)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            if step % 100 == 0 and rank == 0:
                print(f"Epoch {epoch} Step {step}: loss={loss.item():.4f}")

    cleanup()


if __name__ == '__main__':
    # Launch with: torchrun --nproc_per_node=4 train_ddp.py
    # MyModel + dataset definitions
    # train(MyModel(), MyDataset())
    pass


# Launch script
"""
# Single node, 4 GPUs:
torchrun --nproc_per_node=4 train_ddp.py

# Multi-node (2 nodes x 4 GPUs):
# Node 0:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=10.0.0.1 --master_port=29500 train_ddp.py
# Node 1:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr=10.0.0.1 --master_port=29500 train_ddp.py
"""
```
DDP production training script
✅ Lesson 13.1 Summary — DDP
DDP (data parallel): the model is replicated on every GPU and the mini-batch is split. The forward pass is independent per GPU; in backward the gradients are synchronized with AllReduce. Ring-allreduce gives O(1) bandwidth per node. NCCL provides NVIDIA-optimized collective ops. DDP saves no memory — every GPU holds a full model copy — so it is not enough for 8B+ models → FSDP/ZeRO. PyTorch DDP: torch.nn.parallel.DistributedDataParallel. Launch: torchrun --nproc_per_node=N. In Lesson 13.2 we move on to FSDP/ZeRO sharded training.
Next Lesson: FSDP + ZeRO Sharded Training#
Lesson 13.2: Fully Sharded Data Parallel (FSDP), ZeRO stages 1/2/3 (Rajbhandari 2020), parameter/gradient/optimizer state sharding.
Frequently Asked Questions
Does DDP scale linearly with the number of GPUs?
In practice, scaling efficiency is 85-95%: AllReduce overhead makes the speedup sub-linear. With NCCL over NVLink/InfiniBand topologies, small models can reach 95%+ of linear scaling; beyond ~100 GPUs, bandwidth scaling becomes more complex.