DeepSpeed ZeRO Stage 1/2/3 + ZeRO-Infinity: NVMe Offload + 70B Single GPU?

ZeRO (Microsoft) pioneered sharded data-parallel training and predates FSDP. Stage 1 shards the optimizer states, Stage 2 adds gradient sharding, and Stage 3 adds parameter sharding (equivalent to FSDP's FULL_SHARD). With ZeRO-Infinity, state spills over to NVMe, so training a 70B model on a single GPU is **theoretically possible** (slow, but possible). Decision matrix: ZeRO vs FSDP, which one should you pick?

Şükrü Yusuf KAYA

14.05.2026


1. ZeRO Stages: Memory Math

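Notation: N = number of GPUs, Φ = number of model parameters (e.g. 70B), K = optimizer-state multiplier in bytes per parameter (AdamW in mixed precision: 8 bytes/param for the two fp32 moments; the ZeRO paper uses K = 12 because it also counts the fp32 master weights). Weights and gradients are stored in 16-bit, hence the 2Φ terms below.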

Stage	Per-GPU memory (model states)	Communication per step
Baseline (DDP)	2Φ + 2Φ + KΦ = 12Φ	all-reduce(G), volume 2Φ
ZeRO-1 (optimizer shard)	2Φ + 2Φ + KΦ/N	reduce-scatter(G) + all-gather(updated W), volume 2Φ
ZeRO-2 (+ gradient shard)	2Φ + 2Φ/N + KΦ/N	reduce-scatter(G) + all-gather(updated W), volume 2Φ
ZeRO-3 (+ parameter shard)	2Φ/N + 2Φ/N + KΦ/N	all-gather(W) in forward and backward + reduce-scatter(G), volume 3Φ

Llama 70B + AdamW + 8 GPU:

DDP: 12 × 70B = 840 GB / GPU → ❌
ZeRO-1: 2 + 2 + 1 = 5 × 70B = 350 GB / GPU → ❌
ZeRO-2: 2 + 0.25 + 1 = 3.25 × 70B = 227.5 GB / GPU → ❌
ZeRO-3: 0.25 + 0.25 + 1 = 1.5 × 70B = 105 GB / GPU → ⚠️ model states alone exceed an 80 GB H100, so it does not fit without offload
ZeRO-3 + CPU offload: ~50 GB / GPU → ✅
ZeRO-Infinity + NVMe offload: ~15 GB / GPU → ✅ (a single H100 becomes possible, at a large throughput cost)
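
The per-stage arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, not a DeepSpeed API: the function name `zero_memory_gb` and its arguments are made up for illustration, it counts model states only (no activations, buffers, or fragmentation), and it assumes the article's K = 8 bytes/param.

```python
def zero_memory_gb(params_b: float, n_gpus: int, stage: int, k: int = 8) -> float:
    """Per-GPU model-state memory in GB for a given ZeRO stage.

    params_b is the parameter count in billions. Multiplying billions of
    parameters by bytes per parameter gives GB directly.
    Stage 0 means plain DDP (nothing sharded).
    """
    weights, grads, optim = 2.0, 2.0, float(k)  # bytes per parameter
    if stage >= 1:
        optim /= n_gpus    # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus    # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= n_gpus  # ZeRO-3: also shard parameters
    return params_b * (weights + grads + optim)


for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(70, 8, stage):.1f} GB per GPU")
# stage 0: 840.0 GB per GPU
# stage 1: 350.0 GB per GPU
# stage 2: 227.5 GB per GPU
# stage 3: 105.0 GB per GPU
```

Passing `k=12` instead gives the ZeRO paper's accounting, which also includes the fp32 master weights.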

json

// === ds_config.json — Llama 70B ZeRO-3 + CPU Offload (8×H100 80GB) ===
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
 
  "bf16": { "enabled": true },
 
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none",                  // use "cpu" on GPUs with less memory
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
 
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
 
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-6,
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  }
}

ds_config.json — Llama 70B ZeRO-3 + CPU offload
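
One detail in this config is easy to get wrong: DeepSpeed expects `train_batch_size` to equal `train_micro_batch_size_per_gpu × gradient_accumulation_steps × world size`, which here is 1 × 8 × 8 = 64. The sketch below checks that relation before launching a job. The helper `check_batch_config` is hypothetical and written in plain Python for illustration; it is not part of DeepSpeed.

```python
def check_batch_config(cfg: dict, world_size: int) -> None:
    """Raise ValueError if the three batch-size fields are inconsistent."""
    micro = cfg["train_micro_batch_size_per_gpu"]
    accum = cfg["gradient_accumulation_steps"]
    expected = micro * accum * world_size
    if cfg["train_batch_size"] != expected:
        raise ValueError(
            f"train_batch_size={cfg['train_batch_size']} but "
            f"{micro} x {accum} x {world_size} = {expected}"
        )


# Batch fields copied from the config above.
cfg = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}
check_batch_config(cfg, world_size=8)  # passes silently on 8 GPUs
```

Running the same config on 4 GPUs would raise, which is the cue to either double `gradient_accumulation_steps` or halve `train_batch_size`.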

2. FSDP vs ZeRO: 2026 Decision Matrix

Criterion	FSDP	DeepSpeed ZeRO
PyTorch native	✅ yes	❌ extra dependency
API simplicity	high (FSDP2)	medium (requires a DeepSpeed config)
Speed	roughly the same	roughly the same
NVMe offload	limited	native via ZeRO-Infinity
CPU offload	limited	well supported
3D parallelism	+ Megatron-LM	+ Megatron-DeepSpeed integration
HF Trainer integration	✅ native	✅ native
Reproducibility	good	good
2026 momentum	↑↑ (Meta FAIR push)	↔ (stable, mature)
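
To make the NVMe offload row concrete: in DeepSpeed, ZeRO-Infinity is switched on in the same `zero_optimization` section by setting the offload `device` to `"nvme"` and supplying an `nvme_path`, and NVMe offload is only available with stage 3. The sketch below derives such a section from a CPU-offload one. The helper `to_nvme_offload` and the path `/local_nvme` are assumptions for illustration, and tuning keys such as buffer sizes and the `aio` section are left out.

```python
import copy
import json


def to_nvme_offload(zero_cfg: dict, nvme_path: str) -> dict:
    """Return a copy of a ZeRO-3 section with optimizer and params on NVMe."""
    if zero_cfg.get("stage") != 3:
        raise ValueError("NVMe offload requires ZeRO stage 3")
    out = copy.deepcopy(zero_cfg)  # leave the caller's config untouched
    for key in ("offload_optimizer", "offload_param"):
        section = out.setdefault(key, {})
        section["device"] = "nvme"
        section["nvme_path"] = nvme_path  # hypothetical mount point
        section.setdefault("pin_memory", True)
    return out


zero3 = {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "offload_param": {"device": "none", "pin_memory": True},
}
infinity = to_nvme_offload(zero3, "/local_nvme")
print(json.dumps(infinity, indent=2))
```

The resulting dictionary would replace the `zero_optimization` section of the config; actual throughput then depends heavily on NVMe bandwidth.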