
DeepSpeed ZeRO Stage 1/2/3 + ZeRO-Infinity: NVMe Offload + 70B on Single GPU?

ZeRO (Microsoft) is the father of sharding and predates FSDP. Stage 1 shards optimizer state, Stage 2 adds gradient sharding, Stage 3 adds parameter sharding (the FULL_SHARD equivalent). ZeRO-Infinity spills over to NVMe, making a 70B model on a single GPU theoretically possible (slow, but possible). Includes a ZeRO vs. FSDP decision matrix.

Şükrü Yusuf KAYA · 32 min read · Advanced

1. ZeRO Stages: The Memory Math#

Notation: N = number of GPUs, Φ = model parameter count (e.g., 70B), K = optimizer state multiplier (AdamW: 8 bytes/param in mixed precision, i.e., fp32 momentum + variance).
| Stage | Per-GPU memory | Communication overhead |
| --- | --- | --- |
| Baseline (DDP) | 2Φ + 2Φ + KΦ = 12Φ | all-reduce(G): 2Φ |
| ZeRO-1 (optimizer shard) | 2Φ + 2Φ + KΦ/N | reduce(G) + gather(O update) |
| ZeRO-2 (+ gradient shard) | 2Φ + 2Φ/N + KΦ/N | reduce-scatter(G) only |
| ZeRO-3 (+ param shard) | 2Φ/N + 2Φ/N + KΦ/N | all-gather(W) in forward and backward + reduce-scatter(G) |
Llama 70B + AdamW + 8 GPUs:
  • DDP: 12 × 70B = 840 GB / GPU → ❌
  • ZeRO-1: 2 + 2 + 1 = 5 × 70B = 350 GB / GPU → ❌
  • ZeRO-2: 2 + 0.25 + 1 = 3.25 × 70B = 227 GB / GPU → ❌
  • ZeRO-3: 0.25 + 0.25 + 1 = 1.5 × 70B = 105 GB / GPU → ⚠️ marginal (tight on an 80 GB H100)
  • ZeRO-3 + CPU offload: ~50 GB / GPU → ✅
  • ZeRO-Infinity + NVMe offload: ~15 GB / GPU → ✅ (a single H100 becomes possible!)
The short sketch below reproduces these numbers straight from the formulas.
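
A quick sanity check: plain Python arithmetic (no DeepSpeed needed) that reproduces the per-GPU numbers above from the stage formulas.

```python
# Per-GPU memory from the ZeRO formulas: 2Φ weights (bf16) + 2Φ grads (bf16) + KΦ optimizer.
PHI, N, K = 70e9, 8, 8  # params, GPUs, optimizer bytes/param (AdamW m+v in fp32)

def per_gpu_gb(shard_w: bool, shard_g: bool, shard_o: bool) -> float:
    """Each flag says whether that state is sharded across the N GPUs."""
    w = 2 * PHI / (N if shard_w else 1)
    g = 2 * PHI / (N if shard_g else 1)
    o = K * PHI / (N if shard_o else 1)
    return (w + g + o) / 1e9  # decimal GB, matching the text

stages = {
    "DDP":    (False, False, False),
    "ZeRO-1": (False, False, True),
    "ZeRO-2": (False, True,  True),
    "ZeRO-3": (True,  True,  True),
}
for name, flags in stages.items():
    print(f"{name}: {per_gpu_gb(*flags):.1f} GB/GPU")
# DDP: 840.0, ZeRO-1: 350.0, ZeRO-2: 227.5, ZeRO-3: 105.0
```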
```json
// === ds_config.json: Llama 70B ZeRO-3 + CPU offload (8×H100 80GB) ===
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,

  "bf16": { "enabled": true },

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none",  // "cpu" for smaller GPUs
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },

  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-6,
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  }
}
```
ds_config.json — Llama 70B ZeRO-3 + CPU offload
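
A minimal wiring sketch for the config above. `deepspeed.initialize` and `HfDeepSpeedConfig` are the real APIs; the checkpoint name, the training-step lines, and the NVMe path are placeholders for illustration.

```python
# Minimal sketch: wiring ds_config.json into a bare training loop.
# Assumptions: deepspeed + transformers installed; checkpoint name is a placeholder.
import json

import deepspeed
import torch
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

ds_config = json.load(open("ds_config.json"))  # strip // comments first if present
dschf = HfDeepSpeedConfig(ds_config)  # keep alive: lets ZeRO-3 shard weights at load time

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder; swap in your 70B checkpoint
    torch_dtype=torch.bfloat16,
)

# The engine owns grad accumulation, clipping, and the sharded AdamW from the config.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# one step:
# loss = engine(**batch).loss
# engine.backward(loss)
# engine.step()

# ZeRO-Infinity variant (assumption: local NVMe mounted at /local_nvme):
#   "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true}
#   "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true}

# Launch: deepspeed --num_gpus 8 train.py
```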

2. FSDP vs ZeRO: The 2026 Decision Matrix#

| Criterion | FSDP | DeepSpeed ZeRO |
| --- | --- | --- |
| PyTorch native | ✅ yes | ❌ extra dependency |
| API simplicity | high (FSDP2) | medium (config file required) |
| Speed | roughly equal | roughly equal |
| NVMe offload | limited | ZeRO-Infinity native |
| CPU offload | limited | mature |
| 3D parallelism | + Megatron-LM | + Megatron-DeepSpeed integration |
| HF Trainer integration | ✅ native | ✅ native |
| Reproducibility | good | good |
| 2026 momentum | ↑↑ (Meta FAIR push) | ↔ (stable, mature) |
The cookbook's rule:
  • Starting fresh in 2026 → FSDP2 (PyTorch native, momentum)
  • NVMe offload is a hard requirement → ZeRO-Infinity (still early days in FSDP)
  • Already have a working training pipeline → stay where you are
For the comparison in the deliverable, a minimal FSDP2 counterpart of the ZeRO-3 config follows.
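
A rough FSDP2 sketch, assuming PyTorch ≥ 2.6 (where `fully_shard`, `MixedPrecisionPolicy`, and `CPUOffloadPolicy` are exposed under `torch.distributed.fsdp`) and the same placeholder checkpoint as above; an approximate analogue of ZeRO-3 + CPU offload, not a drop-in replacement.

```python
# Minimal sketch: rough FSDP2 analogue of ZeRO-3 + CPU offload.
# Assumptions: PyTorch >= 2.6 FSDP2 API; launched via torchrun on 8 GPUs;
# checkpoint name and module path are placeholders for whatever you actually train.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy, fully_shard
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
)

mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16)
offload = CPUOffloadPolicy()  # note: offloads params + grads + optimizer state,
                              # coarser than ZeRO's optimizer-only CPU offload

# Shard each transformer block, then the root module (FULL_SHARD ~ ZeRO-3).
for layer in model.model.layers:  # Llama-style module path; adjust per model
    fully_shard(layer, mp_policy=mp, offload_policy=offload)
fully_shard(model, mp_policy=mp, offload_policy=offload)

optim = torch.optim.AdamW(model.parameters(), lr=5e-6,
                          betas=(0.9, 0.95), weight_decay=0.01)
# Launch: torchrun --nproc_per_node 8 train_fsdp.py
```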
✅ Deliverable
  1. Do a 70B mini-run with the DeepSpeed config above.
  2. Compare it against the FSDP equivalent.
  3. Next lesson: 4.4, Tensor Parallelism (Megatron).
