
DeepSpeed ZeRO Stage 1/2/3 + ZeRO-Infinity: NVMe Offload + 70B on Single GPU?

ZeRO (Microsoft) is the father of sharding and predates FSDP. Stage 1 shards optimizer state, Stage 2 adds gradient sharding, Stage 3 adds parameter sharding (the FULL_SHARD equivalent). ZeRO-Infinity spills over to NVMe, making a 70B model on a single GPU theoretically possible (slow, but possible). Includes a ZeRO vs. FSDP decision matrix.

Şükrü Yusuf KAYA · 32 min read · Advanced

1. ZeRO Stages: The Memory Math#

Notation: N = number of GPUs, Φ = model parameter count (e.g., 70B), K = optimizer state multiplier (AdamW: 8 bytes/param in mixed precision, i.e., fp32 momentum + variance).
| Stage | Per-GPU memory | Communication overhead |
| --- | --- | --- |
| Baseline (DDP) | 2Φ + 2Φ + KΦ = 12Φ | all-reduce(G): 2Φ |
| ZeRO-1 (optimizer shard) | 2Φ + 2Φ + KΦ/N | reduce(G) + gather(O update) |
| ZeRO-2 (+ gradient shard) | 2Φ + 2Φ/N + KΦ/N | reduce-scatter(G) only |
| ZeRO-3 (+ param shard) | 2Φ/N + 2Φ/N + KΦ/N | all-gather(W) in forward and backward + reduce-scatter(G) |
Llama 70B + AdamW + 8 GPUs:
  • DDP: 12 × 70B = 840 GB / GPU → ❌
  • ZeRO-1: 2 + 2 + 1 = 5 × 70B = 350 GB / GPU → ❌
  • ZeRO-2: 2 + 0.25 + 1 = 3.25 × 70B = 227 GB / GPU → ❌
  • ZeRO-3: 0.25 + 0.25 + 1 = 1.5 × 70B = 105 GB / GPU → ⚠️ marginal (tight on an 80 GB H100)
  • ZeRO-3 + CPU offload: ~50 GB / GPU → ✅
  • ZeRO-Infinity + NVMe offload: ~15 GB / GPU → ✅ (a single H100 becomes possible!)
The short sketch below reproduces these numbers straight from the formulas.
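
A quick sanity check: plain Python arithmetic (no DeepSpeed needed) that reproduces the per-GPU numbers above from the stage formulas.

```python
# Per-GPU memory from the ZeRO formulas: 2Φ weights (bf16) + 2Φ grads (bf16) + KΦ optimizer.
PHI, N, K = 70e9, 8, 8  # params, GPUs, optimizer bytes/param (AdamW m+v in fp32)

def per_gpu_gb(shard_w: bool, shard_g: bool, shard_o: bool) -> float:
    """Each flag says whether that state is sharded across the N GPUs."""
    w = 2 * PHI / (N if shard_w else 1)
    g = 2 * PHI / (N if shard_g else 1)
    o = K * PHI / (N if shard_o else 1)
    return (w + g + o) / 1e9  # decimal GB, matching the text

stages = {
    "DDP":    (False, False, False),
    "ZeRO-1": (False, False, True),
    "ZeRO-2": (False, True,  True),
    "ZeRO-3": (True,  True,  True),
}
for name, flags in stages.items():
    print(f"{name}: {per_gpu_gb(*flags):.1f} GB/GPU")
# DDP: 840.0, ZeRO-1: 350.0, ZeRO-2: 227.5, ZeRO-3: 105.0
```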
```json
// === ds_config.json: Llama 70B ZeRO-3 + CPU offload (8×H100 80GB) ===
{
  "train_batch_size": 64,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,

  "bf16": { "enabled": true },

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "none",  // "cpu" for smaller GPUs
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },

  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,

  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-6,
      "betas": [0.9, 0.95],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  }
}
```
ds_config.json — Llama 70B ZeRO-3 + CPU offload
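
A minimal wiring sketch for the config above. `deepspeed.initialize` and `HfDeepSpeedConfig` are the real APIs; the checkpoint name, the training-step lines, and the NVMe path are placeholders for illustration.

```python
# Minimal sketch: wiring ds_config.json into a bare training loop.
# Assumptions: deepspeed + transformers installed; checkpoint name is a placeholder.
import json

import deepspeed
import torch
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

ds_config = json.load(open("ds_config.json"))  # strip // comments first if present
dschf = HfDeepSpeedConfig(ds_config)  # keep alive: lets ZeRO-3 shard weights at load time

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder; swap in your 70B checkpoint
    torch_dtype=torch.bfloat16,
)

# The engine owns grad accumulation, clipping, and the sharded AdamW from the config.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# one step:
# loss = engine(**batch).loss
# engine.backward(loss)
# engine.step()

# ZeRO-Infinity variant (assumption: local NVMe mounted at /local_nvme):
#   "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true}
#   "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true}

# Launch: deepspeed --num_gpus 8 train.py
```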

2. FSDP vs ZeRO: The 2026 Decision Matrix#

| Criterion | FSDP | DeepSpeed ZeRO |
| --- | --- | --- |
| PyTorch native | ✅ yes | ❌ extra dependency |
| API simplicity | high (FSDP2) | medium (config file required) |
| Speed | roughly equal | roughly equal |
| NVMe offload | limited | ZeRO-Infinity native |
| CPU offload | limited | mature |
| 3D parallelism | + Megatron-LM | + Megatron-DeepSpeed integration |
| HF Trainer integration | ✅ native | ✅ native |
| Reproducibility | good | good |
| 2026 momentum | ↑↑ (Meta FAIR push) | ↔ (stable, mature) |
The cookbook's rule:
  • Starting fresh in 2026 → FSDP2 (PyTorch native, momentum)
  • NVMe offload is a hard requirement → ZeRO-Infinity (still early days in FSDP)
  • Already have a working training pipeline → stay where you are
For the comparison in the deliverable, a minimal FSDP2 counterpart of the ZeRO-3 config follows.
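
A rough FSDP2 sketch, assuming PyTorch ≥ 2.6 (where `fully_shard`, `MixedPrecisionPolicy`, and `CPUOffloadPolicy` are exposed under `torch.distributed.fsdp`) and the same placeholder checkpoint as above; an approximate analogue of ZeRO-3 + CPU offload, not a drop-in replacement.

```python
# Minimal sketch: rough FSDP2 analogue of ZeRO-3 + CPU offload.
# Assumptions: PyTorch >= 2.6 FSDP2 API; launched via torchrun on 8 GPUs;
# checkpoint name and module path are placeholders for whatever you actually train.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy, fully_shard
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
)

mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16)
offload = CPUOffloadPolicy()  # note: offloads params + grads + optimizer state,
                              # coarser than ZeRO's optimizer-only CPU offload

# Shard each transformer block, then the root module (FULL_SHARD ~ ZeRO-3).
for layer in model.model.layers:  # Llama-style module path; adjust per model
    fully_shard(layer, mp_policy=mp, offload_policy=offload)
fully_shard(model, mp_policy=mp, offload_policy=offload)

optim = torch.optim.AdamW(model.parameters(), lr=5e-6,
                          betas=(0.9, 0.95), weight_decay=0.01)
# Launch: torchrun --nproc_per_node 8 train_fsdp.py
```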
✅ Deliverable
  1. Do a 70B mini-run with the DeepSpeed config above.
  2. Compare it against the FSDP equivalent.
  3. Next lesson: 4.4, Tensor Parallelism (Megatron).
