Pipeline Parallelism: GPipe + 1F1B + Interleaved — Bubble Overhead Math
Pipeline Parallel: model layers distributed across GPUs. Forward+Backward streamed. GPipe (simple + bubble overhead), 1F1B (memory efficient), Interleaved 1F1B (Megatron, halves bubble). 70B + 4-node × 8 GPU scenario.
Şükrü Yusuf KAYA
34 min read
1. Pipeline Parallel — Layer-by-Layer Split
Layer 0-19  on GPU0 (stage 0)
Layer 20-39 on GPU1 (stage 1)
Layer 40-59 on GPU2 (stage 2)
Layer 60-79 on GPU3 (stage 3)

Forward:
GPU0 [L0..L19]  → activation → GPU1
GPU1 [L20..L39] → activation → GPU2
GPU2 [L40..L59] → activation → GPU3
GPU3 [L60..L79] → loss

Backward (gradients):
GPU3 → grad → GPU2 → grad → GPU1 → grad → GPU0
Problem: in a naïve pipeline, one mini-batch runs all the way forward and then all the way backward, so while one stage works the other GPUs sit idle.
Solution: split the mini-batch into micro-batches and stream them through the pipeline (see the toy schedule sketch below).
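A toy schedule (illustration only, not framework code; the function name and layout are made up for this sketch) showing how p=4 stages overlap on m=8 micro-batches. The idle ".." slots at the edges are exactly the bubble analyzed in the next section:

```python
# Toy GPipe-style forward schedule: stage s starts micro-batch i at time step s + i,
# so different stages work on different micro-batches at the same time.
def print_forward_schedule(num_stages: int, num_micro_batches: int) -> None:
    total_steps = num_stages + num_micro_batches - 1
    for stage in range(num_stages):
        row = []
        for t in range(total_steps):
            mb = t - stage  # micro-batch index that reaches this stage at time t
            row.append(f"F{mb}" if 0 <= mb < num_micro_batches else "..")  # ".." = idle (bubble)
        print(f"stage {stage}: " + " ".join(f"{c:>3}" for c in row))

print_forward_schedule(num_stages=4, num_micro_batches=8)
# stage 0:  F0  F1  F2  F3  F4  F5  F6  F7  ..  ..  ..
# stage 3:  ..  ..  ..  F0  F1  F2  F3  F4  F5  F6  F7
```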
2. Bubble Overhead Math
bubble_fraction = (p - 1) / m
- p = number of pipeline stages (number of GPUs in the PP dimension)
- m = number of micro-batches
Example: p=4 stages, m=8 micro-batches → bubble = 3/8 = 37.5% of compute wasted.
Optimization: keep m high → the bubble shrinks. But if m is too high → each micro-batch becomes too small → the GPUs are underutilized.
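A quick pen-and-paper check of the formula, as a minimal sketch (bubble here means bubble time relative to ideal compute time, communication ignored):

```python
# Bubble time relative to ideal compute time, per the formula above.
def bubble_fraction(p: int, m: int) -> float:
    """p = pipeline stages, m = micro-batches per step."""
    return (p - 1) / m

print(bubble_fraction(p=4, m=8))   # 0.375  -> the 37.5% example above
print(bubble_fraction(p=4, m=64))  # ~0.047 -> more micro-batches, smaller bubble
print(bubble_fraction(p=8, m=8))   # 0.875  -> deep pipelines need many micro-batches
```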
1F1B (Megatron):
- Classic GPipe: run all forwards, then all backwards → large activation memory (m × layer_act)
- 1F1B (One Forward One Backward): run a backward after each forward → activation peak = O(p)
- Less memory + the same bubble (rough numbers below)
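As rough numbers for the memory claim (a sketch counting micro-batches of activations held by the first stage, not a profiler measurement):

```python
# Peak number of micro-batch activations held by the first pipeline stage.
p, m = 4, 8
gpipe_peak = m    # GPipe: all m forwards complete before the first backward starts
one_f1b_peak = p  # 1F1B: at most p micro-batches in flight before backwards begin
print(f"GPipe: {gpipe_peak} micro-batches of activations, 1F1B: {one_f1b_peak}")
# GPipe: 8 micro-batches of activations, 1F1B: 4
```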
Interleaved 1F1B (Megatron-LM):
- Split each stage into sub-stages (e.g., layers 0-9 and 40-49 end up on the same GPU)
- Cuts the bubble fraction in half with two virtual stages per GPU (see the sketch below)
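With v virtual stages (model chunks) per GPU, the Megatron-LM analysis gives bubble ≈ (p − 1) / (v·m), so v=2 halves it. A small sketch using this lesson's numbers (v=4 follows from the config below: 20 layers per GPU / 5 layers per virtual stage):

```python
# Interleaved 1F1B: each GPU holds v model chunks; bubble shrinks to (p-1)/(v*m).
def interleaved_bubble_fraction(p: int, m: int, v: int) -> float:
    return (p - 1) / (v * m)

print(interleaved_bubble_fraction(p=4, m=8, v=1))   # 0.375   plain 1F1B
print(interleaved_bubble_fraction(p=4, m=8, v=2))   # 0.1875  two chunks per GPU -> half the bubble
# Config below: 80 layers / (PP=4 * 5 layers per virtual stage) = 4 chunks per GPU, 64 micro-batches
print(interleaved_bubble_fraction(p=4, m=64, v=4))  # ~0.012
```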
```bash
# === Megatron-LM Pipeline Parallel — Llama 70B, PP=4 TP=2 (8×H100, 1 node) ===
# 70B model on 8 GPUs, TP=2 × PP=4
torchrun --nproc_per_node=8 --nnodes=1 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 4 \
  --num-layers-per-virtual-pipeline-stage 5 \
  --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
  --seq-length 4096 \
  --micro-batch-size 1 --global-batch-size 64 \
  --lr 5e-6 --train-iters 5000 \
  --bf16
# --num-layers-per-virtual-pipeline-stage 5 → Interleaved 1F1B
# global-batch-size=64 → micro-batches per step = 64 / (DP=1) = 64

# Bench (Llama 70B + 8×H100):
# - TP=8 PP=1: 1.85 step/s (intra-node)
# - TP=2 PP=4: 1.42 step/s (more comm, more bubble)
# - TP=4 PP=2: 1.65 step/s
#
# On a single node, TP-only is always faster. Pipeline's real advantage shows up
# multi-node: TP intra-node, PP inter-node.
```
Megatron PP+TP — 8×H100
3. Multi-Node — TP × PP × DP = 3D Parallelism
4 nodes × 8 GPUs = 32 GPUs
TP = 4 (intra-node, NVLink)
PP = 4 (inter-node, InfiniBand)
DP = 2 (replication for batch scaling)
4 × 4 × 2 = 32
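An illustrative rank decomposition for this 32-GPU layout (a sketch, not Megatron-LM's exact rank ordering; it just puts TP on the fastest-varying axis so TP peers share a node over NVLink and PP crosses nodes over IB):

```python
# Illustrative mapping of 32 global ranks to (tp, pp, dp) coordinates.
# Assumes 8 consecutive ranks per node: TP varies fastest (NVLink peers on one
# node), PP varies slowest (one pipeline stage per node, reached over IB).
TP, PP, DP = 4, 4, 2
assert TP * PP * DP == 32

def coords(rank: int) -> tuple[int, int, int]:
    tp = rank % TP            # tensor-parallel peer within the node
    dp = (rank // TP) % DP    # data-parallel replica within the node
    pp = rank // (TP * DP)    # pipeline stage == node index (inter-node)
    return tp, pp, dp

for r in (0, 5, 8, 31):
    print(r, coords(r))
# 0 (0, 0, 0)   5 (1, 0, 1)   8 (0, 1, 0)   31 (3, 3, 1)
```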
This is the most advanced setup in this cookbook. Generic open-source recipes are not enough here; in practice you use NVIDIA Megatron-LM or the combined DeepSpeed-Megatron stack.
The cookbook's 2026 recommendation:
- Instead of 4090s, rent a cloud 8×H100 node → single-node FSDP + TP is enough
- If you want multi-node 70B fine-tuning → do it in the cloud, but expect it to take 16+ hours
- 405B+ fine-tuning → Lambda Reserve or a CoreWeave 2-year reserve, $10K-50K per month
✅ Deliverables
1) Read the Megatron-LM PP example. 2) Work out the bubble fraction with pen and paper. 3) Next lesson: 4.6 — Sequence + Context Parallel.