Pipeline Parallelism: GPipe + 1F1B + Interleaved — Bubble Overhead Math
Pipeline Parallel: model layers distributed across GPUs. Forward+Backward streamed. GPipe (simple + bubble overhead), 1F1B (memory efficient), Interleaved 1F1B (Megatron, halves bubble). 70B + 4-node × 8 GPU scenario.
Şükrü Yusuf KAYA
34 min read
1. Pipeline Parallel — Layer-by-Layer Split
Layer 0-19  on GPU0 (stage 0)
Layer 20-39 on GPU1 (stage 1)
Layer 40-59 on GPU2 (stage 2)
Layer 60-79 on GPU3 (stage 3)

Forward:
GPU0 [L0..L19]  → activation → GPU1
GPU1 [L20..L39] → activation → GPU2
GPU2 [L40..L59] → activation → GPU3
GPU3 [L60..L79] → loss

Backward (gradients):
GPU3 → grad → GPU2 → grad → GPU1 → grad → GPU0
Problem: in a naïve pipeline, one mini-batch runs all the way forward and then all the way backward, so while one stage works the other GPUs sit idle.
Solution: split the mini-batch into micro-batches and stream them through the pipeline (see the toy schedule sketch below).
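A toy schedule (illustration only, not framework code; the function name and layout are made up for this sketch) showing how p=4 stages overlap on m=8 micro-batches. The idle ".." slots at the edges are exactly the bubble analyzed in the next section:

```python
# Toy GPipe-style forward schedule: stage s starts micro-batch i at time step s + i,
# so different stages work on different micro-batches at the same time.
def print_forward_schedule(num_stages: int, num_micro_batches: int) -> None:
    total_steps = num_stages + num_micro_batches - 1
    for stage in range(num_stages):
        row = []
        for t in range(total_steps):
            mb = t - stage  # micro-batch index that reaches this stage at time t
            row.append(f"F{mb}" if 0 <= mb < num_micro_batches else "..")  # ".." = idle (bubble)
        print(f"stage {stage}: " + " ".join(f"{c:>3}" for c in row))

print_forward_schedule(num_stages=4, num_micro_batches=8)
# stage 0:  F0  F1  F2  F3  F4  F5  F6  F7  ..  ..  ..
# stage 3:  ..  ..  ..  F0  F1  F2  F3  F4  F5  F6  F7
```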
2. Bubble Overhead Math
bubble_fraction = (p - 1) / m
- p = number of pipeline stages (number of GPUs in the PP dimension)
- m = number of micro-batches
Example: p=4 stages, m=8 micro-batches → bubble = 3/8 = 37.5% of compute wasted.
Optimization: keep m high → the bubble shrinks. But if m is too high → each micro-batch becomes too small → the GPUs are underutilized.
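A quick pen-and-paper check of the formula, as a minimal sketch (bubble here means bubble time relative to ideal compute time, communication ignored):

```python
# Bubble time relative to ideal compute time, per the formula above.
def bubble_fraction(p: int, m: int) -> float:
    """p = pipeline stages, m = micro-batches per step."""
    return (p - 1) / m

print(bubble_fraction(p=4, m=8))   # 0.375  -> the 37.5% example above
print(bubble_fraction(p=4, m=64))  # ~0.047 -> more micro-batches, smaller bubble
print(bubble_fraction(p=8, m=8))   # 0.875  -> deep pipelines need many micro-batches
```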
1F1B (Megatron):
- Classic GPipe: run all forwards, then all backwards → large activation memory (m × layer_act)
- 1F1B (One Forward One Backward): run a backward after each forward → activation peak = O(p)
- Less memory + the same bubble (rough numbers below)
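As rough numbers for the memory claim (a sketch counting micro-batches of activations held by the first stage, not a profiler measurement):

```python
# Peak number of micro-batch activations held by the first pipeline stage.
p, m = 4, 8
gpipe_peak = m    # GPipe: all m forwards complete before the first backward starts
one_f1b_peak = p  # 1F1B: at most p micro-batches in flight before backwards begin
print(f"GPipe: {gpipe_peak} micro-batches of activations, 1F1B: {one_f1b_peak}")
# GPipe: 8 micro-batches of activations, 1F1B: 4
```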
Interleaved 1F1B (Megatron-LM):
- Split each stage into sub-stages (e.g., layers 0-9 and 40-49 end up on the same GPU)
- Cuts the bubble fraction in half with two virtual stages per GPU (see the sketch below)
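With v virtual stages (model chunks) per GPU, the Megatron-LM analysis gives bubble ≈ (p − 1) / (v·m), so v=2 halves it. A small sketch using this lesson's numbers (v=4 follows from the config below: 20 layers per GPU / 5 layers per virtual stage):

```python
# Interleaved 1F1B: each GPU holds v model chunks; bubble shrinks to (p-1)/(v*m).
def interleaved_bubble_fraction(p: int, m: int, v: int) -> float:
    return (p - 1) / (v * m)

print(interleaved_bubble_fraction(p=4, m=8, v=1))   # 0.375   plain 1F1B
print(interleaved_bubble_fraction(p=4, m=8, v=2))   # 0.1875  two chunks per GPU -> half the bubble
# Config below: 80 layers / (PP=4 * 5 layers per virtual stage) = 4 chunks per GPU, 64 micro-batches
print(interleaved_bubble_fraction(p=4, m=64, v=4))  # ~0.012
```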
```bash
# === Megatron-LM Pipeline Parallel — Llama 70B, PP=4 TP=2 (8×H100, 1 node) ===
# 70B model on 8 GPUs, TP=2 × PP=4
torchrun --nproc_per_node=8 --nnodes=1 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 4 \
  --num-layers-per-virtual-pipeline-stage 5 \
  --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
  --seq-length 4096 \
  --micro-batch-size 1 --global-batch-size 64 \
  --lr 5e-6 --train-iters 5000 \
  --bf16
# --num-layers-per-virtual-pipeline-stage 5 → Interleaved 1F1B
# global-batch-size=64 → micro-batches per step = 64 / (DP=1) = 64

# Bench (Llama 70B + 8×H100):
# - TP=8 PP=1: 1.85 step/s (intra-node)
# - TP=2 PP=4: 1.42 step/s (more comm, more bubble)
# - TP=4 PP=2: 1.65 step/s
#
# On a single node, TP-only is always faster. Pipeline's real advantage shows up
# multi-node: TP intra-node, PP inter-node.
```
Megatron PP+TP — 8×H100
3. Multi-Node — TP × PP × DP = 3D Parallelism
4 nodes × 8 GPUs = 32 GPUs
TP = 4 (intra-node, NVLink)
PP = 4 (inter-node, InfiniBand)
DP = 2 (replication for batch scaling)
4 × 4 × 2 = 32
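An illustrative rank decomposition for this 32-GPU layout (a sketch, not Megatron-LM's exact rank ordering; it just puts TP on the fastest-varying axis so TP peers share a node over NVLink and PP crosses nodes over IB):

```python
# Illustrative mapping of 32 global ranks to (tp, pp, dp) coordinates.
# Assumes 8 consecutive ranks per node: TP varies fastest (NVLink peers on one
# node), PP varies slowest (one pipeline stage per node, reached over IB).
TP, PP, DP = 4, 4, 2
assert TP * PP * DP == 32

def coords(rank: int) -> tuple[int, int, int]:
    tp = rank % TP            # tensor-parallel peer within the node
    dp = (rank // TP) % DP    # data-parallel replica within the node
    pp = rank // (TP * DP)    # pipeline stage == node index (inter-node)
    return tp, pp, dp

for r in (0, 5, 8, 31):
    print(r, coords(r))
# 0 (0, 0, 0)   5 (1, 0, 1)   8 (0, 1, 0)   31 (3, 3, 1)
```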
This is the most advanced setup in this cookbook. Generic open-source recipes are not enough here; in practice you use NVIDIA Megatron-LM or the combined DeepSpeed-Megatron stack.
The cookbook's 2026 recommendation:
- Instead of 4090s, rent a cloud 8×H100 node → single-node FSDP + TP is enough
- If you want multi-node 70B fine-tuning → do it in the cloud, but expect it to take 16+ hours
- 405B+ fine-tuning → Lambda Reserve or a CoreWeave 2-year reserve, $10K-50K per month
✅ Deliverables
1) Read the Megatron-LM PP example. 2) Work out the bubble fraction with pen and paper. 3) Next lesson: 4.6 — Sequence + Context Parallel.