3D Parallelism: Tensor + Pipeline + Data Parallel — Training Llama-3 70B and 405B
Frontier LLM training: Megatron-LM's 3D parallelism. Tensor Parallelism (Shoeybi 2019) — matrix splits across GPUs. Pipeline Parallelism (Huang 2018) — layer splits + bubble optimization. Combined 3D: DP × TP × PP. Llama-3 70B (DP=192, TP=8, PP=16). Communication patterns, optimization, capstone implementation outline.
Şükrü Yusuf KAYA
80 min read
Advanced 🌐 3D Parallelism — the pinnacle of frontier LLM training
Llama-3-8B trains fine with FSDP on a single node. But Llama-3-70B? 405B? FSDP alone is not enough: activation memory, communication bandwidth, and optimizer state each need to be sharded in a different way. The solution is 3D parallelism — Tensor Parallel (TP, Megatron-LM 2019), Pipeline Parallel (PP, GPipe 2018), and Data Parallel (DP) combined, so the model is sharded along every dimension. Llama-3-70B: DP=192 × TP=8 × PP=16 = 24,576 GPUs. This is the hidden architecture behind how frontier models are trained. In 80 minutes you will have internalized the mathematical anatomy of TP/PP/DP, the Megatron-LM approach, and the 70B+ training pipeline.
Lesson Map (10 Sections)#
- The limit of FSDP — why it falls short at 70B+
- Tensor Parallelism (TP) — Shoeybi 2019 Megatron-LM
- TP communication — all-reduce per layer
- Pipeline Parallelism (PP) — Huang 2018 GPipe
- Pipeline bubble — pipeline efficiency
- Micro-batching — bubble mitigation
- 3D combine — DP × TP × PP
- Llama-3-70B setup — actual configuration
- Communication topology — NVLink + IB
- Future: 4D parallelism — sequence parallel
2. TP + PP + 3D Math#
2.1 Tensor Parallel (Megatron-LM)#
Shoeybi et al. 2019, 'Megatron-LM': split a single transformer layer across multiple GPUs.
FFN example (bias omitted, σ = GeLU):
FFN(x) = σ(x W_1) W_2
2-way TP: shard W_1 by columns, W_2 by rows.
- GPU 0: W_1[:, :d/2], W_2[:d/2, :]
- GPU 1: W_1[:, d/2:], W_2[d/2:, :]
The column split matters because σ is elementwise: each GPU applies it to its own half of x W_1 with no communication. Forward: each GPU computes its half end to end; a single AllReduce then sums the partial outputs.
Math:
out = σ(x W_1[:, :d/2]) W_2[:d/2, :] + σ(x W_1[:, d/2:]) W_2[d/2:, :] = AllReduce(per-GPU partial outputs)
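A minimal sketch of this split in PyTorch — assuming a process group already initialized (e.g. via torchrun) and weights already sharded per rank; `tp_ffn_forward` and the shard shapes are illustrative, not Megatron-LM's actual API:

```python
# Megatron-style 2-way tensor-parallel FFN, minimal sketch.
# Assumes torch.distributed is initialized (e.g. launched with torchrun)
# and each rank already holds its own weight shards.
import torch
import torch.distributed as dist

def tp_ffn_forward(x, w1_shard, w2_shard, group=None):
    """x: [batch, d_model]; w1_shard: [d_model, d_ff // tp] (column shard);
    w2_shard: [d_ff // tp, d_model] (row shard)."""
    h = torch.nn.functional.gelu(x @ w1_shard)  # elementwise GeLU works per-shard
    partial = h @ w2_shard                      # this rank's partial output
    dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=group)  # sum partials
    return partial
```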
2.2 TP attention#
Shard multi-head attention: each GPU handles a subset of heads.
Llama-3-8B: 32 heads, TP=8 → 4 heads per GPU.
Each GPU computes Q/K/V for its own heads, runs attention, and applies its shard of the output projection; a final AllReduce combines the partial outputs.
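The head bookkeeping as a worked example (pure Python; head_dim = 128 and the contiguous layout are illustrative choices):

```python
# Head partition for TP attention: each rank owns n_heads / tp_size heads.
n_heads, tp_size, head_dim = 32, 8, 128    # values from the example above
heads_per_rank = n_heads // tp_size        # -> 4 heads per GPU
for rank in range(tp_size):
    lo, hi = rank * heads_per_rank, (rank + 1) * heads_per_rank
    # This rank computes Q/K/V of shape [batch, seq, heads_per_rank, head_dim]
    # for head indices [lo, hi) only.
    print(f"rank {rank}: heads {lo}..{hi - 1}")
```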
2.3 TP communication#
Every layer needs 2 AllReduces in the forward pass (one after attention, one after the FFN) and 2 more in the backward pass. That is heavy traffic.
It demands a fast interconnect — NVLink within the node is ideal; cross-node InfiniBand works but is slower.
Llama-3: TP=8, kept within a node on NVLink. TP never crosses node boundaries.
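To make "heavy" concrete, a back-of-envelope estimate of per-GPU TP traffic per layer — a ring all-reduce moves about 2(n−1)/n of the message size per GPU, and every number below (micro-batch, sequence length, hidden size) is illustrative:

```python
# Rough per-layer TP traffic per GPU (illustrative numbers).
micro_batch, seq_len, hidden, tp = 1, 8192, 8192, 8
bytes_bf16 = 2
msg = micro_batch * seq_len * hidden * bytes_bf16     # one activation tensor
per_call = 2 * (tp - 1) / tp * msg                    # ring all-reduce cost
calls_per_layer = 4                                   # 2 forward + 2 backward
print(f"{calls_per_layer * per_call / 1e9:.2f} GB per GPU per layer")  # ≈ 0.94
```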
2.4 Pipeline Parallel (GPipe)#
Huang et al. 2018, 'GPipe': split the layers across GPUs.
- GPU 0: layers 0-7
- GPU 1: layers 8-15
- GPU 2: layers 16-23
- GPU 3: layers 24-31
Forward: GPU 0 → GPU 1 → ... pipeline. Activations transfer between GPUs.
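The layer-to-stage bookkeeping for the split above, as a toy script (not a real pipeline runtime):

```python
# Assign contiguous blocks of layers to pipeline stages.
n_layers, pp = 32, 4
per_stage = n_layers // pp
for gpu in range(pp):
    lo, hi = gpu * per_stage, (gpu + 1) * per_stage - 1
    print(f"GPU {gpu}: layers {lo}-{hi}")
# GPU 0: layers 0-7 ... GPU 3: layers 24-31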
2.5 Pipeline bubble#
A naive sequential pipeline leaves most GPUs idle: GPU 1 waits while GPU 0 runs its forward pass, GPU 0 waits while GPU 1 runs its backward pass, and so on. Timeline sketch:

    GPU 0 fwd: 1, 2, 3, 4 ...
    GPU 1 fwd:       1, 2, 3, 4 ...  (can start only after GPU 0)

The bubble is this idle time per pipeline stage — pure inefficiency.
2.6 Micro-batching#
GPipe trick: split batch into M micro-batches. Pipeline:
Batch = 32, M = 4 micro-batches of size 8 each:

    GPU 0 fwd: m1, m2, m3, m4
    GPU 1 fwd:     m1, m2, m3, m4
    GPU 2 fwd:         m1, m2, m3, m4
Bubble fraction = (p − 1) / (m + p − 1), where p is the number of pipeline stages and m the number of micro-batches. Larger m → smaller bubble.
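Plugging a Llama-3-70B-style pipeline depth (p = 16) into this formula shows how strongly m drives efficiency:

```python
# GPipe bubble fraction: (p - 1) / (m + p - 1).
p = 16                                    # pipeline stages
for m in (4, 16, 64, 256):                # micro-batch counts
    print(f"m={m:4d}: bubble = {(p - 1) / (m + p - 1):.1%}")
# m=4: 78.9%  m=16: 48.4%  m=64: 19.0%  m=256: 5.5%
```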
2.7 1F1B (one forward one backward) scheduling#
Narayanan et al.: 1F1B ("one forward, one backward") scheduling, introduced in PipeDream and refined in the Megatron-LM 2021 paper.
Forward and backward passes of different micro-batches are interleaved. Plain 1F1B keeps roughly the same bubble as GPipe but caps in-flight activations at p micro-batches instead of m; the interleaved variant also shrinks the bubble itself.
2.8 3D combine#
DP × TP × PP = total GPUs.
Llama-3-70B setup (estimated; a rank-layout sketch follows this list):
- DP = 192 (data parallel groups)
- TP = 8 (within node, NVLink)
- PP = 16 (across nodes, layer splits)
- Total: 192 × 8 × 16 = 24,576 GPUs
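One possible rank layout for this configuration, sketched below — keeping TP innermost, so each TP group is the 8 NVLink-connected GPUs of a single node, is a design choice; real frameworks such as Megatron-LM make the ordering configurable:

```python
# Map a global rank to (dp, pp, tp) coordinates, TP innermost.
DP, PP, TP = 192, 16, 8                   # estimated Llama-3-70B setup above

def coords(rank):
    tp = rank % TP                        # position within the NVLink island
    pp = (rank // TP) % PP                # pipeline stage
    dp = rank // (TP * PP)                # data-parallel replica
    return dp, pp, tp

assert coords(0) == (0, 0, 0)
assert coords(TP * PP) == (1, 0, 0)       # first GPU of the next DP replica
print(coords(24_575))                     # -> (191, 15, 7): the last GPU
```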
2.9 Memory math#
Each GPU memory:
- TP=8: params/8 (model shard)
- PP=16: 1/16 of layers
- Combined: P / (TP × PP) = P / 128 per GPU
- DP: replicates this sharded layout across the 192 data-parallel groups (each DP replica holds a full copy of its shards)
Llama-3-70B per-GPU memory (reproduced in the sketch after this list):
- Param shard: 70B / 128 ≈ 547M params × 2 bytes (bf16) ≈ 1.1 GB
- Gradient shard (bf16): another ≈ 1.1 GB
- Optimizer (AdamW states): ≈ 4× the param shard ≈ 4.4 GB
- Activations: 4-8 GB
- Total: ~11-15 GB per GPU
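The same arithmetic as a script (the 4× AdamW figure is the lesson's rule of thumb, and activations are a rough range — a sketch, not a profiler):

```python
# Per-GPU memory estimate for Llama-3-70B under TP=8, PP=16.
params, tp, pp = 70e9, 8, 16
shard_gb = params / (tp * pp) * 2 / 1e9   # bf16 param shard ≈ 1.1 GB
grad_gb = shard_gb                        # bf16 gradient shard ≈ 1.1 GB
optim_gb = 4 * shard_gb                   # AdamW states, 4x rule ≈ 4.4 GB
for act_gb in (4, 8):                     # activation low/high estimate
    total = shard_gb + grad_gb + optim_gb + act_gb
    print(f"activations {act_gb} GB -> total {total:.1f} GB")
# -> ≈ 10.6 GB and ≈ 14.6 GB, i.e. the ~11-15 GB range above
```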
2.10 Communication topology#
- TP: high bandwidth needed → NVLink within node (8 GPUs)
- PP: low bandwidth → cross-node InfiniBand is fine (only stage-boundary activations move, which are comparatively small)
- DP: AllReduce gradients periodically → InfiniBand
🎉 Module 13 Complete — Distributed Training
Across three lessons: DDP (the data-parallel foundation, AllReduce + NCCL), FSDP/ZeRO (sharding — Rajbhandari 2020, Llama-3-8B on a single node), and 3D parallelism (TP + PP + DP — Llama-3-70B on 24K GPUs). The systems anatomy of modern frontier LLM training is now complete. Module 13 inventory: 3 lessons, 225 min. Overall curriculum: 14 modules, 80 lessons, ~75 hours. Next up: Module 14 — Fine-tuning (SFT, LoRA, QLoRA) and Module 15 — RLHF + DPO.
Module 13 Inventory (Completed)#
| # | Lesson | Duration |
|---|---|---|
| 13.1 | DDP + AllReduce + NCCL | 70 min |
| 13.2 | FSDP + ZeRO (Rajbhandari 2020) | 75 min |
| 13.3 | 3D Parallelism: TP + PP + DP | 80 min |
| Total | 3 lessons | 225 min (~3.75 hr) |
Frequently Asked Questions
Which setup for which scale? Small (1-3B Turkish model): FSDP only (single node, 8 GPUs). 7B model: FSDP, possibly plus TP. 70B+ Turkish model: in practice, a Llama-3 base plus Turkish fine-tuning — training from scratch is not realistic while the Turkish corpus remains limited.