
3D Parallelism: Tensor + Pipeline + Data Parallel for Llama-3 70B and 405B Training

Frontier LLM training: Megatron-LM's 3D parallelism. Tensor Parallelism (Shoeybi 2019): splitting matrices across GPUs. Pipeline Parallelism (Huang 2018): splitting layers, plus bubble optimization. Combined 3D: DP × TP × PP. Llama-3 70B (DP=192, TP=8, PP=16). Communication patterns, optimization, and a capstone implementation outline.

Şükrü Yusuf KAYA
80-minute read
Advanced
🌐 3D Parallelism: the summit of frontier LLM training
Llama-3-8B trains fine on one node with FSDP. But Llama-3-70B? 405B? FSDP alone is not enough. Activation memory, communication bandwidth, optimizer state: each must be sharded in a different way. The answer is 3D parallelism: Tensor Parallel (TP, Megatron-LM 2019), Pipeline Parallel (PP, GPipe 2018), and Data Parallel (DP), combined so the model is sharded along every dimension. Llama-3-70B: DP=192 × TP=8 × PP=16 = 24,576 GPUs. This is the hidden architecture behind how frontier models are trained. In 80 minutes you will understand the mathematical anatomy of TP/PP/DP, the Megatron-LM approach, and the 70B+ training pipeline.

Lesson Map (10 Sections)#

  1. The limit of FSDP: why it falls short at 70B+
  2. Tensor Parallelism (TP): Shoeybi 2019, Megatron-LM
  3. TP communication: all-reduce per layer
  4. Pipeline Parallelism (PP): Huang 2018, GPipe
  5. Pipeline bubble: pipeline efficiency
  6. Micro-batching: bubble mitigation
  7. 3D combination: DP × TP × PP
  8. Llama-3-70B setup: the actual configuration
  9. Communication topology: NVLink + InfiniBand
  10. The future: 4D parallelism, sequence parallel

2-8. TP + PP + 3D Math#

2.1 Tensor Parallel (Megatron-LM)#

Shoeybi 2019: 'Megatron-LM'. Split a single transformer layer across multiple GPUs.
FFN example:
FFN(x) = (x W_1) W_2 (the nonlinearity between W_1 and W_2 is omitted to keep the algebra short)
2-way TP: column-shard W_1, row-shard W_2. Megatron splits W_1 by columns precisely so that the elementwise nonlinearity can be applied per shard without any synchronization.
  • GPU 0: W_1[:, :d/2], W_2[:d/2, :]
  • GPU 1: W_1[:, d/2:], W_2[d/2:, :]
Forward: each GPU computes half. AllReduce at end.
Math:
out = (x W_1[:, :d/2]) W_2[:d/2, :] + (x W_1[:, d/2:]) W_2[d/2:, :] = AllReduce(per-GPU partial outputs)
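A minimal single-process sketch in PyTorch that checks this identity numerically. The two "GPUs" are simulated with tensor slices, the sizes are toy values, and the nonlinearity is omitted as in the formula above:

```python
import torch

torch.manual_seed(0)
d = 8
x  = torch.randn(4, d)            # (batch, d_model)
W1 = torch.randn(d, 4 * d)        # up-projection (column-sharded)
W2 = torch.randn(4 * d, d)        # down-projection (row-sharded)

ref = (x @ W1) @ W2               # full FFN on one device

h = W1.shape[1] // 2              # shard boundary for 2-way TP
partial0 = (x @ W1[:, :h]) @ W2[:h, :]   # what "GPU 0" would compute
partial1 = (x @ W1[:, h:]) @ W2[h:, :]   # what "GPU 1" would compute

out = partial0 + partial1         # AllReduce == summing the partial outputs
print(torch.allclose(ref, out, atol=1e-4))   # True
```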

2.2 TP attention#

Multi-head attention shards naturally: each GPU handles a subset of the heads.
Llama-3-8B: 32 heads, TP=8 → 4 heads per GPU (Llama-3-70B: 64 heads → 8 per GPU).
Each GPU computes Q/K/V for its own heads, runs attention, and applies its shard of the output projection; an AllReduce combines the final output.
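The same partial-sum identity, sketched for head sharding. The sizes are toy values of my own choosing, there is no GQA, RoPE, or causal mask, and each loop iteration stands in for one TP rank:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, S, d_model, n_heads, tp = 2, 6, 64, 8, 4   # toy: 8 heads, TP=4 -> 2 heads per rank
hd  = d_model // n_heads                       # head_dim
hpr = n_heads // tp                            # heads per rank

x = torch.randn(B, S, d_model)
Wq, Wk, Wv, Wo = (torch.randn(d_model, d_model) for _ in range(4))

def split_heads(t, n):                         # (B, S, n*hd) -> (B, n, S, hd)
    return t.view(B, S, n, hd).transpose(1, 2)

# Reference: full multi-head attention on one device
q, k, v = (split_heads(x @ W, n_heads) for W in (Wq, Wk, Wv))
a   = F.softmax(q @ k.transpose(-1, -2) / hd**0.5, dim=-1) @ v
ref = a.transpose(1, 2).reshape(B, S, d_model) @ Wo

# TP: rank r owns hpr heads = column shards of Wq/Wk/Wv plus the
# matching row shard of Wo; AllReduce == summing the partial outputs.
out  = torch.zeros_like(ref)
cols = hpr * hd                                # columns owned per rank
for r in range(tp):
    sl = slice(r * cols, (r + 1) * cols)
    qr, kr, vr = (split_heads(x @ W[:, sl], hpr) for W in (Wq, Wk, Wv))
    ar = F.softmax(qr @ kr.transpose(-1, -2) / hd**0.5, dim=-1) @ vr
    out += ar.transpose(1, 2).reshape(B, S, cols) @ Wo[sl, :]

print(torch.allclose(ref, out, atol=1e-3))     # True
```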

2.3 TP communication#

Every layer needs 2 AllReduces (FFN + attention) in the forward pass and 2 more in the backward pass. That is heavy traffic, so the interconnect must be fast: NVLink within a node is ideal; cross-node InfiniBand works but is slower.
Llama-3: TP=8, kept within a node over NVLink. TP is never run across nodes.
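A back-of-envelope sketch of why this traffic demands NVLink. The batch and model dimensions below are illustrative assumptions, and the 2(t-1)/t factor is the standard ring-AllReduce cost model:

```python
# Rough per-layer TP AllReduce traffic (all sizes are assumptions)
batch, seq, d_model, tp = 1, 8192, 8192, 8      # hypothetical micro-batch and model dims
bytes_per_elem = 2                               # bf16 activations

act = batch * seq * d_model * bytes_per_elem     # one activation tensor
per_allreduce = 2 * (tp - 1) / tp * act          # ring all-reduce volume per rank
per_layer = 4 * per_allreduce                    # 2 AllReduces fwd + 2 bwd

print(f"{per_layer / 1e9:.2f} GB moved per rank per layer")   # ~0.94 GB
```

Over 80 transformer layers that is roughly 75 GB moved per rank per step: comfortable for ~900 GB/s NVLink, painful for a ~50 GB/s InfiniBand link. Hence TP stays inside the node.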

2.4 Pipeline Parallel (GPipe)#

Huang 2018: 'GPipe'. Split the model's layers across GPUs:
  • GPU 0: layers 0-7
  • GPU 1: layers 8-15
  • GPU 2: layers 16-23
  • GPU 3: layers 24-31
Forward runs as a pipeline, GPU 0 → GPU 1 → ..., with activations handed off between GPUs at each stage boundary.
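A tiny sketch of the uniform layer-to-stage mapping above. Real systems balance stages by compute and memory, not just layer count; this is the naive version:

```python
n_layers, pp = 32, 4                     # matches the 4-GPU split above

# Uniform assignment: stage s owns layers [s * n_layers // pp, (s+1) * n_layers // pp)
for s in range(pp):
    lo = s * n_layers // pp
    hi = (s + 1) * n_layers // pp
    print(f"GPU {s}: layers {lo}-{hi - 1}")
# GPU 0: layers 0-7 ... GPU 3: layers 24-31
```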

2.5 Pipeline bubble#

In a naive sequential pipeline, GPU 1 sits idle while GPU 0 runs its forward pass, GPU 0 sits idle while GPU 1 runs its backward pass, and so on:
GPU 0 fwd: 1, 2, 3, 4 ...
GPU 1 fwd:    1, 2, 3, 4 ... (starts after GPU 0)
The bubble is this per-stage idle time: pure wasted compute.

2.6 Micro-batching#

GPipe's trick: split the batch into M micro-batches and pipeline them:
Batch = 32, M = 4 micro-batches of size 8 each:
GPU 0 fwd: m1, m2, m3, m4
GPU 1 fwd:     m1, m2, m3, m4
GPU 2 fwd:         m1, m2, m3, m4
Bubble fraction = (P - 1) / (M + P - 1), where P = pipeline stages and M = micro-batches. Larger M → smaller bubble.
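The formula made concrete, as a quick sanity check against the PP=16 configuration that appears later:

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style schedule: p stages, m micro-batches."""
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(16, 4):.0%}")    # 79% idle: unusable
print(f"{bubble_fraction(16, 64):.0%}")   # 19% idle
print(f"{bubble_fraction(16, 256):.0%}")  # 6% idle: why large global batches help PP
```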

2.7 1F1B (one forward one backward) scheduling#

Narayanan 2021: PipeDream-style 1F1B scheduling. Interleaving forward and backward passes of different micro-batches caps in-flight activations at roughly P micro-batches instead of M, and the interleaved schedule used in Megatron-LM shrinks the bubble further.

2.8 3D combine#

DP × TP × PP = total GPUs.
Llama-3-70B setup (estimated):
  • DP = 192 (data parallel groups)
  • TP = 8 (within node, NVLink)
  • PP = 16 (across nodes, layer splits)
  • Total: 192 × 8 × 16 = 24,576 GPUs
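A sketch of how a global rank could decompose into (dp, pp, tp) coordinates. Making tp the fastest-varying dimension, so each TP group lands inside one 8-GPU node, follows the usual Megatron-style convention; this ordering is an assumption, not the only valid layout:

```python
DP, PP, TP = 192, 16, 8                 # 192 * 16 * 8 = 24,576 GPUs

def coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp), with tp varying fastest."""
    return rank // (TP * PP), (rank // TP) % PP, rank % TP

print(coords(0))       # (0, 0, 0)
print(coords(7))       # (0, 0, 7)   same TP group -> same node, NVLink
print(coords(8))       # (0, 1, 0)   next pipeline stage
print(coords(24575))   # (191, 15, 7) last GPU in the cluster
```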

2.9 Memory math#

Each GPU memory:
  • TP=8: params/8 (model shard)
  • PP=16: 1/16 of layers
  • Combined: P / (TP × PP) = P / 128 per GPU
  • DP: replicates this TP×PP shard across the 192 data-parallel groups (each replica holds the same per-GPU footprint)
Llama-3-70B per GPU memory:
  • Param shard: 70B / 128 ≈ 550M params × 2 bytes (bf16) = 1.1 GB
  • Optimizer (AdamW): 4× the param bytes = 4.4 GB (fp32 first and second moments; fp32 master weights would add more)
  • Activations: 4-8 GB (depends on micro-batch size, sequence length, and activation checkpointing)
  • Total: ~12-15 GB per GPU
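The same arithmetic as a function. The byte counts are the text's assumptions (bf16 weights, two fp32 AdamW moments) plus a mid-range activation estimate:

```python
def per_gpu_memory_gb(params=70e9, tp=8, pp=16,
                      param_bytes=2, optim_bytes=8, act_gb=6.0):
    shard = params / (tp * pp)                 # ~547M parameters per GPU
    return (shard * param_bytes / 1e9          # bf16 weights:   ~1.1 GB
            + shard * optim_bytes / 1e9        # AdamW m and v:  ~4.4 GB
            + act_gb)                          # activations:    4-8 GB

print(f"{per_gpu_memory_gb():.1f} GB")         # ~11.5 GB, within the ~12-15 GB estimate
```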

2.10 Communication topology#

  • TP: needs the highest bandwidth → NVLink within a node (8 GPUs)
  • PP: modest bandwidth → cross-node InfiniBand is fine (only boundary activations cross, once per micro-batch, versus per-layer AllReduces for TP)
  • DP: AllReduce of gradients once per step → InfiniBand
🎉 Module 13 Complete: Distributed Training
Across 3 lessons: DDP (the data-parallel foundation, AllReduce + NCCL), FSDP/ZeRO (sharded: Rajbhandari 2020, Llama-3-8B on a single node), and 3D parallelism (TP + PP + DP: Llama-3-70B on 24K GPUs). The systems anatomy of modern frontier LLM training is now complete. Module 13 inventory: 3 lessons, 225 min. Overall curriculum: 14 modules, 80 lessons, ~75 hours. Next up: Module 14, Fine-tuning (SFT, LoRA, QLoRA), and Module 15, RLHF + DPO.

Module 13 Inventory (Complete)#

| # | Lesson | Duration |
|---|--------|----------|
| 13.1 | DDP + AllReduce + NCCL | 70 min |
| 13.2 | FSDP + ZeRO (Rajbhandari 2020) | 75 min |
| 13.3 | 3D Parallelism: TP + PP + DP | 80 min |
| Total | 3 lessons | 225 min (~3.75 hours) |

Frequently Asked Questions

Which parallelism strategy for a Turkish model, by size? Small (1-3B Turkish model): FSDP alone (single node, 8 GPUs). 7B: FSDP, possibly with TP. 70B+: in practice, start from a Llama-3 base and fine-tune on Turkish; training from scratch is impractical because Turkish corpora are limited.
