Tensor Parallelism (Megatron): Column-Parallel + Row-Parallel Linear — Splitting the Matrix
Megatron-LM (NVIDIA) tensor parallelism: the matrix itself is split across GPUs. Column-parallel linear (output channels split), row-parallel (input channels split), and the all-reduce/gather pattern. TP=2 vs TP=4 on 8×H100. FSDP + TP = 2D parallelism.
Şükrü Yusuf KAYA
32 min read
1. The Core Idea of Tensor Parallelism#
In FSDP/ZeRO, each GPU holds shards of different parameters, and all shards are gathered during the forward pass.
In TP, the matrix itself is split. For Y = XW with W ∈ R^{in × out} (the Megatron paper's convention; PyTorch's nn.Linear stores the transpose):
- Column-parallel (split along output channels): W = [W_1 | W_2 | ... | W_p]
- Row-parallel (split along input channels): W = [W_1; W_2; ...; W_p]
Each GPU owns a different chunk, and the parameters are never gathered; communication happens on the activations instead, either before or after the local matmul.
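A minimal single-process sketch of both splits (plain PyTorch, no distributed setup; shapes and names are illustrative, not Megatron-LM API):

```python
import torch

p = 2                       # pretend we have 2 "GPUs"
x = torch.randn(4, 16)      # activations: (batch, in_features)
W = torch.randn(16, 32)     # weight for Y = X @ W (in=16, out=32)

# Column-parallel: split W along the output dim; each rank computes a
# slice of Y, and concatenating the slices (all-gather) recovers Y.
W_cols = W.chunk(p, dim=1)
y_col = torch.cat([x @ Wi for Wi in W_cols], dim=1)

# Row-parallel: split W along the input dim; each rank sees only its
# slice of x, computes a partial Y, and summing (all-reduce) recovers Y.
x_parts = x.chunk(p, dim=1)
W_rows = W.chunk(p, dim=0)
y_row = sum(xi @ Wi for xi, Wi in zip(x_parts, W_rows))

assert torch.allclose(y_col, x @ W, atol=1e-5)
assert torch.allclose(y_row, x @ W, atol=1e-5)
```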
Usage in a Transformer#
- q_proj, k_proj, v_proj: column-parallel (output heads split)
- o_proj: row-parallel (partial outputs combined with an all-reduce)
- gate_proj, up_proj: column-parallel
- down_proj: row-parallel
How are these paired? In each transformer block:
input (replicated)
→ q/k/v column-parallel → each GPU holds different heads (no comm)
→ attention compute (local per GPU)
→ o_proj row-parallel → all-reduce
→ FFN gate/up column-parallel → silu(gate) * up (local per GPU)
→ FFN down row-parallel → all-reduce
→ output (replicated)

Net effect: only two all-reduces per block in the forward pass (and two more in the backward).
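Why the pairing matters: the large intermediate activation (silu(gate) * up) never leaves the GPU, because each column slice of gate/up lines up with the matching row slice of down. A single-process sketch of the FFN path under this pairing (plain PyTorch; names and shapes are illustrative, not Megatron code):

```python
import torch
import torch.nn.functional as F

p = 2
hidden, inter = 16, 64
x = torch.randn(4, hidden)            # replicated input (batch=4)
Wg = torch.randn(hidden, inter)       # gate_proj weight (Y = X @ W)
Wu = torch.randn(hidden, inter)       # up_proj weight
Wd = torch.randn(inter, hidden)       # down_proj weight

# gate/up: column-parallel; down: the matching row-parallel split
Wg_s, Wu_s = Wg.chunk(p, dim=1), Wu.chunk(p, dim=1)
Wd_s = Wd.chunk(p, dim=0)

# Each "rank" runs SwiGLU on its slice of the intermediate dim, then its
# down slice, producing a partial output; one all-reduce (sum) finishes.
partials = [(F.silu(x @ Wg_s[i]) * (x @ Wu_s[i])) @ Wd_s[i] for i in range(p)]
y_tp = sum(partials)

y_ref = (F.silu(x @ Wg) * (x @ Wu)) @ Wd
assert torch.allclose(y_tp, y_ref, atol=1e-4)
```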
```bash
# === Megatron-LM TP: Llama 70B (TP=8, 1 node, 8×H100 SXM) ===
# Requires the Megatron-LM repo:
# git clone https://github.com/NVIDIA/Megatron-LM

# Convert Llama 70B to Megatron format
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1 \
    --load-dir meta-llama/Llama-3.3-70B-Instruct \
    --save-dir llama-3.3-70b-megatron-tp8 \
    --tokenizer-model meta-llama/Llama-3.3-70B-Instruct/tokenizer.model

# Train script
torchrun --nproc_per_node=8 --nnodes=1 \
    pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --micro-batch-size 1 --global-batch-size 32 \
    --lr 5e-6 --train-iters 5000 \
    --bf16 \
    --load llama-3.3-70b-megatron-tp8 --save ckpt/
```

Megatron-LM TP=8, Llama 70B
2. FSDP + TP = 2D Parallelism#
Combine paradigms when one alone is not enough:
Scenario: 2 nodes × 8 GPUs = 16 GPUs, Llama 70B
- Node 1: 8 GPUs, TP=8 (over intra-node NVLink)
- Node 2: 8 GPUs, TP=8 (over intra-node NVLink)
- FSDP across nodes (HYBRID_SHARD): inter-node gradient reduce
TP_size × DP_size = total_GPUs → 8 × 2 = 16
Advantage: intra-node NVLink/NVSwitch is fast (450 GB/s) while inter-node InfiniBand is slow (50 GB/s). Putting TP's frequent activation all-reduces intra-node and FSDP's gradient traffic inter-node matches each paradigm to the right link → bandwidth-optimal.
Native in PyTorch 2.4+ via the DeviceMesh API:

```python
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))
# Now apply TP on the "tp" dim, FSDP on the "dp" dim
```
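A hedged end-to-end sketch of the 2D setup using PyTorch's DTensor TP API (parallelize_module with ColwiseParallel/RowwiseParallel) plus FSDP on the dp submesh. The plan keys assume HF Llama module names (self_attn.q_proj etc.); adapt them to your model and treat this as a sketch under those assumptions, not a drop-in script:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, parallelize_module,
)
from transformers import AutoModelForCausalLM

# 2 nodes × 8 GPUs: "dp" spans nodes, "tp" spans the 8 intra-node GPUs
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct", torch_dtype=torch.bfloat16
)

# TP on the intra-node dim: the column/row pairing from section 1
tp_plan = {
    "self_attn.q_proj": ColwiseParallel(),
    "self_attn.k_proj": ColwiseParallel(),
    "self_attn.v_proj": ColwiseParallel(),
    "self_attn.o_proj": RowwiseParallel(),
    "mlp.gate_proj": ColwiseParallel(),
    "mlp.up_proj": ColwiseParallel(),
    "mlp.down_proj": RowwiseParallel(),
}
for block in model.model.layers:
    parallelize_module(block, mesh["tp"], tp_plan)

# FSDP shards params/grads across the inter-node "dp" dim
model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)
```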
✅ Deliverables
1) Clone the Megatron-LM repo and try a small 1B model with TP=2. 2) Compare FSDP-only vs TP-only throughput. 3) Next lesson: 4.5, Pipeline Parallelism.