
Tensor Parallelism (Megatron): Column-Parallel + Row-Parallel Linear — Splitting the Matrix

Megatron-LM (NVIDIA) tensor parallelism: the matrix itself is split *within itself* across GPUs. Column-parallel linear (output channels split), row-parallel linear (input channels split), the all-reduce/all-gather pattern, TP=2 vs TP=4 on 8×H100, and FSDP+TP = 2D parallelism.

Şükrü Yusuf KAYA
32 min read
Advanced

1. The Core Idea of Tensor Parallelism#

In FSDP/ZeRO, each GPU holds shards of different parameters, and all shards are gathered during the forward pass.
In TP, a single matrix
W ∈ R^{out × in}
is itself split:
  • Column-parallel:
    W = [W_1 | W_2 | ... | W_p]
    (split along output channels)
  • Row-parallel:
    W = [W_1; W_2; ...; W_p]
    (split along input channels)
Each GPU owns a different chunk of the matrix; instead of gathering full weights, communication happens before or after the local matmul.
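The two split directions can be verified numerically. Below is a minimal NumPy sketch that simulates p=2 "GPUs" in one process (shapes and variable names are illustrative, not Megatron's): column-parallel recombines with a gather (concat), row-parallel recombines with a sum, which is exactly what an all-reduce does.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 2                                 # number of simulated "GPUs"
X = rng.standard_normal((4, 8))       # input: (batch, in_features)
W = rng.standard_normal((8, 6))       # weight: (in_features, out_features)
Y_ref = X @ W                         # full, single-device result

# Column-parallel: split W along the output dimension.
# Each rank computes a slice of the output; a gather (concat) recombines.
W_cols = np.split(W, p, axis=1)       # two (8, 3) shards
Y_col = np.concatenate([X @ Wi for Wi in W_cols], axis=1)

# Row-parallel: split W along the input dimension (and X to match).
# Each rank computes a partial sum; an all-reduce (sum) recombines.
W_rows = np.split(W, p, axis=0)       # two (4, 6) shards
X_cols = np.split(X, p, axis=1)       # matching input slices
Y_row = sum(Xi @ Wi for Xi, Wi in zip(X_cols, W_rows))

assert np.allclose(Y_ref, Y_col)      # gather-after reproduces the full matmul
assert np.allclose(Y_ref, Y_row)      # sum-after (all-reduce) reproduces it too
```

Note the asymmetry: column-parallel needs the full input on every rank but produces only a slice of the output, while row-parallel consumes only a slice of the input but produces a full-shaped partial output. This is what makes the two composable back-to-back.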

Usage in a Transformer:#

  • q_proj, k_proj, v_proj
    — column-parallel (attention heads split along the output dimension)
  • o_proj
    — row-parallel (input split along heads, recombined with an all-reduce)
  • gate_proj, up_proj
    — column-parallel
  • down_proj
    — row-parallel
How are these paired? Within each transformer block:
input (replicated) → q/k/v column-parallel → each GPU holds different heads (no comm) → attention compute (local per GPU) → o_proj row-parallel → all-reduce → FFN gate/up column-parallel → silu(gate) * up (local per GPU) → FFN down row-parallel → all-reduce → output (replicated)
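The key property of this pairing: between the column-parallel entry and the row-parallel exit, everything stays local, so each of the two sub-blocks costs exactly one all-reduce. A NumPy sketch of the FFN half (again simulating p=2 ranks in one process; weight names mirror the Llama projections above but shapes are toy-sized):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
p = 2
X = rng.standard_normal((4, 8))            # replicated input
Wg = rng.standard_normal((8, 16))          # gate_proj (column-parallel)
Wu = rng.standard_normal((8, 16))          # up_proj   (column-parallel)
Wd = rng.standard_normal((16, 8))          # down_proj (row-parallel)

# Reference: full, single-device FFN
Y_ref = (silu(X @ Wg) * (X @ Wu)) @ Wd

# TP simulation: each "rank" holds one column shard of gate/up and the
# matching row shard of down. No communication until the final sum.
partials = []
for r in range(p):
    g = X @ np.split(Wg, p, axis=1)[r]     # local gate slice
    u = X @ np.split(Wu, p, axis=1)[r]     # local up slice
    h = silu(g) * u                        # elementwise: stays local
    partials.append(h @ np.split(Wd, p, axis=0)[r])  # partial output

Y_tp = sum(partials)                       # the single all-reduce
assert np.allclose(Y_ref, Y_tp)
```

This works because silu(gate) * up is elementwise, so it commutes with the column split; the same argument applies to the attention half, where per-head softmax is local to the rank that owns those heads.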
```bash
# === Megatron-LM TP — Llama 70B (TP=8, 1 node, 8×H100 SXM) ===
# Requires the Megatron-LM repo:
# git clone https://github.com/NVIDIA/Megatron-LM

# Convert Llama 70B to Megatron format
python tools/checkpoint/util.py \
  --model-type GPT \
  --loader llama_hf \
  --saver megatron \
  --target-tensor-parallel-size 8 \
  --target-pipeline-parallel-size 1 \
  --load-dir meta-llama/Llama-3.3-70B-Instruct \
  --save-dir llama-3.3-70b-megatron-tp8 \
  --tokenizer-model meta-llama/Llama-3.3-70B-Instruct/tokenizer.model

# Train script
torchrun --nproc_per_node=8 --nnodes=1 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 1 \
  --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --micro-batch-size 1 --global-batch-size 32 \
  --lr 5e-6 --train-iters 5000 \
  --bf16 \
  --load llama-3.3-70b-megatron-tp8 --save ckpt/
```
Megatron-LM TP=8 Llama 70B

2. FSDP + TP = 2D Parallelism#

When a single paradigm isn't enough, combine them:
Scenario: 2 nodes × 8 GPUs = 16 GPUs, Llama 70B
  • Node 1: 8 GPUs, TP=8 (over intra-node NVLink)
  • Node 2: 8 GPUs, TP=8 (over intra-node NVLink)
  • FSDP across nodes (HYBRID_SHARD): inter-node gradient reduce
TP_size × DP_size = total_GPUs: 8 × 2 = 16
Advantage: intra-node NVLink/NVSwitch is fast (~450 GB/s) while inter-node InfiniBand is slow (~50 GB/s). Keeping TP intra-node and FSDP inter-node is bandwidth-optimal.
In PyTorch 2.4+ this is native via the DeviceMesh API:

```python
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))
# Now apply TP on the "tp" dim, FSDP on the "dp" dim
```
✅ Deliverables
  1. Clone the Megatron-LM repo and try a small 1B model with TP=2.
  2. Compare FSDP-only vs TP-only throughput.
  3. Next lesson: 4.5 — Pipeline Parallelism.
