Tensor Parallelism (Megatron): Column-Parallel + Row-Parallel Linear — Splitting the Matrix
Megatron-LM (NVIDIA) tensor parallelism: the matrix itself is split across GPUs. Column-parallel linear (output channels split), row-parallel (input channels split), and the all-reduce/gather pattern. TP=2 vs TP=4 on 8×H100. FSDP + TP = 2D parallelism.
Şükrü Yusuf KAYA
32 min read
1. The Core Idea of Tensor Parallelism#
In FSDP/ZeRO, each GPU holds shards of different parameters, and all shards are gathered during the forward pass.
In TP, the matrix itself is split. For Y = XW with W ∈ R^{in × out} (the Megatron paper's convention; PyTorch's nn.Linear stores the transpose):
- Column-parallel (split along output channels): W = [W_1 | W_2 | ... | W_p]
- Row-parallel (split along input channels): W = [W_1; W_2; ...; W_p]
Each GPU owns a different chunk, and the parameters are never gathered; communication happens on the activations instead, either before or after the local matmul.
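A minimal single-process sketch of both splits (plain PyTorch, no distributed setup; shapes and names are illustrative, not Megatron-LM API):

```python
import torch

p = 2                       # pretend we have 2 "GPUs"
x = torch.randn(4, 16)      # activations: (batch, in_features)
W = torch.randn(16, 32)     # weight for Y = X @ W (in=16, out=32)

# Column-parallel: split W along the output dim; each rank computes a
# slice of Y, and concatenating the slices (all-gather) recovers Y.
W_cols = W.chunk(p, dim=1)
y_col = torch.cat([x @ Wi for Wi in W_cols], dim=1)

# Row-parallel: split W along the input dim; each rank sees only its
# slice of x, computes a partial Y, and summing (all-reduce) recovers Y.
x_parts = x.chunk(p, dim=1)
W_rows = W.chunk(p, dim=0)
y_row = sum(xi @ Wi for xi, Wi in zip(x_parts, W_rows))

assert torch.allclose(y_col, x @ W, atol=1e-5)
assert torch.allclose(y_row, x @ W, atol=1e-5)
```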
Usage in a Transformer#
- q_proj, k_proj, v_proj: column-parallel (output heads split)
- o_proj: row-parallel (partial outputs combined with an all-reduce)
- gate_proj, up_proj: column-parallel
- down_proj: row-parallel
How are these paired? In each transformer block:
input (replicated)
→ q/k/v column-parallel → each GPU holds different heads (no comm)
→ attention compute (local per GPU)
→ o_proj row-parallel → all-reduce
→ FFN gate/up column-parallel → silu(gate) * up (local per GPU)
→ FFN down row-parallel → all-reduce
→ output (replicated)

Net effect: only two all-reduces per block in the forward pass (and two more in the backward).
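Why the pairing matters: the large intermediate activation (silu(gate) * up) never leaves the GPU, because each column slice of gate/up lines up with the matching row slice of down. A single-process sketch of the FFN path under this pairing (plain PyTorch; names and shapes are illustrative, not Megatron code):

```python
import torch
import torch.nn.functional as F

p = 2
hidden, inter = 16, 64
x = torch.randn(4, hidden)            # replicated input (batch=4)
Wg = torch.randn(hidden, inter)       # gate_proj weight (Y = X @ W)
Wu = torch.randn(hidden, inter)       # up_proj weight
Wd = torch.randn(inter, hidden)       # down_proj weight

# gate/up: column-parallel; down: the matching row-parallel split
Wg_s, Wu_s = Wg.chunk(p, dim=1), Wu.chunk(p, dim=1)
Wd_s = Wd.chunk(p, dim=0)

# Each "rank" runs SwiGLU on its slice of the intermediate dim, then its
# down slice, producing a partial output; one all-reduce (sum) finishes.
partials = [(F.silu(x @ Wg_s[i]) * (x @ Wu_s[i])) @ Wd_s[i] for i in range(p)]
y_tp = sum(partials)

y_ref = (F.silu(x @ Wg) * (x @ Wu)) @ Wd
assert torch.allclose(y_tp, y_ref, atol=1e-4)
```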
```bash
# === Megatron-LM TP: Llama 70B (TP=8, 1 node, 8×H100 SXM) ===
# Requires the Megatron-LM repo:
# git clone https://github.com/NVIDIA/Megatron-LM

# Convert Llama 70B to Megatron format
python tools/checkpoint/util.py \
    --model-type GPT \
    --loader llama_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1 \
    --load-dir meta-llama/Llama-3.3-70B-Instruct \
    --save-dir llama-3.3-70b-megatron-tp8 \
    --tokenizer-model meta-llama/Llama-3.3-70B-Instruct/tokenizer.model

# Train script
torchrun --nproc_per_node=8 --nnodes=1 \
    pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --num-layers 80 --hidden-size 8192 --num-attention-heads 64 \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --micro-batch-size 1 --global-batch-size 32 \
    --lr 5e-6 --train-iters 5000 \
    --bf16 \
    --load llama-3.3-70b-megatron-tp8 --save ckpt/
```

Megatron-LM TP=8, Llama 70B
2. FSDP + TP = 2D Parallelism#
Combine paradigms when one alone is not enough:
Scenario: 2 nodes × 8 GPUs = 16 GPUs, Llama 70B
- Node 1: 8 GPUs, TP=8 (over intra-node NVLink)
- Node 2: 8 GPUs, TP=8 (over intra-node NVLink)
- FSDP across nodes (HYBRID_SHARD): inter-node gradient reduce
TP_size × DP_size = total_GPUs → 8 × 2 = 16
Advantage: intra-node NVLink/NVSwitch is fast (450 GB/s) while inter-node InfiniBand is slow (50 GB/s). Putting TP's frequent activation all-reduces intra-node and FSDP's gradient traffic inter-node matches each paradigm to the right link → bandwidth-optimal.
Native in PyTorch 2.4+ via the DeviceMesh API:

```python
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))
# Now apply TP on the "tp" dim, FSDP on the "dp" dim
```
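A hedged end-to-end sketch of the 2D setup using PyTorch's DTensor TP API (parallelize_module with ColwiseParallel/RowwiseParallel) plus FSDP on the dp submesh. The plan keys assume HF Llama module names (self_attn.q_proj etc.); adapt them to your model and treat this as a sketch under those assumptions, not a drop-in script:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel, RowwiseParallel, parallelize_module,
)
from transformers import AutoModelForCausalLM

# 2 nodes × 8 GPUs: "dp" spans nodes, "tp" spans the 8 intra-node GPUs
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct", torch_dtype=torch.bfloat16
)

# TP on the intra-node dim: the column/row pairing from section 1
tp_plan = {
    "self_attn.q_proj": ColwiseParallel(),
    "self_attn.k_proj": ColwiseParallel(),
    "self_attn.v_proj": ColwiseParallel(),
    "self_attn.o_proj": RowwiseParallel(),
    "mlp.gate_proj": ColwiseParallel(),
    "mlp.up_proj": ColwiseParallel(),
    "mlp.down_proj": RowwiseParallel(),
}
for block in model.model.layers:
    parallelize_module(block, mesh["tp"], tp_plan)

# FSDP shards params/grads across the inter-node "dp" dim
model = FSDP(model, device_mesh=mesh["dp"], use_orig_params=True)
```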
✅ Deliverables
1) Clone the Megatron-LM repo and try a small 1B model with TP=2. 2) Compare FSDP-only vs TP-only throughput. 3) Next lesson: 4.5, Pipeline Parallelism.