
3D Parallelism: Tensor + Pipeline + Data Parallel for Llama-3 70B and 405B Training

Frontier LLM training: Megatron-LM's 3D parallelism. Tensor Parallelism (Shoeybi 2019): splitting matrices across GPUs. Pipeline Parallelism (Huang 2018): splitting layers, plus bubble optimization. Combined 3D: DP × TP × PP. Llama-3 70B (DP=192, TP=8, PP=16). Communication patterns, optimization, and a capstone implementation outline.

Şükrü Yusuf KAYA
80-minute read
Advanced
🌐 3D Parallelism: the summit of frontier LLM training
Llama-3-8B trains fine on one node with FSDP. But Llama-3-70B? 405B? FSDP alone is not enough. Activation memory, communication bandwidth, optimizer state: each must be sharded in a different way. The answer is 3D parallelism: Tensor Parallel (TP, Megatron-LM 2019), Pipeline Parallel (PP, GPipe 2018), and Data Parallel (DP), combined so the model is sharded along every dimension. Llama-3-70B: DP=192 × TP=8 × PP=16 = 24,576 GPUs. This is the hidden architecture behind how frontier models are trained. In 80 minutes you will understand the mathematical anatomy of TP/PP/DP, the Megatron-LM approach, and the 70B+ training pipeline.

Lesson Map (10 Sections)#

  1. The limit of FSDP: why it falls short at 70B+
  2. Tensor Parallelism (TP): Shoeybi 2019, Megatron-LM
  3. TP communication: all-reduce per layer
  4. Pipeline Parallelism (PP): Huang 2018, GPipe
  5. Pipeline bubble: pipeline efficiency
  6. Micro-batching: bubble mitigation
  7. 3D combination: DP × TP × PP
  8. Llama-3-70B setup: the actual configuration
  9. Communication topology: NVLink + InfiniBand
  10. The future: 4D parallelism, sequence parallel

2-8. TP + PP + 3D Math#

2.1 Tensor Parallel (Megatron-LM)#

Shoeybi 2019: 'Megatron-LM'. Split a single transformer layer across multiple GPUs.
FFN example:
FFN(x) = (x W_1) W_2 (the nonlinearity between W_1 and W_2 is omitted to keep the algebra short)
2-way TP: column-shard W_1, row-shard W_2. Megatron splits W_1 by columns precisely so that the elementwise nonlinearity can be applied per shard without any synchronization.
  • GPU 0: W_1[:, :d/2], W_2[:d/2, :]
  • GPU 1: W_1[:, d/2:], W_2[d/2:, :]
Forward: each GPU computes half. AllReduce at end.
Math:
out = (x W_1[:, :d/2]) W_2[:d/2, :] + (x W_1[:, d/2:]) W_2[d/2:, :] = AllReduce(per-GPU partial outputs)
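A minimal single-process sketch in PyTorch that checks this identity numerically. The two "GPUs" are simulated with tensor slices, the sizes are toy values, and the nonlinearity is omitted as in the formula above:

```python
import torch

torch.manual_seed(0)
d = 8
x  = torch.randn(4, d)            # (batch, d_model)
W1 = torch.randn(d, 4 * d)        # up-projection (column-sharded)
W2 = torch.randn(4 * d, d)        # down-projection (row-sharded)

ref = (x @ W1) @ W2               # full FFN on one device

h = W1.shape[1] // 2              # shard boundary for 2-way TP
partial0 = (x @ W1[:, :h]) @ W2[:h, :]   # what "GPU 0" would compute
partial1 = (x @ W1[:, h:]) @ W2[h:, :]   # what "GPU 1" would compute

out = partial0 + partial1         # AllReduce == summing the partial outputs
print(torch.allclose(ref, out, atol=1e-4))   # True
```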

2.2 TP attention#

Multi-head attention shards naturally: each GPU handles a subset of the heads.
Llama-3-8B: 32 heads, TP=8 → 4 heads per GPU (Llama-3-70B: 64 heads → 8 per GPU).
Each GPU computes Q/K/V for its own heads, runs attention, and applies its shard of the output projection; an AllReduce combines the final output.
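The same partial-sum identity, sketched for head sharding. The sizes are toy values of my own choosing, there is no GQA, RoPE, or causal mask, and each loop iteration stands in for one TP rank:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, S, d_model, n_heads, tp = 2, 6, 64, 8, 4   # toy: 8 heads, TP=4 -> 2 heads per rank
hd  = d_model // n_heads                       # head_dim
hpr = n_heads // tp                            # heads per rank

x = torch.randn(B, S, d_model)
Wq, Wk, Wv, Wo = (torch.randn(d_model, d_model) for _ in range(4))

def split_heads(t, n):                         # (B, S, n*hd) -> (B, n, S, hd)
    return t.view(B, S, n, hd).transpose(1, 2)

# Reference: full multi-head attention on one device
q, k, v = (split_heads(x @ W, n_heads) for W in (Wq, Wk, Wv))
a   = F.softmax(q @ k.transpose(-1, -2) / hd**0.5, dim=-1) @ v
ref = a.transpose(1, 2).reshape(B, S, d_model) @ Wo

# TP: rank r owns hpr heads = column shards of Wq/Wk/Wv plus the
# matching row shard of Wo; AllReduce == summing the partial outputs.
out  = torch.zeros_like(ref)
cols = hpr * hd                                # columns owned per rank
for r in range(tp):
    sl = slice(r * cols, (r + 1) * cols)
    qr, kr, vr = (split_heads(x @ W[:, sl], hpr) for W in (Wq, Wk, Wv))
    ar = F.softmax(qr @ kr.transpose(-1, -2) / hd**0.5, dim=-1) @ vr
    out += ar.transpose(1, 2).reshape(B, S, cols) @ Wo[sl, :]

print(torch.allclose(ref, out, atol=1e-3))     # True
```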

2.3 TP communication#

Every layer needs 2 AllReduces (FFN + attention) in the forward pass and 2 more in the backward pass. That is heavy traffic, so the interconnect must be fast: NVLink within a node is ideal; cross-node InfiniBand works but is slower.
Llama-3: TP=8, kept within a node over NVLink. TP is never run across nodes.
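A back-of-envelope sketch of why this traffic demands NVLink. The batch and model dimensions below are illustrative assumptions, and the 2(t-1)/t factor is the standard ring-AllReduce cost model:

```python
# Rough per-layer TP AllReduce traffic (all sizes are assumptions)
batch, seq, d_model, tp = 1, 8192, 8192, 8      # hypothetical micro-batch and model dims
bytes_per_elem = 2                               # bf16 activations

act = batch * seq * d_model * bytes_per_elem     # one activation tensor
per_allreduce = 2 * (tp - 1) / tp * act          # ring all-reduce volume per rank
per_layer = 4 * per_allreduce                    # 2 AllReduces fwd + 2 bwd

print(f"{per_layer / 1e9:.2f} GB moved per rank per layer")   # ~0.94 GB
```

Over 80 transformer layers that is roughly 75 GB moved per rank per step: comfortable for ~900 GB/s NVLink, painful for a ~50 GB/s InfiniBand link. Hence TP stays inside the node.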

2.4 Pipeline Parallel (GPipe)#

Huang 2018: 'GPipe'. Split the model's layers across GPUs:
  • GPU 0: layers 0-7
  • GPU 1: layers 8-15
  • GPU 2: layers 16-23
  • GPU 3: layers 24-31
Forward runs as a pipeline, GPU 0 → GPU 1 → ..., with activations handed off between GPUs at each stage boundary.
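A tiny sketch of the uniform layer-to-stage mapping above. Real systems balance stages by compute and memory, not just layer count; this is the naive version:

```python
n_layers, pp = 32, 4                     # matches the 4-GPU split above

# Uniform assignment: stage s owns layers [s * n_layers // pp, (s+1) * n_layers // pp)
for s in range(pp):
    lo = s * n_layers // pp
    hi = (s + 1) * n_layers // pp
    print(f"GPU {s}: layers {lo}-{hi - 1}")
# GPU 0: layers 0-7 ... GPU 3: layers 24-31
```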

2.5 Pipeline bubble#

In a naive sequential pipeline, GPU 1 sits idle while GPU 0 runs its forward pass, GPU 0 sits idle while GPU 1 runs its backward pass, and so on:
GPU 0 fwd: 1, 2, 3, 4 ...
GPU 1 fwd:    1, 2, 3, 4 ... (starts after GPU 0)
The bubble is this per-stage idle time: pure wasted compute.

2.6 Micro-batching#

GPipe's trick: split the batch into M micro-batches and pipeline them:
Batch = 32, M = 4 micro-batches of size 8 each:
GPU 0 fwd: m1, m2, m3, m4
GPU 1 fwd:     m1, m2, m3, m4
GPU 2 fwd:         m1, m2, m3, m4
Bubble fraction = (P - 1) / (M + P - 1), where P = pipeline stages and M = micro-batches. Larger M → smaller bubble.
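The formula made concrete, as a quick sanity check against the PP=16 configuration that appears later:

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style schedule: p stages, m micro-batches."""
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(16, 4):.0%}")    # 79% idle: unusable
print(f"{bubble_fraction(16, 64):.0%}")   # 19% idle
print(f"{bubble_fraction(16, 256):.0%}")  # 6% idle: why large global batches help PP
```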

2.7 1F1B (one forward one backward) scheduling#

Narayanan 2021: PipeDream-style 1F1B scheduling. Interleaving forward and backward passes of different micro-batches caps in-flight activations at roughly P micro-batches instead of M, and the interleaved schedule used in Megatron-LM shrinks the bubble further.

2.8 3D combine#

DP × TP × PP = total GPUs.
Llama-3-70B setup (estimated):
  • DP = 192 (data parallel groups)
  • TP = 8 (within node, NVLink)
  • PP = 16 (across nodes, layer splits)
  • Total: 192 × 8 × 16 = 24,576 GPUs
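A sketch of how a global rank could decompose into (dp, pp, tp) coordinates. Making tp the fastest-varying dimension, so each TP group lands inside one 8-GPU node, follows the usual Megatron-style convention; this ordering is an assumption, not the only valid layout:

```python
DP, PP, TP = 192, 16, 8                 # 192 * 16 * 8 = 24,576 GPUs

def coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp), with tp varying fastest."""
    return rank // (TP * PP), (rank // TP) % PP, rank % TP

print(coords(0))       # (0, 0, 0)
print(coords(7))       # (0, 0, 7)   same TP group -> same node, NVLink
print(coords(8))       # (0, 1, 0)   next pipeline stage
print(coords(24575))   # (191, 15, 7) last GPU in the cluster
```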

2.9 Memory math#

Each GPU memory:
  • TP=8: params/8 (model shard)
  • PP=16: 1/16 of layers
  • Combined: P / (TP × PP) = P / 128 per GPU
  • DP: replicates this TP×PP shard across the 192 data-parallel groups (each replica holds the same per-GPU footprint)
Llama-3-70B per GPU memory:
  • Param shard: 70B / 128 ≈ 550M params × 2 bytes (bf16) = 1.1 GB
  • Optimizer (AdamW): 4× the param bytes = 4.4 GB (fp32 first and second moments; fp32 master weights would add more)
  • Activations: 4-8 GB (depends on micro-batch size, sequence length, and activation checkpointing)
  • Total: ~12-15 GB per GPU
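The same arithmetic as a function. The byte counts are the text's assumptions (bf16 weights, two fp32 AdamW moments) plus a mid-range activation estimate:

```python
def per_gpu_memory_gb(params=70e9, tp=8, pp=16,
                      param_bytes=2, optim_bytes=8, act_gb=6.0):
    shard = params / (tp * pp)                 # ~547M parameters per GPU
    return (shard * param_bytes / 1e9          # bf16 weights:   ~1.1 GB
            + shard * optim_bytes / 1e9        # AdamW m and v:  ~4.4 GB
            + act_gb)                          # activations:    4-8 GB

print(f"{per_gpu_memory_gb():.1f} GB")         # ~11.5 GB, within the ~12-15 GB estimate
```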

2.10 Communication topology#

  • TP: needs the highest bandwidth → NVLink within a node (8 GPUs)
  • PP: modest bandwidth → cross-node InfiniBand is fine (only boundary activations cross, once per micro-batch, versus per-layer AllReduces for TP)
  • DP: AllReduce of gradients once per step → InfiniBand
🎉 Module 13 Complete: Distributed Training
Across 3 lessons: DDP (the data-parallel foundation, AllReduce + NCCL), FSDP/ZeRO (sharded: Rajbhandari 2020, Llama-3-8B on a single node), and 3D parallelism (TP + PP + DP: Llama-3-70B on 24K GPUs). The systems anatomy of modern frontier LLM training is now complete. Module 13 inventory: 3 lessons, 225 min. Overall curriculum: 14 modules, 80 lessons, ~75 hours. Next up: Module 14, Fine-tuning (SFT, LoRA, QLoRA), and Module 15, RLHF + DPO.

Module 13 Inventory (Complete)#

| # | Lesson | Duration |
|---|--------|----------|
| 13.1 | DDP + AllReduce + NCCL | 70 min |
| 13.2 | FSDP + ZeRO (Rajbhandari 2020) | 75 min |
| 13.3 | 3D Parallelism: TP + PP + DP | 80 min |
| Total | 3 lessons | 225 min (~3.75 hours) |

Frequently Asked Questions

Which parallelism strategy for a Turkish model, by size? Small (1-3B Turkish model): FSDP alone (single node, 8 GPUs). 7B: FSDP, possibly with TP. 70B+: in practice, start from a Llama-3 base and fine-tune on Turkish; training from scratch is impractical because Turkish corpora are limited.
