3D Parallelism: Tensor + Pipeline + Data Parallel — Training Llama-3 70B and 405B
Frontier LLM training: Megatron-LM's 3D parallelism. Tensor Parallelism (Shoeybi 2019) — matrix splits across GPUs. Pipeline Parallelism (Huang 2018) — layer splits + bubble optimization. Combined 3D: DP × TP × PP. Llama-3 70B (DP=192, TP=8, PP=16). Communication patterns, optimization, capstone implementation outline.
Şükrü Yusuf KAYA
80 min read
Advanced 🌐 3D Parallelism — the pinnacle of frontier LLM training
Llama-3-8B trains fine with FSDP on a single node. But Llama-3-70B? 405B? FSDP alone is not enough: activation memory, communication bandwidth, and optimizer state each need to be sharded in a different way. The solution is 3D parallelism — Tensor Parallel (TP, Megatron-LM 2019), Pipeline Parallel (PP, GPipe 2018), and Data Parallel (DP) combined, so the model is sharded along every dimension. Llama-3-70B: DP=192 × TP=8 × PP=16 = 24,576 GPUs. This is the hidden architecture behind how frontier models are trained. In 80 minutes you will have internalized the mathematical anatomy of TP/PP/DP, the Megatron-LM approach, and the 70B+ training pipeline.
Lesson Map (10 Sections)#
- The limit of FSDP — why it falls short at 70B+
- Tensor Parallelism (TP) — Shoeybi 2019 Megatron-LM
- TP communication — all-reduce per layer
- Pipeline Parallelism (PP) — Huang 2018 GPipe
- Pipeline bubble — pipeline efficiency
- Micro-batching — bubble mitigation
- 3D combine — DP × TP × PP
- Llama-3-70B setup — actual configuration
- Communication topology — NVLink + IB
- Future: 4D parallelism — sequence parallel
2. TP + PP + 3D Math#
2.1 Tensor Parallel (Megatron-LM)#
Shoeybi et al. 2019, 'Megatron-LM': split a single transformer layer across multiple GPUs.
FFN example (bias omitted, σ = GeLU):
FFN(x) = σ(x W_1) W_2
2-way TP: shard W_1 by columns, W_2 by rows.
- GPU 0: W_1[:, :d/2], W_2[:d/2, :]
- GPU 1: W_1[:, d/2:], W_2[d/2:, :]
The column split matters because σ is elementwise: each GPU applies it to its own half of x W_1 with no communication. Forward: each GPU computes its half end to end; a single AllReduce then sums the partial outputs.
Math:
out = σ(x W_1[:, :d/2]) W_2[:d/2, :] + σ(x W_1[:, d/2:]) W_2[d/2:, :] = AllReduce(per-GPU partial outputs)
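A minimal sketch of this split in PyTorch — assuming a process group already initialized (e.g. via torchrun) and weights already sharded per rank; `tp_ffn_forward` and the shard shapes are illustrative, not Megatron-LM's actual API:

```python
# Megatron-style 2-way tensor-parallel FFN, minimal sketch.
# Assumes torch.distributed is initialized (e.g. launched with torchrun)
# and each rank already holds its own weight shards.
import torch
import torch.distributed as dist

def tp_ffn_forward(x, w1_shard, w2_shard, group=None):
    """x: [batch, d_model]; w1_shard: [d_model, d_ff // tp] (column shard);
    w2_shard: [d_ff // tp, d_model] (row shard)."""
    h = torch.nn.functional.gelu(x @ w1_shard)  # elementwise GeLU works per-shard
    partial = h @ w2_shard                      # this rank's partial output
    dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=group)  # sum partials
    return partial
```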
2.2 TP attention#
Shard multi-head attention: each GPU handles a subset of heads.
Llama-3-8B: 32 heads, TP=8 → 4 heads per GPU.
Each GPU computes Q/K/V for its own heads, runs attention, and applies its shard of the output projection; a final AllReduce combines the partial outputs.
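The head bookkeeping as a worked example (pure Python; head_dim = 128 and the contiguous layout are illustrative choices):

```python
# Head partition for TP attention: each rank owns n_heads / tp_size heads.
n_heads, tp_size, head_dim = 32, 8, 128    # values from the example above
heads_per_rank = n_heads // tp_size        # -> 4 heads per GPU
for rank in range(tp_size):
    lo, hi = rank * heads_per_rank, (rank + 1) * heads_per_rank
    # This rank computes Q/K/V of shape [batch, seq, heads_per_rank, head_dim]
    # for head indices [lo, hi) only.
    print(f"rank {rank}: heads {lo}..{hi - 1}")
```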
2.3 TP communication#
Every layer needs 2 AllReduces in the forward pass (one after attention, one after the FFN) and 2 more in the backward pass. That is heavy traffic.
It demands a fast interconnect — NVLink within the node is ideal; cross-node InfiniBand works but is slower.
Llama-3: TP=8, kept within a node on NVLink. TP never crosses node boundaries.
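To make "heavy" concrete, a back-of-envelope estimate of per-GPU TP traffic per layer — a ring all-reduce moves about 2(n−1)/n of the message size per GPU, and every number below (micro-batch, sequence length, hidden size) is illustrative:

```python
# Rough per-layer TP traffic per GPU (illustrative numbers).
micro_batch, seq_len, hidden, tp = 1, 8192, 8192, 8
bytes_bf16 = 2
msg = micro_batch * seq_len * hidden * bytes_bf16     # one activation tensor
per_call = 2 * (tp - 1) / tp * msg                    # ring all-reduce cost
calls_per_layer = 4                                   # 2 forward + 2 backward
print(f"{calls_per_layer * per_call / 1e9:.2f} GB per GPU per layer")  # ≈ 0.94
```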
2.4 Pipeline Parallel (GPipe)#
Huang et al. 2018, 'GPipe': split the layers across GPUs.
- GPU 0: layers 0-7
- GPU 1: layers 8-15
- GPU 2: layers 16-23
- GPU 3: layers 24-31
Forward: GPU 0 → GPU 1 → ... pipeline. Activations transfer between GPUs.
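The layer-to-stage bookkeeping for the split above, as a toy script (not a real pipeline runtime):

```python
# Assign contiguous blocks of layers to pipeline stages.
n_layers, pp = 32, 4
per_stage = n_layers // pp
for gpu in range(pp):
    lo, hi = gpu * per_stage, (gpu + 1) * per_stage - 1
    print(f"GPU {gpu}: layers {lo}-{hi}")
# GPU 0: layers 0-7 ... GPU 3: layers 24-31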
2.5 Pipeline bubble#
A naive sequential pipeline leaves most GPUs idle: GPU 1 waits while GPU 0 runs its forward pass, GPU 0 waits while GPU 1 runs its backward pass, and so on. Timeline sketch:

    GPU 0 fwd: 1, 2, 3, 4 ...
    GPU 1 fwd:       1, 2, 3, 4 ...  (can start only after GPU 0)

The bubble is this idle time per pipeline stage — pure inefficiency.
2.6 Micro-batching#
GPipe trick: split batch into M micro-batches. Pipeline:
Batch = 32, M = 4 micro-batches of size 8 each:

    GPU 0 fwd: m1, m2, m3, m4
    GPU 1 fwd:     m1, m2, m3, m4
    GPU 2 fwd:         m1, m2, m3, m4
Bubble fraction = (p − 1) / (m + p − 1), where p is the number of pipeline stages and m the number of micro-batches. Larger m → smaller bubble.
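Plugging a Llama-3-70B-style pipeline depth (p = 16) into this formula shows how strongly m drives efficiency:

```python
# GPipe bubble fraction: (p - 1) / (m + p - 1).
p = 16                                    # pipeline stages
for m in (4, 16, 64, 256):                # micro-batch counts
    print(f"m={m:4d}: bubble = {(p - 1) / (m + p - 1):.1%}")
# m=4: 78.9%  m=16: 48.4%  m=64: 19.0%  m=256: 5.5%
```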
2.7 1F1B (one forward one backward) scheduling#
Narayanan et al.: 1F1B ("one forward, one backward") scheduling, introduced in PipeDream and refined in the Megatron-LM 2021 paper.
Forward and backward passes of different micro-batches are interleaved. Plain 1F1B keeps roughly the same bubble as GPipe but caps in-flight activations at p micro-batches instead of m; the interleaved variant also shrinks the bubble itself.
2.8 3D combine#
DP × TP × PP = total GPUs.
Llama-3-70B setup (estimated; a rank-layout sketch follows this list):
- DP = 192 (data parallel groups)
- TP = 8 (within node, NVLink)
- PP = 16 (across nodes, layer splits)
- Total: 192 × 8 × 16 = 24,576 GPUs
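One possible rank layout for this configuration, sketched below — keeping TP innermost, so each TP group is the 8 NVLink-connected GPUs of a single node, is a design choice; real frameworks such as Megatron-LM make the ordering configurable:

```python
# Map a global rank to (dp, pp, tp) coordinates, TP innermost.
DP, PP, TP = 192, 16, 8                   # estimated Llama-3-70B setup above

def coords(rank):
    tp = rank % TP                        # position within the NVLink island
    pp = (rank // TP) % PP                # pipeline stage
    dp = rank // (TP * PP)                # data-parallel replica
    return dp, pp, tp

assert coords(0) == (0, 0, 0)
assert coords(TP * PP) == (1, 0, 0)       # first GPU of the next DP replica
print(coords(24_575))                     # -> (191, 15, 7): the last GPU
```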
2.9 Memory math#
Each GPU memory:
- TP=8: params/8 (model shard)
- PP=16: 1/16 of layers
- Combined: P / (TP × PP) = P / 128 per GPU
- DP: replicates this sharded layout across the 192 data-parallel groups (each DP replica holds a full copy of its shards)
Llama-3-70B per-GPU memory (reproduced in the sketch after this list):
- Param shard: 70B / 128 ≈ 547M params × 2 bytes (bf16) ≈ 1.1 GB
- Gradient shard (bf16): another ≈ 1.1 GB
- Optimizer (AdamW states): ≈ 4× the param shard ≈ 4.4 GB
- Activations: 4-8 GB
- Total: ~11-15 GB per GPU
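The same arithmetic as a script (the 4× AdamW figure is the lesson's rule of thumb, and activations are a rough range — a sketch, not a profiler):

```python
# Per-GPU memory estimate for Llama-3-70B under TP=8, PP=16.
params, tp, pp = 70e9, 8, 16
shard_gb = params / (tp * pp) * 2 / 1e9   # bf16 param shard ≈ 1.1 GB
grad_gb = shard_gb                        # bf16 gradient shard ≈ 1.1 GB
optim_gb = 4 * shard_gb                   # AdamW states, 4x rule ≈ 4.4 GB
for act_gb in (4, 8):                     # activation low/high estimate
    total = shard_gb + grad_gb + optim_gb + act_gb
    print(f"activations {act_gb} GB -> total {total:.1f} GB")
# -> ≈ 10.6 GB and ≈ 14.6 GB, i.e. the ~11-15 GB range above
```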
2.10 Communication topology#
- TP: high bandwidth needed → NVLink within node (8 GPUs)
- PP: low bandwidth → cross-node InfiniBand is fine (only stage-boundary activations move, which are comparatively small)
- DP: AllReduce gradients periodically → InfiniBand
🎉 Module 13 Complete — Distributed Training
Across three lessons: DDP (the data-parallel foundation, AllReduce + NCCL), FSDP/ZeRO (sharding — Rajbhandari 2020, Llama-3-8B on a single node), and 3D parallelism (TP + PP + DP — Llama-3-70B on 24K GPUs). The systems anatomy of modern frontier LLM training is now complete. Module 13 inventory: 3 lessons, 225 min. Overall curriculum: 14 modules, 80 lessons, ~75 hours. Next up: Module 14 — Fine-tuning (SFT, LoRA, QLoRA) and Module 15 — RLHF + DPO.
Module 13 Inventory (Completed)#
| # | Lesson | Duration |
|---|---|---|
| 13.1 | DDP + AllReduce + NCCL | 70 min |
| 13.2 | FSDP + ZeRO (Rajbhandari 2020) | 75 min |
| 13.3 | 3D Parallelism: TP + PP + DP | 80 min |
| Total | 3 lessons | 225 min (~3.75 hr) |
Frequently Asked Questions
Which setup for which scale? Small (1-3B Turkish model): FSDP only (single node, 8 GPUs). 7B model: FSDP, possibly plus TP. 70B+ Turkish model: in practice, a Llama-3 base plus Turkish fine-tuning — training from scratch is not realistic while the Turkish corpus remains limited.