
Şükrü Yusuf KAYA
55 min read
Advanced
torch.distributed In Depth: DDP, FSDP, ZeRO Stages — Production Distributed Training
🌐 How do you train a 70B model on an 80 GB GPU?
Modern LLM training is impossible without a distributed framework. DDP, FSDP, ZeRO — each is a different strategy for distributed training. In 5.4 we learned the communication primitives with NCCL. Now the production stack: which strategy to use when, the memory/communication trade-offs, fine-tuning Llama 3 70B on an 8× H100 cluster. 55 minutes from now: everything a production LLM engineer needs to know, ready to move on to Module 17.

Lesson Map#

  1. Why is distributed training necessary?
  2. Data parallelism (DP, DDP)
  3. DDP gradient bucketing + overlap deep dive
  4. Model parallelism types: tensor, pipeline, sequence
  5. ZeRO (Zero Redundancy Optimizer): Stages 1, 2, 3
  6. FSDP (PyTorch native): FULL_SHARD vs SHARD_GRAD_OP vs HYBRID_SHARD
  7. DeepSpeed vs FSDP comparison
  8. 3D Parallelism: DP × TP × PP
  9. Production decision matrix
  10. Bridge to Module 17

1. Why Is Distributed Training Necessary?#

A 70B Llama 3 fine-tune scenario:
  • Weights BF16: 140 GB
  • Master FP32: 280 GB
  • Adam states (m+v): 560 GB
  • Gradients BF16: 140 GB
  • Activations: ~50 GB
  • Total: ~1.17 TB
On a single 80 GB H100: OOM.
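This arithmetic generalizes to any model size. A minimal sketch of the per-parameter accounting (2 B BF16 weights + 2 B BF16 gradients + 4 B FP32 master + 8 B Adam m+v = 16 bytes/param), assuming Adam with BF16 mixed precision; the activation term is workload-dependent, so it is passed in separately:

```python
def training_memory_gb(params_billion: float, activations_gb: float = 50.0) -> dict:
    """Rough training-memory estimate for BF16 mixed precision + Adam."""
    p = params_billion  # 1 byte/param over a billion params ~= 1 GB
    return {
        "weights_bf16": 2 * p,
        "grads_bf16": 2 * p,
        "master_fp32": 4 * p,
        "adam_m_v": 8 * p,
        "activations": activations_gb,
        "total": 16 * p + activations_gb,
    }

print(training_memory_gb(70)["total"])  # -> 1170.0 GB ~= 1.17 TB
```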

Solutions#

  1. Activation checkpointing: ~90% activation savings → ~5 GB. Total ~1.13 TB. Still too much.
  2. 8-bit optimizer: optimizer state in 8 bits → Adam states drop to ~140 GB. Still well over 700 GB.
  3. Distributed training: shard the model + states across multiple GPUs.

Which axis do we parallelize on?#

  1. Data parallel: every GPU holds the full model, works on a different batch shard. Communication: gradient sync.
  2. Model parallel (tensor): every GPU holds a slice of the model. Communication: intermediate activations.
  3. Pipeline parallel: every GPU holds a set of layers. Communication: pipeline activations.
  4. Sequence parallel: same layer, but different sequence positions. For long context.
Modern LLM training: a hybrid combination (3D parallelism). Module 17 goes deep on this.

2. Data Parallelism (DP, DDP)#

The simplest and most common. Every GPU holds a full model copy and works on a different mini-batch shard.

Legacy: DataParallel (deprecated)#

torch.nn.DataParallel
— single process, multi-threaded. Bottlenecked by the Python GIL. Don't use it.

Modern: DistributedDataParallel (DDP)#

torch.nn.parallel.DistributedDataParallel
— multi-process. Each GPU gets its own Python process. NCCL communication.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for batch in loader:
    out = model(batch.cuda())
    loss = criterion(out, target.cuda())
    loss.backward()  # gradient sync is automatic (allreduce)
    optimizer.step()
    optimizer.zero_grad()
```

Memory balance#

Every GPU holds the full model + full optimizer state. A 70B model on 8 GPUs:
  • each GPU holds ~1.17 TB → still OOM.
DDP is for small-to-medium models. Large LLMs need FSDP/ZeRO.

Launch#

```
torchrun --nproc_per_node=8 train.py
```
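The same script scales to multiple nodes via torchrun's rendezvous flags. A sketch assuming two 8-GPU nodes, where NODE_RANK and HEAD_ADDR (the head node's address) are placeholders you set per node:

```
torchrun --nnodes=2 --nproc_per_node=8 \
         --node_rank=$NODE_RANK \
         --rdzv_backend=c10d --rdzv_endpoint=$HEAD_ADDR:29500 \
         train.py
```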

3. DDP Gradient Bucketing + Overlap — Deep Dive#

Modern DDP allreduces gradients in buckets during the backward pass.

Naive (without bucketing)#

```
after backward() finishes:
    allreduce gradient_1
    allreduce gradient_2
    ...
    allreduce gradient_N    # each allreduce requires a GPU sync
```
Sequential allreduce → communication cannot start until backward finishes.

DDP bucketing#

```
during backward:
    Bucket 1 ready (25 MB of gradients) → allreduce START (in the background)
    Bucket 2 ready → allreduce START
    Bucket 3 ready → ...
```
Communication and computation overlap → ~15-25% speedup.

Bucket size#

PyTorch default: 25 MB. Tuning:

```python
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=50)
```

  • Small buckets (5 MB): more overlap, but more NCCL calls → kernel-launch overhead
  • Large buckets (100 MB): fewer NCCL calls, but less overlap
25 MB is the sweet spot for most workloads.
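The honest way to pick a value is to measure. A rough timing sketch, assuming `raw_model` and a `run_steps` training-step helper exist in your script (a real benchmark would rebuild the model fresh per trial rather than re-wrapping it):

```python
import time

for cap_mb in (5, 25, 50, 100):
    ddp_model = DDP(raw_model, device_ids=[local_rank], bucket_cap_mb=cap_mb)
    run_steps(ddp_model, n=5)   # warmup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    run_steps(ddp_model, n=20)  # measured steps
    torch.cuda.synchronize()
    per_step_ms = (time.perf_counter() - t0) / 20 * 1e3
    if dist.get_rank() == 0:
        print(f"bucket_cap_mb={cap_mb}: {per_step_ms:.1f} ms/step")
    del ddp_model               # drop the DDP wrapper before the next trial
```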

gradient_as_bucket_view#

```python
model = DDP(model, device_ids=[local_rank], gradient_as_bucket_view=True)
```
Memory savings: gradients become views into the bucket — no extra tensors. Recommended on modern PyTorch.

Static graph (advanced)#

```python
model._set_static_graph()
```
Fixes DDP's gradient-sync order → enables more aggressive optimization. Use it only when the model has no conditional control flow.

Mixed precision interaction#

```python
from torch.amp import autocast

with autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
    loss = criterion(out, target)
loss.backward()  # BF16 gradient sync (efficient)
```
BF16 gradients are half the size → communication is ~2× faster. Recent PyTorch/NCCL versions support BF16 natively.

4. Model Parallelism Types#

When DDP isn't enough, parallelize the model itself. Three main types:

Tensor Parallelism (TP)#

Split a tensor (e.g. a weight matrix W) column-wise or row-wise across GPUs. Each GPU does a partial computation; results are combined with all_gather.

```
W: (4096, 4096)
GPU 0: W[:, 0:2048]
GPU 1: W[:, 2048:4096]
Compute: y = x @ W → each GPU holds a partial y, then all_gather
```

The classic Megatron-LM implementation. Communication-intensive (activation sync at every layer).
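A forward-only sketch of the column-parallel idea with raw collectives (training would also need the collective's backward handled, which Megatron does internally); it assumes the process group is already initialized and the output dimension divides evenly across ranks:

```python
import torch
import torch.distributed as dist

def column_parallel_matmul(x: torch.Tensor, w_full: torch.Tensor) -> torch.Tensor:
    """Each rank multiplies by its column slice of W, then all ranks gather."""
    rank, world = dist.get_rank(), dist.get_world_size()
    cols = w_full.shape[1] // world
    w_shard = w_full[:, rank * cols:(rank + 1) * cols]  # local column block
    y_local = x @ w_shard                               # partial output
    parts = [torch.empty_like(y_local) for _ in range(world)]
    dist.all_gather(parts, y_local)                     # collect every block
    return torch.cat(parts, dim=-1)                     # reassemble full y
```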

Pipeline Parallelism (PP)#

Split the model layer-wise across GPUs, into pipeline stages.

```
GPU 0: layers 0-10
GPU 1: layers 11-20
GPU 2: layers 21-30
GPU 3: layers 31-40
Forward:  GPU 0 → 1 → 2 → 3
Backward: reverse direction
```

GPipe, PipeDream, 1F1B: pipeline scheduling algorithms.
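A forward-only toy of two pipeline stages using point-to-point communication (real schedulers like GPipe/1F1B stream many microbatches through to shrink the idle bubble); `stage0`, `stage1`, and the activation shape are assumptions of the sketch:

```python
import torch
import torch.distributed as dist

def two_stage_forward(x, stage0, stage1, act_shape):
    """Rank 0 runs the first half of the model, rank 1 the second half."""
    if dist.get_rank() == 0:
        h = stage0(x)                        # layers 0..k
        dist.send(h.contiguous(), dst=1)     # ship activations downstream
        return None
    h = torch.empty(act_shape, device="cuda")
    dist.recv(h, src=0)                      # receive stage-0 activations
    return stage1(h)                         # layers k+1..N
```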

Sequence Parallelism (SP)#

For long context. The same layer is sharded by sequence position. Combined with FlashAttention.

ZeRO / FSDP#

Hybrid: shard optimizer states + gradients + weights. Communication mostly resembles DP-style allreduce, but the optimizer/gradient/weight shards are gathered at different points in the step.
Modern LLM training mostly uses FSDP or DeepSpeed ZeRO. Pure TP/PP lives in frontier labs with custom infrastructure.

5. ZeRO Stages — Zero Redundancy Optimizer#

DeepSpeed ZeRO (Rajbhandari et al., 2020): shard the optimizer states, gradients, and weights across GPUs. Reduces memory drastically.

Memory breakdown problem#

In standard DDP every GPU holds the full model. For 70B:
  • Weights BF16: 140 GB
  • Master FP32: 280 GB
  • Adam m+v: 560 GB
  • Gradients BF16: 140 GB
  • Total per GPU: ~1.1 TB
8 GPUs = 8× redundancy (pre-ZeRO).

ZeRO Stage 1: Optimizer States Sharded#

The optimizer states — FP32 master copy plus Adam m+v (280 + 560 = 840 GB) — are sharded, so each GPU keeps only its slice:
  • Optimizer states: 840 GB / 8 = 105 GB per GPU
  • Weights + gradients: full (140 + 140 = 280 GB per GPU)
  • Total per GPU: ~435 GB (including ~50 GB activations)
Still big, but ~60% savings.

ZeRO Stage 2: + Gradients Sharded#

Stage 1 + gradient sharding:
  • Optimizer states: 105 GB
  • Gradients: 140 GB / 8 = 17.5 GB
  • Weights: 140 GB (full)
  • Total per GPU: ~310 GB
~73% savings.

ZeRO Stage 3: + Weights Sharded (= FSDP)#

Stage 2 + weight sharding:
  • Optimizer states: 105 GB
  • Gradients: 17.5 GB
  • Weights: 140 GB / 8 = 17.5 GB
  • Total per GPU: ~190 GB (~140 GB of states plus activations)
~85% savings. Note that this still exceeds a single 80 GB H100: in practice, 70B on 8× 80 GB needs Stage 3 plus activation checkpointing and CPU offload — or more GPUs (at 16 GPUs the sharded states alone drop to ~70 GB per GPU).

Trade-off: communication#

In Stage 3 the weights themselves are sharded → all_gather weights in the forward pass, reduce_scatter gradients in the backward pass.
Per-step communication volume (M = model size in bytes):
  • Stage 1: gradient reduce_scatter + all_gather of updated weights (~2M, same volume as a DDP ring allreduce)
  • Stage 2: same pattern, with gradients kept sharded (~2M)
  • Stage 3: weight all_gather in forward and backward + gradient reduce_scatter (~3M)
Stage 3: maximum memory savings, maximum communication — roughly 1.5× DDP, per the ZeRO paper.
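To make the trade-off concrete, the same accounting as a back-of-the-envelope sketch, with M taken as the BF16 gradient/weight volume of a 70B model:

```python
M = 140  # GB: 70B parameters in BF16

per_step_comm_gb = {
    # Stages 1/2: reduce_scatter grads + all_gather updated weights ~ 2M
    "DDP / ZeRO-1 / ZeRO-2": 2 * M,
    # Stage 3: weight all_gather (fwd) + weight all_gather (bwd)
    #          + gradient reduce_scatter ~ 3M
    "ZeRO-3 / FSDP FULL_SHARD": 3 * M,
}
for strategy, gb in per_step_comm_gb.items():
    print(f"{strategy}: ~{gb} GB moved per step")  # 280 vs 420 GB
```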

Practical recommendation#

| Memory situation | Recommended stage |
| --- | --- |
| Enough GPU memory | Stage 1 (least communication) |
| Memory tight, communication budget OK | Stage 2 |
| Memory critical | Stage 3 |

The modern recommendation for LLM training (70B+): Stage 3 (= FSDP FULL_SHARD).
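On the DeepSpeed side, the stage is a single config field. A minimal sketch that passes the config as a dict to deepspeed.initialize — the batch size, model, and surrounding script are placeholders:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # pick 1, 2, or 3 per the table above
        "overlap_comm": True,  # overlap communication with backward
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```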

6. FSDP (PyTorch Native) — Shard Strategies#

FSDP (Fully Sharded Data Parallel, PyTorch 1.11+) — the native PyTorch implementation of ZeRO Stage 3, with no DeepSpeed dependency.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = MyModel().cuda()
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3 equivalent
    auto_wrap_policy=...,   # nested FSDP wrapping
    mixed_precision=...,    # BF16 mixed precision
)
```

Sharding strategies#

FULL_SHARD (ZeRO-3)

Weights, gradients, optimizer state — all sharded. Max memory savings, max communication.

SHARD_GRAD_OP (ZeRO-2 equivalent)

Gradients + optimizer state sharded. Weights stay full. Less communication.

NO_SHARD (DDP equivalent)

Nothing is sharded. Pure DDP. No memory savings.

HYBRID_SHARD (new, 2023+)

A very powerful pattern: FULL_SHARD inside a node (fast NVLink), plain data parallelism across nodes.

```
Node 1: GPUs 0-7    FULL_SHARD (intra-node)
Node 2: GPUs 8-15   FULL_SHARD (intra-node)
Node 1 ↔ Node 2:    DDP (inter-node)
```

Much faster for multi-node training: intra-node NVLink (~600 GB/s on A100, ~900 GB/s on H100) versus inter-node InfiniBand, which is roughly an order of magnitude slower per link. Doing FULL_SHARD across nodes is slow — HYBRID_SHARD solves exactly that.
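Selecting it is one enum value; when no explicit process group is passed, FSDP derives the intra-node shard group and inter-node replication group from the cluster topology itself. A sketch reusing the auto_wrap_policy defined in the next subsection:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard intra-node, replicate inter-node
    auto_wrap_policy=my_auto_wrap_policy,
    device_id=local_rank,
)
```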

_HYBRID_SHARD_ZERO2 (Stage-2 hybrid)

Intra-node SHARD_GRAD_OP, inter-node DDP. Less memory saved, but still meaningful savings.

auto_wrap_policy#

FSDP wants nested wrapping — each transformer block as its own FSDP unit. Wrapping the whole model in a single FSDP unit is inefficient.

```python
import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

my_auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},  # e.g. the Llama decoder-block class
)
model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)
```

Each TransformerBlock becomes its own FSDP unit → a separate all_gather/reduce_scatter per layer, which overlaps with computation.

Mixed precision in FSDP#

```python
from torch.distributed.fsdp import MixedPrecision

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
model = FSDP(model, mixed_precision=mp_policy)
```

7. DeepSpeed vs FSDP — A Comparison#

Both implement ZeRO. The differences:

DeepSpeed#

  • Microsoft Research (2020)
  • Owns ZeRO (the original paper)
  • Feature-rich: pipeline parallelism, MoE, an inference engine, training optimizations
  • Complex configuration: a ds_config.json with 100+ options
  • Strong in production

FSDP (PyTorch native)#

  • Meta (PyTorch 1.11+, mature since 2.0)
  • Simpler API
  • PyTorch native — torch.distributed.fsdp
  • Meta used FSDP for Llama 3 training

Which one?#

| Use case | Recommendation |
| --- | --- |
| LLM training (≤70B) | FSDP (simple, sufficient) |
| LLM training (>100B) | DeepSpeed (pipeline parallelism + 3D combinations) |
| Research, fast iteration | FSDP (PyTorch ecosystem) |
| HuggingFace transformers + Accelerate | Either (Accelerate supports both) |
| Production-grade big models | DeepSpeed (battle-tested) |

Migration#

The modern trend: FSDP is enough for most workloads. Llama 3, Mistral, and most of open source follow Meta's patterns. DeepSpeed remains strong in enterprise and the Microsoft ecosystem.

HuggingFace Accelerate#

It wraps both:

```
accelerate config           # interactive setup
accelerate launch train.py
```

Beginner-friendly, abstracts the details away. In production you sometimes lose low-level access.
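The training-loop side stays small. A sketch of the Accelerate API — whether DDP, FSDP, or DeepSpeed runs underneath is decided by the `accelerate config` answers, not by the code:

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```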

8. 3D Parallelism — DP × TP × PP#

For frontier-scale (1T+ parameter) training, no single strategy is enough. 3D = a combination of three axes:

Axis 1: Data Parallelism (DP)#

Each DP rank gets a different batch shard. Communication: gradient allreduce.

Axis 2: Tensor Parallelism (TP)#

Same layer, but the TP ranks shard its matrices. Communication: all_gather at every layer.

Axis 3: Pipeline Parallelism (PP)#

Different layers live on different PP ranks. Communication: pipeline activations.

Hybrid example#

```
GPT-4 hypothesis (rumored): ~25,000 GPUs
DP=16 × TP=8 × PP=200 = 25,600 total positions
DP=16:  16 different batch shards
TP=8:   8 GPUs share each layer's matrices
PP=200: model split into 200 pipeline stages
```

Why 3D?#

  • TP alone: too much per-layer communication → bandwidth-limited
  • PP alone: pipeline bubbles (idle GPUs) → throughput drops
  • DP alone: memory-limited (full model on every GPU)
  • Combined 3D: each axis's weakness is compensated by another

Practical sweet spot#

Llama 3 70B training: DP=128 × TP=8 (TP intra-node, DP across nodes) — no pipeline parallelism was used (it wasn't a ~25K-GPU run).

Sequence Parallel (SP) — the 4th axis#

For long context (128K+): shard the sequence dimension with SP. SP × TP × DP = effectively 4D. Used in Llama 3 long-context training.
Module 17 (Distributed Training) covers all of this in full detail.

9. Production Decision Matrix#

Model size + hardware + budget → the distributed strategy choice.

Scenario 1: 7B model, 4× A100 80GB#

  • Memory: comfortable
  • Strategy: DDP or FSDP NO_SHARD
  • Reason: communication-efficient, simple

Scenario 2: 13B model, 8× A100 80GB#

  • Memory tight (13B × ~16-20 bytes/param ≈ 210-260 GB)
  • Strategy: FSDP SHARD_GRAD_OP
  • Reason: optimizer + gradients sharded, weights full

Scenario 3: 70B model, 8× H100 80GB#

  • Memory critical
  • Strategy: FSDP FULL_SHARD (plus activation checkpointing / CPU offload, per Section 5)
  • Reason: full sharding required

Scenario 4: 70B model, 64× H100 (8 nodes × 8 GPUs)#

  • Multi-node, mixed bandwidth
  • Strategy: FSDP HYBRID_SHARD (intra-node FULL_SHARD, inter-node DP)
  • Reason: NVLink is fast intra-node, InfiniBand is slower inter-node

Scenario 5: 405B Llama 3.1, 1024× H100#

  • Frontier scale
  • Strategy: 3D parallelism (FSDP + TP + sequence parallel)
  • Reason: a single axis isn't enough

Scenario 6: fine-tuning 7B, 1× A100 40GB#

  • Single GPU
  • Strategy: LoRA + QLoRA (Module 21)
  • Reason: no need for distributed training, parameter-efficient

A Turkish perspective#

95% of Turkish companies sit in Scenarios 1-3, renting 1-8 GPUs on the likes of Modal/Runpod. DDP or FSDP is enough. Frontier scale (Scenario 5) only appears in academic research or TÜBİTAK-scale projects.
```python
import os
from functools import partial

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
    BackwardPrefetch,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# Init
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# FSDP wrap policy: each decoder layer becomes its own FSDP unit
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
auto_wrap_policy = partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

# Mixed precision
mp = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Wrap with FSDP
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=mp,
    device_id=local_rank,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,  # overlap optimization
    use_orig_params=True,
)

# Optimizer (FSDP handles the sharded state automatically)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Training loop
for batch in dataloader:
    input_ids = batch["input_ids"].cuda()
    labels = batch["labels"].cuda()
    outputs = model(input_ids, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
Production-grade FSDP setup — Llama 3 fine-tuning.

10. Bridge to Module 17#

This lesson laid the foundations of distributed training. Module 17 (Pre-training Compute Engineering) goes deeper:
  • The deep math of DP and FSDP/ZeRO
  • Megatron-LM-style tensor parallelism
  • Pipeline parallelism: GPipe + 1F1B + interleaved schedules
  • Sequence parallelism + ring attention
  • 3D parallelism configuration optimization
  • Communication pattern analysis (NVLink, IB, RDMA)
  • Failure recovery + checkpointing
  • Elastic training (dynamic resizing)
  • Performance modeling

Preparation checklist#

Before moving on to Module 17:
  • Train a simple MLP with DDP (2 GPUs)
  • Try fine-tuning a 7B model with FSDP FULL_SHARD
  • Observe the effect of FSDP sharding while memory profiling (Module 5.3)
  • Read a communication trace with NCCL_DEBUG=INFO
  • Configure HuggingFace Accelerate

Practical advice#

Before going to production:
  1. Prototype on a single GPU
  2. Validate with DDP on 2-4 GPUs
  3. Scale up with FSDP
  4. Optimize with the Module 17 patterns

11. Mini Exercises#

  1. Memory savings: a 13B model on 8× A100 40GB. How do DDP and FSDP FULL_SHARD compare on memory?
  2. Communication cost: a 70B model with 8-GPU FSDP FULL_SHARD. What is the per-step communication volume?
  3. HYBRID_SHARD decision: 4 nodes × 8 GPUs = 32 GPUs, NVLink intra-node, InfiniBand inter-node. Which strategy?
  4. DDP bucket size: a Llama 3 8B fine-tune on an A100 cluster. What's the optimal bucket_cap_mb?
  5. DeepSpeed vs FSDP: training a custom 400B model. Which one?

What Did We Learn in This Lesson?#

✓ Why distributed training — a 70B model needs ~1.17 TB of memory
✓ Data parallelism (DDP) — full replication + gradient sync
✓ DDP gradient bucketing + overlap — the production pattern
✓ Model parallelism types: TP (Megatron), PP (GPipe), SP (ring attention)
✓ ZeRO Stages 1/2/3 — optimizer/gradient/weight sharding
✓ FSDP strategies: FULL_SHARD, SHARD_GRAD_OP, HYBRID_SHARD, NO_SHARD
✓ DeepSpeed vs FSDP comparison
✓ 3D parallelism — DP × TP × PP at frontier scale
✓ Production decision matrix — model size × hardware × strategy
✓ Bridge to Module 17 — ready for deep distributed training

Next Lesson#

5.7 — Debug Arsenal: register_hook, anomaly mode, torch.utils.benchmark. The toolkit for when things break in production: intercepting forward/backward with hooks, anomaly detection, reproducibility patterns, deterministic mode. The LLM engineer's daily debug arsenal.

Frequently Asked Questions

How much slower is FSDP than DDP?

Generally a small overhead (~5-15%), but the memory savings are dramatic. FSDP communication: forward all_gather + backward reduce_scatter, versus DDP's single allreduce. On an NVLink-rich cluster (8× H100) the difference isn't critical; inter-node InfiniBand widens the gap (mitigated by HYBRID_SHARD). Practical decision: memory comfortable → DDP (fastest). Memory tight → FSDP is necessary, and the ~10% overhead is worth tolerating.
