DeepSeek-V3 / R1 (671B, 37B Active): Shared Expert + Fine-Grained Routing — Where to LoRA?
DeepSeek-V3 (671B params, 37B active) is the best open example of a modern MoE: one shared expert (common knowledge for every token) plus 256 fine-grained routed experts. DeepSeek-R1 is the same architecture plus RL for reasoning. Impossible on an RTX 4090; the cookbook's cloud recipe is 16×H100 (NDR InfiniBand) + ZeRO-Infinity + expert parallelism.
1. DeepSeek-V3 Architecture
| Aspect | Value |
|---|---|
| Total params | 671B |
| Active params per token | 37B |
| Layers | 61 |
| Hidden | 7168 |
| Routed experts | 256 |
| Active routed experts per token | 8 |
| Shared experts | 1 (always-active) |
| Hidden per expert (FFN) | 2048 |
| Context length | 128K (YaRN) |
| Pre-train tokens | 14.8T |
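As a rough sanity check of the 37B active-parameter figure, the expert contribution can be estimated directly from the table. This is a back-of-the-envelope sketch: it assumes a SwiGLU-style three-matrix expert FFN, treats every layer as an MoE layer, and lumps attention and embeddings into the remainder.

```python
# Back-of-the-envelope check of "37B active params per token".
# Only uses numbers from the table above; attention/embedding sizes are
# not listed there, so they are folded into an approximate remainder.

hidden = 7168           # model hidden size
expert_ffn = 2048       # per-expert FFN hidden size
layers = 61
active_experts = 8 + 1  # 8 routed (top-8) + 1 shared, per token

# Assumed SwiGLU-style expert: gate/up/down projections = 3 * hidden * expert_ffn
params_per_expert = 3 * hidden * expert_ffn                 # ≈ 44M
active_expert_params = layers * active_experts * params_per_expert

print(f"active expert params ≈ {active_expert_params / 1e9:.1f}B")  # ≈ 24B
# The remaining ~13B of the quoted 37B comes from attention (MLA),
# embeddings / LM head, and the few dense FFN layers.
```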
Architectural innovations:
- Shared expert — every token passes through this expert (stable common knowledge)
- Fine-grained routing — 256 experts × top-8 (instead of the classic 8 × top-2), better specialization
- Auxiliary-loss-free balancing — load balancing via a per-expert bias term, no manual aux loss (see the sketch after this list)
- Multi-Token Prediction (MTP) — pre-training objective with additional next-2-token prediction
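A minimal sketch of fine-grained routing with aux-loss-free balancing (illustrative names and shapes, not DeepSeek's actual implementation): top-k selection uses the gating scores plus a per-expert bias that is nudged after every batch to even out expert load, while the mixture weights come from the raw scores; the shared expert bypasses routing entirely.

```python
import torch

NUM_EXPERTS, TOP_K, HIDDEN = 256, 8, 7168

class AuxFreeRouter(torch.nn.Module):
    """Illustrative fine-grained router with aux-loss-free balancing."""
    def __init__(self):
        super().__init__()
        self.gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
        # Per-expert bias used ONLY for top-k selection, not for the weights.
        self.register_buffer("balance_bias", torch.zeros(NUM_EXPERTS))

    def forward(self, x):                        # x: [tokens, HIDDEN]
        scores = torch.sigmoid(self.gate(x))     # affinity per expert
        # Select experts on biased scores (balancing), weight on raw scores.
        _, idx = torch.topk(scores + self.balance_bias, TOP_K, dim=-1)
        weights = torch.gather(scores, -1, idx)
        weights = weights / weights.sum(-1, keepdim=True)
        return idx, weights                      # routed expert ids + mix weights

    @torch.no_grad()
    def update_bias(self, idx, lr=1e-3):
        # After each batch: lower the bias of overloaded experts,
        # raise it for underloaded ones — no auxiliary loss term.
        load = torch.bincount(idx.flatten(), minlength=NUM_EXPERTS).float()
        self.balance_bias -= lr * torch.sign(load - load.mean())
```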
2. DeepSeek-V3 Fine-Tuning Strategy
The cookbook's ranked options for fine-tuning a 671B model:
(a) LoRA on attention only — cheapest, medium quality (a config sketch follows this list)
- Weights (NF4) sharded: ~85 GB per GPU across 16 GPUs
- LoRA params: ~200M
- Training: 8-16×H100 + FSDP + expert parallelism (EP=8)
- 50K instructions, 1 epoch: ~8-12 hours on 16×H100
- Cost: $400-600
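A hedged sketch of what the attention-only LoRA setup could look like with Hugging Face PEFT. The module names follow the common HF DeepSeek-V3 MLA projection names and are an assumption here; verify them against the checkpoint you actually load.

```python
from peft import LoraConfig, get_peft_model

# Attention-only LoRA (option a). Module names below follow the usual
# HF DeepSeek-V3 MLA naming; confirm against your checkpoint before use.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_a_proj", "q_b_proj",             # low-rank query projections (MLA)
        "kv_a_proj_with_mqa", "kv_b_proj",  # compressed KV projections (MLA)
        "o_proj",                           # attention output projection
    ],
    task_type="CAUSAL_LM",
)

# model = ...  # 671B base loaded sharded (NF4 + ZeRO-3/FSDP + EP), see below
# model = get_peft_model(model, lora_cfg)
# model.print_trainable_parameters()  # should land in the ~200M range
```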
(b) LoRA on routed experts only — adapts expert specialization (a regex-based config sketch follows this list)
- LoRA target: mlp.experts.{0..255}.gate_proj (and similar expert projections)
- Trainable params: 800M-1B (256 experts × LoRA)
- Cost: roughly the same as (a)
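For option (b), PEFT also accepts a regex for target_modules, which avoids listing 256 expert indices by hand. Again a sketch: the mlp.experts.N.gate_proj path is taken from the bullet above and should be checked against the loaded model's module names.

```python
from peft import LoraConfig

# Routed-experts-only LoRA (option b): match every expert's gate projection
# via a regex instead of enumerating mlp.experts.0..255 by hand.
expert_lora_cfg = LoraConfig(
    r=8,            # keep the rank small: 256 experts × 61 layers add up fast
    lora_alpha=16,
    target_modules=r".*mlp\.experts\.\d+\.gate_proj",
    task_type="CAUSAL_LM",
)
# 256 experts × ~60 layers at r=8 already gives roughly 1B trainable params,
# consistent with the 800M-1B quoted above; add up/down_proj only if needed.
```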
(c) Full SFT — best quality, very expensive
- 64-128×H100 + FSDP + EP + PP
- 50K instructions: 1-2 days
- Cost: $5K-15K
Cookbook recommendation: (a) — attention-only LoRA is the most cost-effective.
```python
# === DeepSeek-V3 Expert Parallelism (EP) ===
# EP = distribute the experts across GPUs
# 256 experts / 8 EP = 32 experts per GPU

# Megatron-DeepSpeed parallelism layout
ep_size = 8                          # 256 experts / 8 = 32 per GPU
tp_size = 2                          # tensor parallel for attention
pp_size = 1                          # no pipeline parallelism
dp_size = 16 // (ep_size * tp_size)  # data parallel degree on 16 GPUs

# DeepSpeed config
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "cpu"},  # 37B active params → optimizer state on CPU
    },
    "moe": {
        "enabled": True,
        "ep_size": 8,
        "experts_per_layer": 256,
        "load_balancing_type": "aux_free",  # DeepSeek-V3 style
        "use_residual": True,               # shared expert
    },
}
```
DeepSeek-V3 EP config
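Assuming the config dict above is stored as ds_config, it can be handed straight to DeepSpeed's initializer. This is an illustrative call: model and train_params stand in for your own model and trainable-parameter list.

```python
import deepspeed

# Hand the ZeRO-3 + MoE config above to DeepSpeed.
# `model` and `train_params` are placeholders for your own setup.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=train_params,
    config=ds_config,
)
```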
✅ Deliverables
1) Read the DeepSeek-V3 paper — especially the MoE design section.
2) If you have cloud access, plan a 16-GPU mini fine-tune.
3) Next lesson: 5.4 — Qwen3-MoE + Llama-4-MoE Pattern.