DeepSeek-V3 / R1 (671B, 37B Active): Shared Expert + Fine-Grained Routing - Which Part Gets LoRA?

DeepSeek-V3 (671B params, 37B active) is the best open example of modern MoE: a shared expert (the "common knowledge" path every token passes through) plus 256 fine-grained routed experts. DeepSeek-R1 is the same architecture plus RL for reasoning. It is out of reach on an RTX 4090; the cookbook's cloud recipe is 16×H100 with NDR InfiniBand + ZeRO-Infinity + expert parallelism.

Şükrü Yusuf KAYA
36 minute read
Advanced

1. DeepSeek-V3 Architecture

Aspect                          | Value
Total params                    | 671B
Active params per token         | 37B
Layers                          | 61
Hidden size                     | 7168
Routed experts                  | 256
Active routed experts per token | 8
Shared experts                  | 1 (always active)
Expert FFN hidden size          | 2048
Context length                  | 128K (YaRN)
Pre-training tokens             | 14.8T
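
To see why only ~37B of the 671B parameters are active per token, here is a rough back-of-envelope estimate from the numbers in the table. It assumes SwiGLU experts (three 7168×2048 projections each) and that roughly 58 of the 61 layers are MoE layers (the first few use a dense FFN); treat it as a sanity check, not an exact count.
python
# Back-of-envelope parameter estimate from the table above (sanity check only).
hidden, expert_inner = 7168, 2048
n_routed, top_k, n_shared = 256, 8, 1
n_moe_layers = 58                          # assumption: first 3 of the 61 layers are dense

per_expert = 3 * hidden * expert_inner     # gate/up/down projections (SwiGLU)
routed_total = per_expert * n_routed * n_moe_layers
active_ffn = per_expert * (top_k + n_shared) * n_moe_layers

print(f"params per expert:    {per_expert / 1e6:.0f}M")     # ~44M
print(f"all routed experts:   {routed_total / 1e9:.0f}B")   # ~650B, the bulk of the 671B
print(f"active FFN per token: {active_ffn / 1e9:.1f}B")     # ~23B; MLA attention, embeddings
                                                             # and the dense layers bring it to ~37B
Back-of-envelope active-parameter estimate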
Architectural innovations (the first three are sketched in the code below):
  1. Shared expert: every token passes through this expert, keeping common knowledge stable.
  2. Fine-grained routing: 256 experts × top-8 (instead of the classic 8 × top-2), for better specialization.
  3. Auxiliary-loss-free balancing: load balancing via a per-expert bias term, with no manual auxiliary loss.
  4. Multi-Token Prediction (MTP): an additional next-2-token prediction objective during pre-training.
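
A minimal, illustrative sketch of how the first three ideas fit together: one always-on shared expert, sigmoid affinity scores over the routed experts with top-8 selection, and a balancing bias that only influences which experts are selected, not the mixing weights. The class and variable names are my own, and this is a simplified toy rather than DeepSeek's implementation; real systems use batched expert kernels and expert parallelism.
python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    """SwiGLU-style expert FFN (gate/up/down projections)."""
    def __init__(self, hidden, inner):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, inner, bias=False)
        self.up_proj = nn.Linear(hidden, inner, bias=False)
        self.down_proj = nn.Linear(inner, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class SharedPlusRoutedMoE(nn.Module):
    def __init__(self, hidden, inner, n_routed=256, top_k=8):
        super().__init__()
        self.shared = FFNExpert(hidden, inner)   # shared expert: always active
        self.experts = nn.ModuleList([FFNExpert(hidden, inner) for _ in range(n_routed)])
        self.router = nn.Linear(hidden, n_routed, bias=False)
        # Aux-loss-free balancing: a per-expert bias nudged up/down between steps
        # depending on expert load (updated outside SGD, no auxiliary loss term).
        self.register_buffer("balance_bias", torch.zeros(n_routed))
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, hidden)
        scores = torch.sigmoid(self.router(x))   # per-expert affinity
        # The bias affects only which experts are chosen, not the gate weights.
        topk_idx = (scores + self.balance_bias).topk(self.top_k, dim=-1).indices
        topk_w = scores.gather(-1, topk_idx)
        topk_w = topk_w / topk_w.sum(-1, keepdim=True)
        out = self.shared(x)                     # shared-expert path for every token
        for slot in range(self.top_k):           # naive loop; real kernels batch this
            for e in topk_idx[:, slot].unique().tolist():
                mask = topk_idx[:, slot] == e
                out[mask] += topk_w[mask, slot, None] * self.experts[e](x[mask])
        return out

# Toy sizes for a smoke test (the real model: hidden=7168, inner=2048, 256 experts).
moe = SharedPlusRoutedMoE(hidden=64, inner=32, n_routed=16, top_k=4)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
Shared expert + fine-grained routing, toy sketch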

2. DeepSeek-V3 FT Strategy

The cookbook's ranked options for fine-tuning the 671B model:
(a) LoRA on attention only: cheapest, moderate quality
  • Weights (NF4) sharded: ~85 GB per GPU across 16 GPUs
  • LoRA params: ~200M
  • Training: 8-16×H100 + FSDP + expert parallelism (EP=8)
  • 50K instructions, 1 epoch: ~8-12 hours on 16×H100
  • Cost: $400-600
(b) LoRA on routed experts only: adapts expert specialization
  • LoRA targets:
    mlp.experts.{0..255}.gate_proj
    (etc.)
  • Trainable params: 800M-1B (LoRA across 256 experts)
  • Cost: same as (a)
(c) Full SFT: best quality, very expensive
  • 64-128×H100 + FSDP + EP + PP
  • 50K instructions: 1-2 days
  • Cost: $5K-15K
Cookbook recommendation: (a); attention-only LoRA is the most cost-effective (see the PEFT sketch below).
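A hedged sketch of how options (a) and (b) could be expressed with Hugging Face PEFT. The expert path follows the mlp.experts.{i}.gate_proj naming from option (b); the attention projection names are assumptions about the HF DeepSeek-V3 port (it uses MLA rather than plain q/k/v projections), so verify every name against model.named_modules() before launching.
python
# Sketch only: LoRA target selection for options (a) and (b) with PEFT.
from peft import LoraConfig, get_peft_model

# (a) Attention-only LoRA: cheapest, ~200M trainable params.
attn_lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed MLA projection names -- confirm with model.named_modules().
    target_modules=["q_a_proj", "q_b_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# (b) Routed-experts-only LoRA: 800M-1B trainable params (256 experts x LoRA).
# A regex string keeps the shared expert and the router frozen.
expert_lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=r".*mlp\.experts\.\d+\.(gate_proj|up_proj|down_proj)",
    task_type="CAUSAL_LM",
)

# model = AutoModelForCausalLM.from_pretrained(...)   # NF4-quantized, sharded over 16 GPUs
# model = get_peft_model(model, attn_lora)            # or expert_lora for option (b)
LoRA target configs for options (a) and (b)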
python
# === DeepSeek-V3 Expert Parallelism (EP) ===
# EP distributes the routed experts across GPUs:
# 256 experts / EP=8 -> 32 experts per GPU.

# Megatron-DeepSpeed parallelism layout
num_gpus = 16
ep_size = 8                                  # 256 experts / 8 = 32 per GPU
tp_size = 2                                  # tensor parallelism for attention
pp_size = 1                                  # no pipeline parallelism
dp_size = num_gpus // (ep_size * tp_size)    # data-parallel degree (= 1 here)

# DeepSpeed config (Python dict; dump to JSON for the launcher)
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "cpu"},  # 37B active params -> optimizer states on CPU
    },
    "moe": {
        "enabled": True,
        "ep_size": 8,
        "experts_per_layer": 256,
        "load_balancing_type": "aux_free",       # DeepSeek-V3-style balancing
        "use_residual": True,                    # shared expert
    },
}
DeepSeek-V3 EP config
✅ Deliverables
  1. Read the DeepSeek-V3 paper, especially the MoE design section.
  2. If you have cloud access, plan a 16-GPU mini fine-tune.
  3. Next lesson: 5.4, Qwen3-MoE + Llama-4-MoE Pattern.
