DeepSeek-V3 / R1 (671B, 37B Active): Shared Expert + Fine-Grained Routing — Where to LoRA?
DeepSeek-V3 (671B params, 37B active) is the best open example of a modern MoE: one shared expert (common knowledge for every token) plus 256 fine-grained routed experts. DeepSeek-R1 is the same architecture plus RL for reasoning. Impossible on an RTX 4090; the cookbook's cloud recipe is 16×H100 (NDR InfiniBand) + ZeRO-Infinity + expert parallelism.
1. DeepSeek-V3 Architecture
| Aspect | Value |
|---|---|
| Total params | 671B |
| Active params per token | 37B |
| Layers | 61 |
| Hidden | 7168 |
| Routed experts | 256 |
| Active routed experts per token | 8 |
| Shared experts | 1 (always-active) |
| Hidden per expert (FFN) | 2048 |
| Context length | 128K (YaRN) |
| Pre-train tokens | 14.8T |
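As a rough sanity check of the 37B active-parameter figure, the expert contribution can be estimated directly from the table. This is a back-of-the-envelope sketch: it assumes a SwiGLU-style three-matrix expert FFN, treats every layer as an MoE layer, and lumps attention and embeddings into the remainder.

```python
# Back-of-the-envelope check of "37B active params per token".
# Only uses numbers from the table above; attention/embedding sizes are
# not listed there, so they are folded into an approximate remainder.

hidden = 7168           # model hidden size
expert_ffn = 2048       # per-expert FFN hidden size
layers = 61
active_experts = 8 + 1  # 8 routed (top-8) + 1 shared, per token

# Assumed SwiGLU-style expert: gate/up/down projections = 3 * hidden * expert_ffn
params_per_expert = 3 * hidden * expert_ffn                 # ≈ 44M
active_expert_params = layers * active_experts * params_per_expert

print(f"active expert params ≈ {active_expert_params / 1e9:.1f}B")  # ≈ 24B
# The remaining ~13B of the quoted 37B comes from attention (MLA),
# embeddings / LM head, and the few dense FFN layers.
```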
Architectural innovations:
- Shared expert — every token passes through this expert (stable common knowledge)
- Fine-grained routing — 256 experts × top-8 (instead of the classic 8 × top-2), better specialization
- Auxiliary-loss-free balancing — load balancing via a per-expert bias term, no manual aux loss (see the sketch after this list)
- Multi-Token Prediction (MTP) — pre-training objective with additional next-2-token prediction
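A minimal sketch of fine-grained routing with aux-loss-free balancing (illustrative names and shapes, not DeepSeek's actual implementation): top-k selection uses the gating scores plus a per-expert bias that is nudged after every batch to even out expert load, while the mixture weights come from the raw scores; the shared expert bypasses routing entirely.

```python
import torch

NUM_EXPERTS, TOP_K, HIDDEN = 256, 8, 7168

class AuxFreeRouter(torch.nn.Module):
    """Illustrative fine-grained router with aux-loss-free balancing."""
    def __init__(self):
        super().__init__()
        self.gate = torch.nn.Linear(HIDDEN, NUM_EXPERTS, bias=False)
        # Per-expert bias used ONLY for top-k selection, not for the weights.
        self.register_buffer("balance_bias", torch.zeros(NUM_EXPERTS))

    def forward(self, x):                        # x: [tokens, HIDDEN]
        scores = torch.sigmoid(self.gate(x))     # affinity per expert
        # Select experts on biased scores (balancing), weight on raw scores.
        _, idx = torch.topk(scores + self.balance_bias, TOP_K, dim=-1)
        weights = torch.gather(scores, -1, idx)
        weights = weights / weights.sum(-1, keepdim=True)
        return idx, weights                      # routed expert ids + mix weights

    @torch.no_grad()
    def update_bias(self, idx, lr=1e-3):
        # After each batch: lower the bias of overloaded experts,
        # raise it for underloaded ones — no auxiliary loss term.
        load = torch.bincount(idx.flatten(), minlength=NUM_EXPERTS).float()
        self.balance_bias -= lr * torch.sign(load - load.mean())
```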
2. DeepSeek-V3 Fine-Tuning Strategy
The cookbook's ranked options for fine-tuning a 671B model:
(a) LoRA on attention only — cheapest, medium quality (a config sketch follows this list)
- Weights (NF4) sharded: ~85 GB per GPU across 16 GPUs
- LoRA params: ~200M
- Training: 8-16×H100 + FSDP + expert parallelism (EP=8)
- 50K instructions, 1 epoch: ~8-12 hours on 16×H100
- Cost: $400-600
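A hedged sketch of what the attention-only LoRA setup could look like with Hugging Face PEFT. The module names follow the common HF DeepSeek-V3 MLA projection names and are an assumption here; verify them against the checkpoint you actually load.

```python
from peft import LoraConfig, get_peft_model

# Attention-only LoRA (option a). Module names below follow the usual
# HF DeepSeek-V3 MLA naming; confirm against your checkpoint before use.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_a_proj", "q_b_proj",             # low-rank query projections (MLA)
        "kv_a_proj_with_mqa", "kv_b_proj",  # compressed KV projections (MLA)
        "o_proj",                           # attention output projection
    ],
    task_type="CAUSAL_LM",
)

# model = ...  # 671B base loaded sharded (NF4 + ZeRO-3/FSDP + EP), see below
# model = get_peft_model(model, lora_cfg)
# model.print_trainable_parameters()  # should land in the ~200M range
```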
(b) LoRA on routed experts only — adapts expert specialization (a regex-based config sketch follows this list)
- LoRA target: mlp.experts.{0..255}.gate_proj (and similar expert projections)
- Trainable params: 800M-1B (256 experts × LoRA)
- Cost: roughly the same as (a)
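For option (b), PEFT also accepts a regex for target_modules, which avoids listing 256 expert indices by hand. Again a sketch: the mlp.experts.N.gate_proj path is taken from the bullet above and should be checked against the loaded model's module names.

```python
from peft import LoraConfig

# Routed-experts-only LoRA (option b): match every expert's gate projection
# via a regex instead of enumerating mlp.experts.0..255 by hand.
expert_lora_cfg = LoraConfig(
    r=8,            # keep the rank small: 256 experts × 61 layers add up fast
    lora_alpha=16,
    target_modules=r".*mlp\.experts\.\d+\.gate_proj",
    task_type="CAUSAL_LM",
)
# 256 experts × ~60 layers at r=8 already gives roughly 1B trainable params,
# consistent with the 800M-1B quoted above; add up/down_proj only if needed.
```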
(c) Full SFT — best quality, very expensive
- 64-128×H100 + FSDP + EP + PP
- 50K instructions: 1-2 days
- Cost: $5K-15K
Cookbook recommendation: (a) — attention-only LoRA is the most cost-effective.
```python
# === DeepSeek-V3 Expert Parallelism (EP) ===
# EP = distribute the experts across GPUs
# 256 experts / 8 EP = 32 experts per GPU

# Megatron-DeepSpeed parallelism layout
ep_size = 8                          # 256 experts / 8 = 32 per GPU
tp_size = 2                          # tensor parallel for attention
pp_size = 1                          # no pipeline parallelism
dp_size = 16 // (ep_size * tp_size)  # data parallel degree on 16 GPUs

# DeepSpeed config
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "cpu"},  # 37B active params → optimizer state on CPU
    },
    "moe": {
        "enabled": True,
        "ep_size": 8,
        "experts_per_layer": 256,
        "load_balancing_type": "aux_free",  # DeepSeek-V3 style
        "use_residual": True,               # shared expert
    },
}
```
DeepSeek-V3 EP config
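Assuming the config dict above is stored as ds_config, it can be handed straight to DeepSpeed's initializer. This is an illustrative call: model and train_params stand in for your own model and trainable-parameter list.

```python
import deepspeed

# Hand the ZeRO-3 + MoE config above to DeepSpeed.
# `model` and `train_params` are placeholders for your own setup.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=train_params,
    config=ds_config,
)
```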
✅ Deliverables
1) Read the DeepSeek-V3 paper — especially the MoE design section.
2) If you have cloud access, plan a 16-GPU mini fine-tune.
3) Next lesson: 5.4 — Qwen3-MoE + Llama-4-MoE Pattern.