Mixture of Experts (MoE): Sparse Activation Revolution — From Mixtral 8x7B to DeepSeek-V3
Mixture of Experts (MoE) architecture: sparse activation, expert routing (top-k gating), the Mixtral 8x7B (January 2024) open-source breakthrough, and the DeepSeek-V3 671B (December 2024) frontier model. Routing math (Shazeer 2017, "outrageously large neural networks"), auxiliary loss, load balancing. Compute-efficient frontier-scale models.
Şükrü Yusuf KAYA
75 min read
Advanced 🎭 MoE — the secret weapon of frontier models
DeepSeek-V3 (December 2024): 671 billion parameters, but only 37 billion active at inference. Memory holds 671B; compute runs 37B. "The best of the best." The mechanism: Mixture of Experts (MoE). It traces back to Shazeer's 2017 "outrageously large neural networks" paper, and Mixtral 8x7B (January 2024) popularized it in open source. GPT-4 (reportedly MoE) and Gemini 1.5 made it the frontier-model standard. After 75 minutes you will understand the mathematical anatomy of MoE, expert routing, and the production details of Mixtral and DeepSeek-V3.
Lesson Map (10 Sections)#
- Dense vs sparse models — why MoE
- Shazeer 2017 — original MoE paper
- Top-k gating — routing mechanism
- Expert FFN — each expert is small FFN
- Routing math — softmax → top-k → weighted sum
- Auxiliary loss — load balancing
- Mixtral 8x7B (January 2024) — open-source breakthrough
- DeepSeek-V3 671B — frontier scale MoE
- MoE inference — vLLM, dense-equivalent FLOPs
- For Turkish — practical implications
2-6. MoE Math#
2.1 Dense FFN recap#
Standard transformer block:
FFN(x) = down(silu(gate(x)) * up(x))
Dense: all parameters (all of d_ff) are active for every token.
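A minimal PyTorch sketch of that dense FFN (the default `d_ff = 4 * d_model` is an illustrative choice, not a fixed convention); the MoE layer in 2.6 reuses this class as its expert:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Dense SwiGLU FFN: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # illustrative default; real configs vary
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # Every token pays the full d_ff cost -- this is what MoE will sparsify
        return self.down(F.silu(self.gate(x)) * self.up(x))
```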
2.2 MoE intuition#
Replace single FFN with N experts (each own FFN):
MoE(x) = Σ_i gate(x)_i × expert_i(x)
The gate produces an N-dimensional probability distribution over the experts. Each expert is an independent FFN.
2.3 Top-k sparse activation#
Key idea: only the top-k experts are activated per token (typically k=2):
weights = softmax(linear(x))                       # [B, S, N]
top_k_weights, top_k_indices = topk(weights, k=2)  # only k experts kept
out = Σ_{i in top_k} top_k_weights_i × expert_i(x)
With N experts, only k of them compute per token → sparse activation.
Memory: N × d_ff (all experts loaded)
Compute: k × d_ff (only top-k active)
N=8, k=2: 8× the parameters of a single FFN but only 2× its compute. Put differently, compared to a dense FFN sized for the same compute (2 × d_ff), the MoE layer uses 4× the memory at equal FLOPs.
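As a sanity check on those ratios, a quick back-of-the-envelope calculation (the d_model and d_ff values are illustrative, roughly Mixtral-sized):

```python
# Parameters of one SwiGLU FFN: 3 matrices of d_model x d_ff
d_model, d_ff = 4096, 14336          # illustrative sizes
ffn_params = 3 * d_model * d_ff      # ~176M per expert

N, k = 8, 2
moe_memory  = N * ffn_params         # all experts live in memory
moe_compute = k * ffn_params         # only the top-k experts run

print(f"memory  vs 1 dense FFN: {moe_memory  / ffn_params:.0f}x")   # 8x
print(f"compute vs 1 dense FFN: {moe_compute / ffn_params:.0f}x")   # 2x
```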
2.4 Per-token routing#
Important: routing is per token. Different tokens in the same sequence can be routed to different experts.
Example: in the phrase 'kapıyı çal' ("knock on the door"):
- 'kapı' → experts 1, 3
- 'yı' → experts 2, 5
- 'çal' → experts 1, 7
The intuition is specialization: some experts learn grammar, some math, some code.
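A toy sketch of per-token routing. The gate here is randomly initialized, so the specific expert indices are arbitrary; the point is that each token gets its own top-k set:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_experts, k = 16, 8, 2
tokens = ["kapı", "yı", "çal"]                # the 3 tokens from the example
x = torch.randn(len(tokens), d_model)         # [seq, d_model]

gate = torch.nn.Linear(d_model, n_experts)
probs = F.softmax(gate(x), dim=-1)            # [seq, n_experts]
topk_w, topk_idx = probs.topk(k, dim=-1)      # per-token expert choice

for tok, idx, w in zip(tokens, topk_idx.tolist(), topk_w.tolist()):
    print(f"{tok!r} -> experts {idx} with weights {[round(v, 2) for v in w]}")
```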
2.5 Load balancing problem#
Naive routing: some experts end up always active, others never. Underutilization.
Fix: an auxiliary load-balancing loss (introduced by Shazeer 2017; the form below is the one popularized by the Switch Transformer):
L_aux = N × Σ_i fraction_i × probability_i
- fraction_i: the fraction of tokens routed to expert i
- probability_i: the average gate probability assigned to expert i
Minimizing this term pushes the router toward using the experts uniformly.
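A minimal sketch of that loss (assumes `probs` is the full softmax over experts and `topk_idx` holds each token's chosen experts, as in the routing code above):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(probs, topk_idx, n_experts):
    """Auxiliary loss: N * sum_i fraction_i * probability_i."""
    # probs:    [num_tokens, n_experts] -- full gate distribution
    # topk_idx: [num_tokens, k]         -- experts actually selected per token

    # fraction_i: share of tokens that routed one of their k slots to expert i
    dispatch = F.one_hot(topk_idx, n_experts).amax(dim=1).float()  # [num_tokens, n_experts]
    fraction = dispatch.mean(dim=0)                                # [n_experts]

    # probability_i: average gate probability the router assigns to expert i
    avg_prob = probs.mean(dim=0)                                   # [n_experts]

    return n_experts * (fraction * avg_prob).sum()
```

The loss is minimized when routing is uniform across experts; in training it is added to the language-modeling loss with a small weighting coefficient.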
2.6 Implementation outline#
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.n_experts = n_experts
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        # Each expert is an independent dense FFN (SwiGLUFFN from section 2.1)
        self.experts = nn.ModuleList([
            SwiGLUFFN(d_model) for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: [batch, seq, d_model]
        gates = F.softmax(self.gate(x), dim=-1)              # [B, S, N]
        top_k_gates, top_k_idx = gates.topk(self.k, dim=-1)  # [B, S, k]
        # Dispatch to top-k experts (simplified per-expert loop)
        out = torch.zeros_like(x)
        for i in range(self.n_experts):
            mask = (top_k_idx == i).any(dim=-1)              # tokens that chose expert i
            if mask.any():
                expert_out = self.experts[i](x[mask])
                # Weight by this expert's gate probability for each selected token
                gate_weight = (top_k_gates * (top_k_idx == i).float()).sum(-1)[mask]
                out[mask] += expert_out * gate_weight.unsqueeze(-1)
        return out
Production implementations such as Megatron-LM and DeepSpeed-MoE are far more optimized (expert parallelism and batched dispatch instead of a Python loop over experts).
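A quick shape-only smoke test of the sketch above (uses the `SwiGLUFFN` from 2.1; sizes are arbitrary):

```python
import torch

layer = MoELayer(d_model=64, n_experts=8, k=2)
x = torch.randn(2, 10, 64)     # [batch=2, seq=10, d_model=64]
y = layer(x)
print(y.shape)                 # torch.Size([2, 10, 64]) -- same shape, sparse compute
```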
7-10. Mixtral + DeepSeek-V3#
7.1 Mixtral 8x7B (Mistral AI, January 2024)#
The first major open-source MoE model.
Config:
- 8 experts per FFN layer
- top-2 routing
- 47B total params, 13B active per token
- Quality: comparable to Llama-2-70B with roughly 4x less compute
The name '8x7B' is misleading: it is not eight copies of a 7B base model; the total is 47B, with 8 experts only in the FFN layers (the arithmetic is sketched below).
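To see where 47B total / 13B active comes from, a rough count using the publicly released Mixtral 8x7B dimensions (d_model 4096, d_ff 14336, 32 layers, GQA with 8 KV heads, 32k vocab); norms and the tiny router are ignored, so treat the totals as approximate:

```python
# Rough Mixtral 8x7B parameter count (public config values; norms/router ignored)
d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
n_heads, n_kv_heads, head_dim = 32, 8, 128

expert_ffn = 3 * d_model * d_ff                          # one SwiGLU expert ~176M
attn = 2 * d_model * (n_heads * head_dim) \
     + 2 * d_model * (n_kv_heads * head_dim)             # Q,O + K,V (GQA) ~42M
embed = 2 * vocab * d_model                              # input + output embeddings

total  = n_layers * (8 * expert_ffn + attn) + embed      # all 8 experts counted
active = n_layers * (2 * expert_ffn + attn) + embed      # only top-2 experts run

print(f"total  ~ {total / 1e9:.1f}B")                    # ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")                   # ~ 12.9B
```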
7.2 Mixtral architecture#
for each transformer layer:
    Attention (dense)
    MoE FFN (8 experts, top-2)
    RMSNorm + residual
Attention stays dense (no sparse routing); only the FFN is sparse.
7.3 DeepSeek-V3 (December 2024)#
DeepSeek AI: a 671B-parameter MoE with 37B active per token.
Config:
- 256 routed experts per FFN layer
- top-8 routing
- 1 shared expert (always active)
- 671B total, 37B active per token
Finer-grained: with 256 smaller experts, the model can specialize more.
7.4 DeepSeek-V3 quality#
Benchmarks:
- MMLU: 88.5%
- MATH: 90.2%
- Coding (HumanEval): 82.3%
- Better than Llama-3-70B, comparable to GPT-4o
- Training cost: reported at ~$5.6M of GPU compute (extraordinarily efficient)
7.5 MoE inference#
Key point: inference cost follows the 'active params', not the total parameter count.
Memory: full model loaded (671B for DeepSeek-V3, 47B for Mixtral).
Compute: dense-equivalent of active params.
DeepSeek-V3 inference: the compute of a 37B dense model with the memory footprint of the full 671B (and quality far beyond any 37B dense model).
Serving math:
- Mixtral 47B model size: ~94 GB (bf16) → 2x A100 80GB
- Inference compute is equivalent to a 13B dense model → much faster than a 47B dense model (worked out in the snippet below)
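The same serving math as a tiny helper (weight memory only, bf16 at 2 bytes/param; KV cache, activations, and runtime overhead are ignored):

```python
import math

def serving_footprint(total_params_b, bytes_per_param=2, gpu_mem_gb=80):
    """Weight-only memory footprint; ignores KV cache, activations, overhead."""
    weights_gb = total_params_b * bytes_per_param   # billions of params x bytes = GB
    return weights_gb, math.ceil(weights_gb / gpu_mem_gb)

for name, total_b, active_b in [("Mixtral 8x7B", 47, 13), ("DeepSeek-V3", 671, 37)]:
    gb, gpus = serving_footprint(total_b)
    print(f"{name}: ~{gb:.0f} GB bf16 weights -> >= {gpus} x 80GB GPUs, "
          f"compute ~ {active_b}B dense-equivalent")
# Note: DeepSeek-V3 weights are distributed in FP8, which roughly halves the footprint.
```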
7.6 MoE for Turkish#
Mixtral's Turkish quality: decent (roughly Llama-3-8B level).
DeepSeek-V3 in Turkish: strong reasoning and math.
Self-hosting: expensive (multi-GPU).
API: DeepSeek-V3 at ~$0.27/1M tokens (cheap thanks to MoE efficiency).
7.7 Future MoE#
Trend: more and smaller experts (256+ in DeepSeek-V3 vs 8 in Mixtral).
- Better specialization
- More flexible routing
- Frontier model standard
🎉 Module 18 Completed — MoE
Mixture of Experts: the sparse activation revolution. N experts, top-k routing (typically k=2-8). Shazeer 2017 laid the foundation. Mixtral 8x7B (January 2024) was the open-source breakthrough; DeepSeek-V3 671B (December 2024) is the frontier: 37B active, ~$5.6M reported training cost (extraordinary efficiency), GPT-4o-comparable quality in open source. For Turkish: the DeepSeek-V3 API offers cheap reasoning. Module 18 inventory: 1 lesson, 75 min. Overall curriculum: 19 modules, 90 lessons, ~98.5 hours.
Module 18 Inventory (Completed)#
| # | Lesson | Duration |
|---|---|---|
| 18.1 | MoE: Mixtral + DeepSeek-V3 | 75 min |
| Total | 1 lesson | 75 min |
Frequently Asked Questions
Are all experts needed at inference? Yes: all experts must be loaded, because which expert fires is decided per token at runtime. Full memory footprint, but compute only on the active experts.