
Mixture of Experts (MoE): The Sparse Activation Revolution, from Mixtral 8x7B to DeepSeek-V3

The Mixture of Experts (MoE) architecture: sparse activation, expert routing (top-k gating), the Mixtral 8x7B open-source breakthrough (January 2024), and the DeepSeek-V3 671B frontier model (December 2024). Covers the routing math (Shazeer 2017, 'outrageously large neural networks'), the auxiliary loss, load balancing, and memory-efficient frontier-scale models.

Şükrü Yusuf KAYA
75 minute read
Advanced
🎭 MoE: the hidden weapon of frontier models
DeepSeek-V3 (December 2024): 671 billion parameters, yet only 37 billion are active at inference. Memory scales with 671B, compute with 37B: the best of both worlds. The mechanism is Mixture of Experts (MoE), going back to Shazeer et al.'s 2017 'outrageously large neural networks' paper. Mixtral 8x7B (January 2024) brought it to open source; GPT-4 (widely believed to be MoE) and Gemini 1.5 made it a frontier-model standard. In 75 minutes you will have grasped the mathematical anatomy of MoE, expert routing, and the production details of Mixtral and DeepSeek-V3.

Lesson Map (10 Sections)#

  1. Dense vs. sparse models: why MoE
  2. Shazeer 2017: the original MoE paper
  3. Top-k gating: the routing mechanism
  4. Expert FFN: each expert is a small FFN
  5. Routing math: softmax → top-k → weighted sum
  6. Auxiliary loss: load balancing
  7. Mixtral 8x7B (January 2024): the open-source breakthrough
  8. DeepSeek-V3 671B: frontier-scale MoE
  9. MoE inference: vLLM, dense-equivalent FLOPs
  10. For Turkish: practical implications

2-6. MoE Math#

2.1 Dense FFN recap#

Standard transformer block:
FFN(x) = down(silu(gate(x)) * up(x))
Dense: all FFN parameters (the full d_ff) are active for every token.
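A minimal PyTorch sketch of this dense block (the SwiGLUFFN class name and the 4×d_model default for d_ff are our illustrative choices; this same class is reused in the MoE code in section 2.6):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # illustrative default
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # All d_ff parameters participate for every token: "dense"
        return self.down(F.silu(self.gate(x)) * self.up(x))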

2.2 MoE intuition#

Replace the single FFN with N experts, each its own FFN:
MoE(x) = Σ_i gate(x)_i × expert_i(x)
The gate produces an N-dimensional probability distribution over the experts; each expert is an independent FFN.

2.3 Top-k sparse activation#

Key point: only the top-k experts are activated per token (typically k=2):

weights = softmax(linear(x))                       # [B, S, N]
top_k_weights, top_k_indices = topk(weights, k=2)  # keep only k experts
out = Σ_{i in top_k} top_k_weights_i × expert_i(x)

With N experts, only k are computed per token: sparse activation.
Memory: N × d_ff (all experts loaded). Compute: k × d_ff (only the top-k run).
N=8, k=2: 4× the memory (N/k) of a compute-matched dense FFN, at the same per-token compute.
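A quick back-of-the-envelope check of that ratio (plain arithmetic; the variable names are ours):

# Memory vs. compute for an MoE FFN, relative to a dense FFN
# sized to match the k active experts (per token, per layer).
N, k = 8, 2                 # Mixtral-style configuration
memory_ratio = N / k        # all N experts stored vs. k experts' worth used
compute_ratio = k / k       # only k experts actually run
print(memory_ratio, compute_ratio)   # 4.0 1.0 -> 4x memory, 1x compute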

2.4 Per-token routing#

Important: routing is per-token. Different tokens within the same sequence can be sent to different experts.
Example: for the phrase 'kapıyı çal' ('knock on the door'):
  • 'kapı' → experts 1, 3
  • 'yı' → experts 2, 5
  • 'çal' → experts 1, 7
The intuition is specialization: some experts learn grammar, some math, some code.
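A toy demonstration of per-token routing. The router here is random noise standing in for a trained gate, so only the mechanics carry over: each token picks its own pair of experts.

import torch

torch.manual_seed(0)
n_tokens, n_experts, k = 3, 8, 2
router_logits = torch.randn(n_tokens, n_experts)   # stand-in for gate(x)
weights = router_logits.softmax(dim=-1)
top_k_weights, top_k_idx = weights.topk(k, dim=-1)
for t in range(n_tokens):
    # Different tokens land on different expert pairs
    print(f"token {t}: experts {top_k_idx[t].tolist()}, "
          f"weights {[round(w, 2) for w in top_k_weights[t].tolist()]}")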

2.5 Load balancing problem#

Naive routing collapses: a few experts receive nearly all the traffic while others are never used (underutilization).
Fix: an auxiliary load-balancing loss. The idea dates to Shazeer 2017; the form below is the Switch Transformer variant (Fedus et al., 2021):
L_aux = N × Σ_i fraction_i × probability_i
  • fraction_i: the fraction of routing assignments that went to expert i
  • probability_i: the mean gate probability assigned to expert i
Minimizing this pushes the experts toward uniform utilization.
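A minimal sketch of this loss in PyTorch (the function name is ours; production routers compute it per layer and add it to the LM loss with a small coefficient, typically around 0.01):

import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_idx, n_experts):
    # router_probs: [tokens, n_experts] softmax output of the gate
    # expert_idx:   [tokens, k] experts each token was dispatched to
    one_hot = F.one_hot(expert_idx, n_experts).float()
    # fraction_i: share of all routing assignments that went to expert i
    fraction = one_hot.sum(dim=(0, 1)) / one_hot.sum()
    # probability_i: mean gate probability assigned to expert i
    probability = router_probs.mean(dim=0)
    # Minimized when both are uniform at 1/N, giving L_aux = 1
    return n_experts * (fraction * probability).sum()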

2.6 Implementation outline#

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.n_experts = n_experts
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        # Each expert is an independent FFN (SwiGLUFFN as sketched in 2.1)
        self.experts = nn.ModuleList([
            SwiGLUFFN(d_model) for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: [batch, seq, d_model]
        gates = F.softmax(self.gate(x), dim=-1)
        # Note: some implementations (Mixtral included) renormalize
        # the top-k weights to sum to 1; this sketch does not.
        top_k_gates, top_k_idx = gates.topk(self.k, dim=-1)
        # Dispatch to top-k experts (simplified loop; real kernels batch this)
        out = torch.zeros_like(x)
        for i in range(self.n_experts):
            # Tokens that selected expert i among their top-k
            mask = (top_k_idx == i).any(dim=-1)
            if mask.any():
                expert_out = self.experts[i](x[mask])
                # Weight each token's expert output by its gate probability
                gate_weight = (top_k_gates * (top_k_idx == i).float()).sum(-1)[mask]
                out[mask] += expert_out * gate_weight.unsqueeze(-1)
        return out
Production implementations (Megatron-LM, DeepSpeed-MoE) are far more optimized, replacing the Python loop with batched dispatch and expert parallelism.
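A quick smoke test of the sketch above, continuing from that block (shapes chosen arbitrarily):

moe = MoELayer(d_model=512, n_experts=8, k=2)
x = torch.randn(2, 16, 512)   # [batch, seq, d_model]
y = moe(x)
print(y.shape)                # torch.Size([2, 16, 512]), same shape as input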

7-10. Mixtral + DeepSeek-V3#

7.1 Mixtral 8x7B (Mistral AI, January 2024)#

The first major open-source MoE model. Configuration:
  • 8 experts per FFN layer
  • top-2 routing
  • 47B total params, 13B active per token
  • Quality: comparable to Llama-2-70B at roughly 4x less compute
The name '8x7B' is misleading: the model is not eight copies of a 7B base. Attention and embeddings are shared across experts, so the total is 47B rather than 56B.
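Where '47B total, 13B active' comes from, as rough arithmetic from Mixtral's published hyperparameters (norms and the tiny router weights are ignored, so totals are approximate):

# Mixtral 8x7B: d_model=4096, d_ff=14336, 32 layers, 8 experts, top-2,
# grouped-query attention with 8 KV heads of dim 128, 32k vocabulary.
d_model, d_ff, n_layers, n_experts, k = 4096, 14336, 32, 8, 2
head_dim, n_kv_heads = 128, 8

expert = 3 * d_model * d_ff                      # gate + up + down projections
attn = n_layers * (2 * d_model * d_model         # q and o projections
                   + 2 * d_model * n_kv_heads * head_dim)  # k and v (GQA)
embed = 2 * 32000 * d_model                      # input + output embeddings

total = n_layers * n_experts * expert + attn + embed
active = n_layers * k * expert + attn + embed
print(f"total ~{total / 1e9:.1f}B, active ~{active / 1e9:.1f}B")  # ~46.7B, ~12.9B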

7.2 Mixtral architecture#

for each transformer layer:
    Attention (dense)
    MoE FFN (8 experts, top-2)
    RMSNorm + residual

Attention stays dense (no sparse routing); only the FFN is sparse.
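A sketch of that layer layout, reusing the MoELayer from section 2.6. nn.MultiheadAttention stands in for Mixtral's real attention (the causal mask is omitted for brevity), and nn.RMSNorm needs PyTorch >= 2.4:

import torch.nn as nn

class MixtralStyleBlock(nn.Module):
    def __init__(self, d_model, n_experts=8, k=2):
        super().__init__()
        self.attn_norm = nn.RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.ffn_norm = nn.RMSNorm(d_model)
        self.moe = MoELayer(d_model, n_experts, k)   # sparse: top-2 of 8 experts

    def forward(self, x):
        # Pre-norm residual blocks: dense attention, then sparse MoE FFN
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.moe(self.ffn_norm(x))
        return x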

7.3 DeepSeek-V3 (December 2024)#

DeepSeek AI: a 671B-parameter MoE with 37B active per token.
Config:
  • 256 experts per FFN layer
  • top-8 routing
  • 1 shared expert (always active)
  • 671B total, 37B active per token
Finer-grained: with 256 smaller experts, routing is finer-grained and allows more specialization.
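A sketch of the shared-expert idea, composing the earlier building blocks (the class name is ours; the real DeepSeek-V3 router adds further refinements, e.g. an auxiliary-loss-free, bias-based balancing scheme):

import torch.nn as nn

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model, n_routed=256, k=8):
        super().__init__()
        self.shared = SwiGLUFFN(d_model)              # always active, every token
        self.routed = MoELayer(d_model, n_routed, k)  # top-8 of 256 routed experts

    def forward(self, x):
        # Shared expert output plus the weighted top-k routed outputs
        return self.shared(x) + self.routed(x)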

7.4 DeepSeek-V3 quality#

Benchmarks:
  • MMLU: 88.5%
  • MATH: 90.2%
  • Coding (HumanEval): 82.3%
  • Better than Llama-3-70B; comparable to GPT-4o
  • Training cost: $5.6M (extraordinarily efficient)

7.5 MoE inference#

The key point: inference cost tracks the active parameters. Memory: the full model must be loaded (671B for DeepSeek-V3, 47B for Mixtral). Compute: the dense-equivalent FLOPs of the active parameters.
DeepSeek-V3 inference: the compute of a 37B dense model with the memory footprint of 671B, and quality well beyond what a 37B dense model achieves.
Serving math:
  • Mixtral 47B model size: ~90 GB (bf16) → 2x A100 80GB
  • Inference compute equivalent to 13B dense → much faster than 47B dense
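Those serving numbers follow from simple bytes-per-parameter arithmetic (weights only; KV cache and activations come on top):

# bf16 stores 2 bytes per parameter
for name, params in [("Mixtral 8x7B", 47e9), ("DeepSeek-V3", 671e9)]:
    print(f"{name}: ~{params * 2 / 1e9:.0f} GB of weights")
# Mixtral 8x7B: ~94 GB   -> fits on 2x A100 80GB
# DeepSeek-V3:  ~1342 GB -> multi-node serving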

7.6 MoE for Turkish#

Mixtral's Turkish quality is decent (roughly Llama-3-8B level). DeepSeek-V3's Turkish is strong, particularly for reasoning and math. Self-hosting is expensive (multi-GPU); via API, DeepSeek-V3 costs about $0.27 per 1M tokens, cheap thanks to MoE efficiency.

7.7 The future of MoE#

The trend: more and smaller experts (256+ in DeepSeek-V3 vs. 8 in Mixtral).
  • Better specialization
  • More flexible routing
  • Frontier model standard
🎉 Module 18 Complete: MoE
Mixture of Experts: the sparse activation revolution. N experts with top-k routing (typically k=2-8). Shazeer 2017 laid the foundation; Mixtral 8x7B (January 2024) was the open-source breakthrough; DeepSeek-V3 671B (December 2024) is the frontier example, with 37B active parameters and a ~$5.6M training cost (extraordinary efficiency). Quality: GPT-4o-comparable, in open source. For Turkish: the DeepSeek-V3 API offers cheap reasoning. Module 18 inventory: 1 lesson, 75 min. Overall curriculum: 19 modules, 90 lessons, ~98.5 hours.

Module 18 Inventory (Complete)#

#     | Lesson                     | Duration
18.1  | MoE: Mixtral + DeepSeek-V3 | 75 min
Total | 1 lesson                   | 75 min

Frequently Asked Questions

Q: Are all experts needed at inference?
A: Yes. Which experts become active is decided per token at runtime, so the full model must sit in memory; compute, however, runs only through the active experts.
