Mixture of Experts (MoE): Sparse Activation Revolution — From Mixtral 8x7B to DeepSeek-V3
Mixture of Experts (MoE) architecture: sparse activation, expert routing (top-k gating), the Mixtral 8x7B (January 2024) open-source breakthrough, and the DeepSeek-V3 671B (December 2024) frontier model. Routing math (Shazeer 2017, "outrageously large neural networks"), auxiliary loss, load balancing. Compute-efficient frontier-scale models.
Şükrü Yusuf KAYA
75 min read
Advanced 🎭 MoE — the secret weapon of frontier models
DeepSeek-V3 (December 2024): 671 billion parameters, but only 37 billion active at inference. Memory holds 671B; compute runs 37B. "The best of the best." The mechanism: Mixture of Experts (MoE). It traces back to Shazeer's 2017 "outrageously large neural networks" paper, and Mixtral 8x7B (January 2024) popularized it in open source. GPT-4 (reportedly MoE) and Gemini 1.5 made it the frontier-model standard. After 75 minutes you will understand the mathematical anatomy of MoE, expert routing, and the production details of Mixtral and DeepSeek-V3.
Lesson Map (10 Sections)#
- Dense vs sparse models — why MoE
- Shazeer 2017 — original MoE paper
- Top-k gating — routing mechanism
- Expert FFN — each expert is small FFN
- Routing math — softmax → top-k → weighted sum
- Auxiliary loss — load balancing
- Mixtral 8x7B (January 2024) — open-source breakthrough
- DeepSeek-V3 671B — frontier scale MoE
- MoE inference — vLLM, dense-equivalent FLOPs
- For Turkish — practical implications
2-6. MoE Math#
2.1 Dense FFN recap#
Standard transformer block:
FFN(x) = down(silu(gate(x)) * up(x))
Dense: all parameters (all of d_ff) are active for every token.
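A minimal PyTorch sketch of that dense FFN (the default `d_ff = 4 * d_model` is an illustrative choice, not a fixed convention); the MoE layer in 2.6 reuses this class as its expert:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Dense SwiGLU FFN: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # illustrative default; real configs vary
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # Every token pays the full d_ff cost -- this is what MoE will sparsify
        return self.down(F.silu(self.gate(x)) * self.up(x))
```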
2.2 MoE intuition#
Replace single FFN with N experts (each own FFN):
MoE(x) = Σ_i gate(x)_i × expert_i(x)
The gate produces an N-dimensional probability distribution over the experts. Each expert is an independent FFN.
2.3 Top-k sparse activation#
Key idea: only the top-k experts are activated per token (typically k=2):
weights = softmax(linear(x))                       # [B, S, N]
top_k_weights, top_k_indices = topk(weights, k=2)  # only k experts kept
out = Σ_{i in top_k} top_k_weights_i × expert_i(x)
With N experts, only k of them compute per token → sparse activation.
Memory: N × d_ff (all experts loaded)
Compute: k × d_ff (only top-k active)
N=8, k=2: 8× the parameters of a single FFN but only 2× its compute. Put differently, compared to a dense FFN sized for the same compute (2 × d_ff), the MoE layer uses 4× the memory at equal FLOPs.
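As a sanity check on those ratios, a quick back-of-the-envelope calculation (the d_model and d_ff values are illustrative, roughly Mixtral-sized):

```python
# Parameters of one SwiGLU FFN: 3 matrices of d_model x d_ff
d_model, d_ff = 4096, 14336          # illustrative sizes
ffn_params = 3 * d_model * d_ff      # ~176M per expert

N, k = 8, 2
moe_memory  = N * ffn_params         # all experts live in memory
moe_compute = k * ffn_params         # only the top-k experts run

print(f"memory  vs 1 dense FFN: {moe_memory  / ffn_params:.0f}x")   # 8x
print(f"compute vs 1 dense FFN: {moe_compute / ffn_params:.0f}x")   # 2x
```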
2.4 Per-token routing#
Important: routing is per token. Different tokens in the same sequence can be routed to different experts.
Example: in the phrase 'kapıyı çal' ("knock on the door"):
- 'kapı' → experts 1, 3
- 'yı' → experts 2, 5
- 'çal' → experts 1, 7
The intuition is specialization: some experts learn grammar, some math, some code.
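A toy sketch of per-token routing. The gate here is randomly initialized, so the specific expert indices are arbitrary; the point is that each token gets its own top-k set:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_experts, k = 16, 8, 2
tokens = ["kapı", "yı", "çal"]                # the 3 tokens from the example
x = torch.randn(len(tokens), d_model)         # [seq, d_model]

gate = torch.nn.Linear(d_model, n_experts)
probs = F.softmax(gate(x), dim=-1)            # [seq, n_experts]
topk_w, topk_idx = probs.topk(k, dim=-1)      # per-token expert choice

for tok, idx, w in zip(tokens, topk_idx.tolist(), topk_w.tolist()):
    print(f"{tok!r} -> experts {idx} with weights {[round(v, 2) for v in w]}")
```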
2.5 Load balancing problem#
Naive routing: some experts end up always active, others never. Underutilization.
Fix: an auxiliary load-balancing loss (introduced by Shazeer 2017; the form below is the one popularized by the Switch Transformer):
L_aux = N × Σ_i fraction_i × probability_i
- fraction_i: the fraction of tokens routed to expert i
- probability_i: the average gate probability assigned to expert i
Minimizing this term pushes the router toward using the experts uniformly.
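A minimal sketch of that loss (assumes `probs` is the full softmax over experts and `topk_idx` holds each token's chosen experts, as in the routing code above):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(probs, topk_idx, n_experts):
    """Auxiliary loss: N * sum_i fraction_i * probability_i."""
    # probs:    [num_tokens, n_experts] -- full gate distribution
    # topk_idx: [num_tokens, k]         -- experts actually selected per token

    # fraction_i: share of tokens that routed one of their k slots to expert i
    dispatch = F.one_hot(topk_idx, n_experts).amax(dim=1).float()  # [num_tokens, n_experts]
    fraction = dispatch.mean(dim=0)                                # [n_experts]

    # probability_i: average gate probability the router assigns to expert i
    avg_prob = probs.mean(dim=0)                                   # [n_experts]

    return n_experts * (fraction * avg_prob).sum()
```

The loss is minimized when routing is uniform across experts; in training it is added to the language-modeling loss with a small weighting coefficient.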
2.6 Implementation outline#
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.n_experts = n_experts
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        # Each expert is an independent dense FFN (SwiGLUFFN from section 2.1)
        self.experts = nn.ModuleList([
            SwiGLUFFN(d_model) for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: [batch, seq, d_model]
        gates = F.softmax(self.gate(x), dim=-1)              # [B, S, N]
        top_k_gates, top_k_idx = gates.topk(self.k, dim=-1)  # [B, S, k]
        # Dispatch to top-k experts (simplified per-expert loop)
        out = torch.zeros_like(x)
        for i in range(self.n_experts):
            mask = (top_k_idx == i).any(dim=-1)              # tokens that chose expert i
            if mask.any():
                expert_out = self.experts[i](x[mask])
                # Weight by this expert's gate probability for each selected token
                gate_weight = (top_k_gates * (top_k_idx == i).float()).sum(-1)[mask]
                out[mask] += expert_out * gate_weight.unsqueeze(-1)
        return out
Production implementations such as Megatron-LM and DeepSpeed-MoE are far more optimized (expert parallelism and batched dispatch instead of a Python loop over experts).
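A quick shape-only smoke test of the sketch above (uses the `SwiGLUFFN` from 2.1; sizes are arbitrary):

```python
import torch

layer = MoELayer(d_model=64, n_experts=8, k=2)
x = torch.randn(2, 10, 64)     # [batch=2, seq=10, d_model=64]
y = layer(x)
print(y.shape)                 # torch.Size([2, 10, 64]) -- same shape, sparse compute
```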
7-10. Mixtral + DeepSeek-V3#
7.1 Mixtral 8x7B (Mistral AI, January 2024)#
The first major open-source MoE model.
Config:
- 8 experts per FFN layer
- top-2 routing
- 47B total params, 13B active per token
- Quality: comparable to Llama-2-70B with roughly 4x less compute
The name '8x7B' is misleading: it is not eight copies of a 7B base model; the total is 47B, with 8 experts only in the FFN layers (the arithmetic is sketched below).
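To see where 47B total / 13B active comes from, a rough count using the publicly released Mixtral 8x7B dimensions (d_model 4096, d_ff 14336, 32 layers, GQA with 8 KV heads, 32k vocab); norms and the tiny router are ignored, so treat the totals as approximate:

```python
# Rough Mixtral 8x7B parameter count (public config values; norms/router ignored)
d_model, d_ff, n_layers, vocab = 4096, 14336, 32, 32000
n_heads, n_kv_heads, head_dim = 32, 8, 128

expert_ffn = 3 * d_model * d_ff                          # one SwiGLU expert ~176M
attn = 2 * d_model * (n_heads * head_dim) \
     + 2 * d_model * (n_kv_heads * head_dim)             # Q,O + K,V (GQA) ~42M
embed = 2 * vocab * d_model                              # input + output embeddings

total  = n_layers * (8 * expert_ffn + attn) + embed      # all 8 experts counted
active = n_layers * (2 * expert_ffn + attn) + embed      # only top-2 experts run

print(f"total  ~ {total / 1e9:.1f}B")                    # ~ 46.7B
print(f"active ~ {active / 1e9:.1f}B")                   # ~ 12.9B
```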
7.2 Mixtral architecture#
for each transformer layer:
    Attention (dense)
    MoE FFN (8 experts, top-2)
    RMSNorm + residual
Attention stays dense (no sparse routing); only the FFN is sparse.
7.3 DeepSeek-V3 (December 2024)#
DeepSeek AI: a 671B-parameter MoE with 37B active per token.
Config:
- 256 routed experts per FFN layer
- top-8 routing
- 1 shared expert (always active)
- 671B total, 37B active per token
Finer-grained: with 256 smaller experts, the model can specialize more.
7.4 DeepSeek-V3 quality#
Benchmarks:
- MMLU: 88.5%
- MATH: 90.2%
- Coding (HumanEval): 82.3%
- Better than Llama-3-70B, comparable to GPT-4o
- Training cost: reported at ~$5.6M of GPU compute (extraordinarily efficient)
7.5 MoE inference#
Key point: inference cost follows the 'active params', not the total parameter count.
Memory: full model loaded (671B for DeepSeek-V3, 47B for Mixtral).
Compute: dense-equivalent of active params.
DeepSeek-V3 inference: the compute of a 37B dense model with the memory footprint of the full 671B (and quality far beyond any 37B dense model).
Serving math:
- Mixtral 47B model size: ~94 GB (bf16) → 2x A100 80GB
- Inference compute is equivalent to a 13B dense model → much faster than a 47B dense model (worked out in the snippet below)
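The same serving math as a tiny helper (weight memory only, bf16 at 2 bytes/param; KV cache, activations, and runtime overhead are ignored):

```python
import math

def serving_footprint(total_params_b, bytes_per_param=2, gpu_mem_gb=80):
    """Weight-only memory footprint; ignores KV cache, activations, overhead."""
    weights_gb = total_params_b * bytes_per_param   # billions of params x bytes = GB
    return weights_gb, math.ceil(weights_gb / gpu_mem_gb)

for name, total_b, active_b in [("Mixtral 8x7B", 47, 13), ("DeepSeek-V3", 671, 37)]:
    gb, gpus = serving_footprint(total_b)
    print(f"{name}: ~{gb:.0f} GB bf16 weights -> >= {gpus} x 80GB GPUs, "
          f"compute ~ {active_b}B dense-equivalent")
# Note: DeepSeek-V3 weights are distributed in FP8, which roughly halves the footprint.
```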
7.6 MoE for Turkish#
Mixtral's Turkish quality: decent (roughly Llama-3-8B level).
DeepSeek-V3 in Turkish: strong reasoning and math.
Self-hosting: expensive (multi-GPU).
API: DeepSeek-V3 at ~$0.27/1M tokens (cheap thanks to MoE efficiency).
7.7 Future MoE#
Trend: more and smaller experts (256+ in DeepSeek-V3 vs 8 in Mixtral).
- Better specialization
- More flexible routing
- Frontier model standard
🎉 Module 18 Completed — MoE
Mixture of Experts: the sparse activation revolution. N experts, top-k routing (typically k=2-8). Shazeer 2017 laid the foundation. Mixtral 8x7B (January 2024) was the open-source breakthrough; DeepSeek-V3 671B (December 2024) is the frontier: 37B active, ~$5.6M reported training cost (extraordinary efficiency), GPT-4o-comparable quality in open source. For Turkish: the DeepSeek-V3 API offers cheap reasoning. Module 18 inventory: 1 lesson, 75 min. Overall curriculum: 19 modules, 90 lessons, ~98.5 hours.
Module 18 Inventory (Completed)#
| # | Lesson | Duration |
|---|---|---|
| 18.1 | MoE: Mixtral + DeepSeek-V3 | 75 min |
| Total | 1 lesson | 75 min |
Frequently Asked Questions
Are all experts needed at inference? Yes: all experts must be loaded, because which expert fires is decided per token at runtime. Full memory footprint, but compute only on the active experts.