MoE Quantization & Inference: Expert Offload + Dynamic Routing Under Quant

MoE inference differs from dense inference: some experts are "cold" (rarely used) and can be offloaded to CPU RAM or disk. Topics: how dynamic routing interacts with quantization (the router's tolerance to quantization), MoE-specific vLLM tuning, and Mixtral AWQ with sparse expert loading. The running example is Mixtral 8×7B served on an RTX 4090 (~140 tok/s).

Şükrü Yusuf KAYA

28 min read

5/14/2026

Advanced

1. Running Mixtral 8×7B on an RTX 4090

Mixtral 8×7B has 46.7B total parameters. Measured against the RTX 4090's 24 GB of VRAM:

- bf16: 93 GB → does not fit on a 4090
- AWQ int4: 24 GB → fits on a 4090, but only marginally
- GGUF Q4_K_M: 26 GB → does not fit on a 4090 without offload
- Expert CPU offload (cold experts moved to RAM): 12-18 GB GPU + 64 GB RAM
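The bf16 and int4 figures follow from parameter count times bits per weight. A minimal sketch of that arithmetic, counting weights only (no KV cache, activations, or quantization metadata); the function name `weight_gb` is illustrative:

```python
# Rough weight-only memory estimate: parameters x bits per weight.
TOTAL_PARAMS = 46.7e9  # Mixtral 8x7B total parameter count (from the text)


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes), ignoring KV cache and activations."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(TOTAL_PARAMS, 16)
int4_gb = weight_gb(TOTAL_PARAMS, 4)

print(f"bf16 weights: {bf16_gb:.2f} GB")
print(f"int4 weights (no scales/zero-points): {int4_gb:.2f} GB")
```

The raw int4 figure comes out slightly under the 24 GB quoted above. The gap is plausibly the group-wise scales and zero-points that AWQ stores, plus any layers left unquantized, though the exact breakdown depends on the checkpoint.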

# Mixtral AWQ with vLLM
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
    --quantization awq \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096
# Uses about 22 GB of VRAM

# Throughput (RTX 4090):
# - Single user: 95 tok/s
# - batch=8: 380 tok/s
# - batch=32: OOM (capacity limit)
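With the weights already taking most of the card, the KV cache is what limits batch size. The sketch below estimates the worst-case KV cache size, using the layer count, KV-head count, and head dimension from the published Mixtral 8×7B config. The 2-byte cache dtype is an assumption about the serving setup.

```python
# KV-cache size estimate for Mixtral 8x7B.
NUM_LAYERS = 32
NUM_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128        # hidden size 4096 / 32 attention heads
BYTES_PER_VALUE = 2   # assumption: fp16/bf16 KV cache


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for batch in (1, 8, 32):
    gib = kv_cache_bytes(batch * 4096) / 2**30
    print(f"batch={batch}: {gib:.1f} GiB if every sequence reaches 4096 tokens")
# prints 0.5, 4.0 and 16.0 GiB respectively
```

This is an upper bound: real requests are usually shorter than the maximum context, so moderate batches can still fit. At batch=32 the worst case is far larger than what remains after loading roughly 22 GB of weights.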

2. Cold Expert Offload

In practice, expert usage in Mixtral 8×7B is uneven: some experts are used about 30% of the time, others about 5%. Keeping every expert on the GPU is unnecessary.
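Usage skew of this kind can be measured by counting how often each expert lands in the router's top-2. The sketch below does so on synthetic router logits: the bias values are made up for illustration and stand in for a real router trace. With Hugging Face transformers, real logits may be obtainable through the model's `output_router_logits` option, which is not shown here.

```python
import random
from collections import Counter

NUM_EXPERTS = 8
TOP_K = 2


def top_k_experts(logits, k=TOP_K):
    """Indices of the k largest router logits, largest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def usage_share(router_logits_per_token):
    """Fraction of tokens routed to each expert (shares sum to TOP_K)."""
    counts = Counter()
    for logits in router_logits_per_token:
        counts.update(top_k_experts(logits))
    n = len(router_logits_per_token)
    return {e: counts[e] / n for e in range(NUM_EXPERTS)}


# Synthetic trace: a fixed per-expert bias plus Gaussian noise (hypothetical values).
rng = random.Random(0)
bias = [1.0, 0.6, 0.3, 0.0, 0.0, -0.3, -0.6, -1.0]
trace = [[b + rng.gauss(0, 1) for b in bias] for _ in range(10_000)]

shares = usage_share(trace)
for expert, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"expert {expert}: {share:.1%} of tokens")
```

Because each token picks two of eight experts, the shares sum to 2 and the uniform baseline is 25% per expert. Experts well below that baseline are the candidates for offload.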

Strategy:

- Top-2 hot experts (the 2 most frequently used experts in each layer) → GPU
- Cold experts → CPU RAM
- When a cold expert is called, page it into the GPU, then move it back out afterwards
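The hot/cold policy above can be sketched as a small cache: a pinned set of hot experts plus an LRU of cold experts that have been paged in. This is a toy bookkeeping model, not the implementation of any particular library. The class name `ExpertCache` and its parameters are hypothetical, and the actual host-to-device copy is only marked by a comment.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy per-layer cache: pinned hot experts plus an LRU of paged-in cold experts."""

    def __init__(self, hot_experts, lru_slots):
        assert lru_slots >= 1, "need at least one slot for cold experts"
        self.hot = set(hot_experts)   # always resident on the GPU
        self.lru = OrderedDict()      # cold experts currently on the GPU
        self.lru_slots = lru_slots
        self.hits = 0
        self.misses = 0

    def fetch(self, expert_id):
        """Record an access; return True if the expert was already on the GPU."""
        if expert_id in self.hot:
            self.hits += 1
            return True
        if expert_id in self.lru:
            self.lru.move_to_end(expert_id)   # mark as most recently used
            self.hits += 1
            return True
        # Miss: a real system would copy the expert's weights host -> device here.
        self.misses += 1
        if len(self.lru) >= self.lru_slots:
            self.lru.popitem(last=False)      # evict the least recently used expert
        self.lru[expert_id] = True
        return False


cache = ExpertCache(hot_experts=[0, 1], lru_slots=2)
for expert in [0, 3, 3, 5, 1, 7, 3]:
    cache.fetch(expert)
print(f"hits={cache.hits} misses={cache.misses} resident_cold={list(cache.lru)}")
# prints: hits=3 misses=4 resident_cold=[7, 3]
```

The last access to expert 3 is a miss even though it was loaded earlier: it had been evicted to make room for expert 7. That is the kind of thrashing a larger LRU would avoid, at the cost of more VRAM.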

Library: mixtral-offloading (Eliseev & Mazur, 2024)

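# NOTE: illustrative interface. The import path and keyword arguments below are
# not verified against the mixtral-offloading repository; check its README for
# the actual entry point before running.
from mixtral_offloading import MixtralForCausalLM

model = MixtralForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="cuda:0",
    expert_offload="cpu",        # keep cold experts in CPU RAM
    hot_experts_per_layer=2,     # number of experts kept on the GPU per layer
    cache_size=4,                # cache size; least recently used experts are evicted first
)
# RTX 4090 + 64 GB RAM: runs comfortably, ~12 GB VRAM
# Throughput: ~25-40 tok/s (slows down on cold-cache misses)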
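The slowdown on cache misses can be estimated from the size of one expert and the host-to-device bandwidth. The sketch below uses Mixtral's published hidden size (4096) and FFN size (14336). The 25 GB/s effective PCIe bandwidth and the 10% miss rate are assumptions chosen only for illustration, not measurements.

```python
HIDDEN = 4096
FFN = 14336
PARAMS_PER_EXPERT = 3 * HIDDEN * FFN   # gate, up and down projections


def expert_bytes(bits_per_weight: float) -> float:
    """Size of one expert's weights in bytes at the given precision."""
    return PARAMS_PER_EXPERT * bits_per_weight / 8


def transfer_ms(bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Time to copy one expert host -> device, in milliseconds."""
    return expert_bytes(bits_per_weight) / (bandwidth_gb_s * 1e9) * 1e3


def added_ms_per_token(miss_rate, bits_per_weight, bandwidth_gb_s,
                       layers=32, top_k=2):
    """Expected extra latency per token from expert transfers alone."""
    return layers * top_k * miss_rate * transfer_ms(bits_per_weight, bandwidth_gb_s)


ASSUMED_BANDWIDTH_GB_S = 25.0   # assumption: effective PCIe throughput
ASSUMED_MISS_RATE = 0.10        # assumption: 1 in 10 expert lookups misses

print(f"one int4 expert: {expert_bytes(4) / 1e6:.1f} MB")
print(f"transfer time:   {transfer_ms(4, ASSUMED_BANDWIDTH_GB_S):.2f} ms")
print(f"added per token: "
      f"{added_ms_per_token(ASSUMED_MISS_RATE, 4, ASSUMED_BANDWIDTH_GB_S):.1f} ms")
```

Under these assumptions, transfers alone add tens of milliseconds per token, so the miss rate drives throughput more than raw compute does. Real numbers depend on the quantization format, pinned-memory usage, and how well transfers overlap with computation.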

✅ Part V complete

1) Try serving Mixtral AWQ with vLLM.
2) Specialization probe + mixture analysis.
3) Next part: Part VI, Vision-Language Multimodal FT (Llama 3.2 Vision, Qwen 2.5-VL, Pixtral).
