Sparse Upcycling: Converting a Dense Model to MoE — Qwen2-MoE Technique Reconstruction
Sparse Upcycling (Komatsuzaki et al. 2022) — convert a dense pre-trained model to MoE, then continue pre-training so the experts specialize. Copy the existing FFN N times, add a router, and keep training. Much cheaper than pre-training from scratch. Qwen 2.5 7B → 7B-MoE (8 experts) conversion lab on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
1. The Logic of Sparse Upcycling
Pre-training from scratch is prohibitively expensive (15T tokens, millions of dollars). Sparse upcycling is the alternative:
Step 1: Take an existing dense model (e.g., Qwen 2.5 7B).
Step 2: Create 8 copies of each FFN (all identical).
Step 3: Add a randomly initialized router.
Step 4: Continual pre-training (~100B-500B tokens) — the router learns through gradients, and the experts diverge (specialization).
Step 5: Traditional FT + alignment.
Cost: 5-10% of a from-scratch pre-train. Quality: 5-15% better than the dense model (capacity has increased).
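To see why capacity grows while per-token compute stays modest, it helps to count parameters before looking at the conversion code below. A minimal back-of-the-envelope sketch, assuming Qwen 2.5 7B-like dimensions (hidden_size 3584, intermediate_size 18944, 28 layers — verify against the actual config before relying on these numbers):

```python
# Back-of-the-envelope parameter count for dense -> 8-expert upcycling.
# Assumed Qwen 2.5 7B-like dimensions -- check the real model config.
hidden, intermediate, layers = 3584, 18944, 28
num_experts, top_k = 8, 2

# SwiGLU FFN: gate_proj + up_proj + down_proj
ffn_params = 3 * hidden * intermediate              # per layer
dense_ffn_total = layers * ffn_params

moe_ffn_total = layers * num_experts * ffn_params   # all expert copies
active_ffn = layers * top_k * ffn_params            # used per token
router = layers * hidden * num_experts              # negligible

print(f"dense FFN params : {dense_ffn_total / 1e9:.1f}B")
print(f"MoE FFN params   : {moe_ffn_total / 1e9:.1f}B (total)")
print(f"active per token : {active_ffn / 1e9:.1f}B (top-{top_k})")
```

The FFN blocks grow roughly 8× in total parameters, but each token only activates the top-2 experts, so per-token FLOPs in the FFN grow only ~2×.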
```python
# === Dense Qwen 2.5 7B → 7B-MoE Upcycling ===
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# 1. Load the dense model
dense = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
)

# 2. Convert each FFN to MoE (8 expert copies)
class UpcycledMoE(nn.Module):
    def __init__(self, original_ffn, num_experts=8):
        super().__init__()
        self.num_experts = num_experts
        # 8 identical FFN copies
        self.experts = nn.ModuleList([
            type(original_ffn)(original_ffn.config) for _ in range(num_experts)
        ])
        for expert in self.experts:
            expert.load_state_dict(original_ffn.state_dict())
        # Randomly initialized router
        self.router = nn.Linear(original_ffn.config.hidden_size, num_experts)
        nn.init.normal_(self.router.weight, std=0.02)

    def forward(self, x):
        # x: [batch, seq, hidden]
        b, s, h = x.shape
        x_flat = x.view(-1, h)
        # Router: pick the top-2 experts per token
        logits = self.router(x_flat)
        top_k = 2
        top_k_logits, top_k_indices = logits.topk(top_k, dim=-1)
        weights = torch.softmax(top_k_logits, dim=-1)
        # Aggregate expert outputs
        out = torch.zeros_like(x_flat)
        for i in range(self.num_experts):
            mask = (top_k_indices == i).any(dim=-1)
            if mask.any():
                expert_out = self.experts[i](x_flat[mask])
                w = weights[mask][top_k_indices[mask] == i].unsqueeze(-1)
                out[mask] += w * expert_out
        return out.view(b, s, h)

# Replace each FFN
for layer in dense.model.layers:
    original_ffn = layer.mlp
    layer.mlp = UpcycledMoE(original_ffn, num_experts=8)

# Now: continual pre-train with 50-100B tokens (cookbook-suggested).
# The experts will diverge and the router will learn.
```
Sparse upcycling — dense → MoE conversion code
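One thing the conversion code above deliberately omits: during continual pre-training, a top-k router over identical experts can collapse onto one expert. The standard remedy is a load-balancing auxiliary loss in the style of Switch Transformers (Fedus et al. 2021). A minimal sketch, assuming you expose the router logits and top-k indices from each MoE layer (the wiring and the 0.01 coefficient are illustrative, not part of the original code):

```python
import torch

def load_balancing_loss(router_logits, top_k_indices):
    """Switch-style auxiliary loss: E * sum_i(f_i * P_i).

    f_i = fraction of routing slots assigned to expert i
    P_i = mean router probability mass on expert i
    The loss is minimized (value 1.0) when both are uniform at 1/E.
    """
    num_experts = router_logits.shape[-1]
    top_k = top_k_indices.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)          # [tokens, E]
    # f_i: how often expert i appears among the top-k choices
    one_hot = torch.zeros_like(probs)
    one_hot.scatter_(-1, top_k_indices, 1.0)
    f = one_hot.sum(dim=0) / (one_hot.shape[0] * top_k)   # sums to 1
    # P_i: average probability the router assigns to expert i
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

# Continual PT objective (a coefficient around 0.01 is a common start):
#   loss = lm_loss + 0.01 * sum(load_balancing_loss(...) per MoE layer)
```

Normalization conventions vary between implementations (Switch uses top-1; Mixtral-style code averages over all top-k slots, as here), so treat the exact scaling as a tunable detail.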
✅ Deliverables
1) Upcycle Qwen2.5 1.5B to an 8-expert MoE. 2) 100M tokens of continual pre-training. 3) Expert balance metrics (see the sketch below). 4) Next lesson: 5.6 — Expert Specialization Probe.
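For deliverable 3, a simple balance metric is each expert's share of routing slots and its deviation from uniform. A minimal sketch, assuming you collect `top_k_indices` from each `UpcycledMoE.forward` (e.g., via a forward hook; the hook wiring is left out):

```python
import torch

@torch.no_grad()
def expert_balance(top_k_indices, num_experts=8):
    """Fraction of routing slots each expert receives; uniform = 1/E."""
    counts = torch.bincount(top_k_indices.flatten(), minlength=num_experts)
    share = counts.float() / counts.sum()
    # Max/mean ratio: 1.0 = perfectly balanced, >> 1.0 = router collapse
    return share, (share.max() / share.mean()).item()

# Right after upcycling, the experts are identical, so routing is driven
# by the random router init and should be near-uniform. Track this metric
# during continual PT to catch collapse early (synthetic indices shown):
share, imbalance = expert_balance(torch.randint(0, 8, (1024, 2)))
print(share, imbalance)
```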