
Sparse Upcycling: Converting Dense Model to MoE — Qwen2-MoE Technique Reconstruction

Sparse Upcycling (Komatsuzaki et al. 2022) — convert a dense pre-trained model to MoE, then continue pre-training so the experts specialize. Copy the existing FFN N times, add a router, continue training. Cheaper than pre-training from scratch. Qwen 2.5 7B → 7B-MoE (8 experts) conversion lab on an RTX 4090.

Şükrü Yusuf KAYA
28 min read
Advanced

1. The Logic of Sparse Upcycling

Pre-training from scratch is very expensive (15T tokens, millions of dollars). The sparse upcycling alternative:

Step 1: Take an existing dense model (e.g., Qwen 2.5 7B)
Step 2: Create 8 copies of each FFN (all identical)
Step 3: Add a randomly initialized router
Step 4: Continual pre-train (~100B-500B tokens)
  - The router learns via gradients
  - The experts diverge (specialization)
Step 5: Standard fine-tuning + alignment

Cost: 5-10% of a scratch pre-train. Quality: 5-15% above the dense model (capacity has grown).
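The capacity math behind these steps can be sketched with back-of-envelope arithmetic: total FFN parameters grow 8×, but per-token compute only grows by the top-k factor. The figures below use the published Qwen 2.5 7B config values (hidden 3584, intermediate 18944, 28 layers); treat them as illustrative, not exact checkpoint counts.

```python
# Total vs. active FFN parameters after 8-expert, top-2 upcycling
# (illustrative arithmetic with Qwen 2.5 7B config values)
hidden = 3584          # Qwen 2.5 7B hidden size
intermediate = 18944   # Qwen 2.5 7B FFN intermediate size
num_layers = 28
num_experts, top_k = 8, 2

# Gated FFN (SwiGLU): gate_proj + up_proj + down_proj
ffn_params = 3 * hidden * intermediate
dense_ffn_total = num_layers * ffn_params
moe_ffn_total = num_layers * num_experts * ffn_params
moe_ffn_active = num_layers * top_k * ffn_params  # touched per token

print(f"dense FFN params : {dense_ffn_total / 1e9:.1f}B")
print(f"MoE FFN total    : {moe_ffn_total / 1e9:.1f}B")
print(f"MoE FFN active   : {moe_ffn_active / 1e9:.1f}B per token")
```

Total parameters (and memory) grow 8×, while per-token FLOPs only double — the source of the quality gain at near-dense inference cost.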
python
# === Dense Qwen 2.5 7B → 7B-MoE Upcycling ===
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# 1. Load the dense model
dense = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
)

# 2. Convert each FFN into an MoE layer (8 expert copies)
class UpcycledMoE(nn.Module):
    def __init__(self, original_ffn, config, num_experts=8):
        super().__init__()
        self.num_experts = num_experts
        dtype = next(original_ffn.parameters()).dtype

        # 8 identical FFN copies (Qwen2MLP does not keep a reference
        # to the config, so it is passed in explicitly)
        self.experts = nn.ModuleList([
            type(original_ffn)(config) for _ in range(num_experts)
        ])
        for expert in self.experts:
            expert.load_state_dict(original_ffn.state_dict())
            expert.to(dtype)  # match the bf16 dense weights

        # Randomly initialized router
        self.router = nn.Linear(config.hidden_size, num_experts, dtype=dtype)
        nn.init.normal_(self.router.weight, std=0.02)

    def forward(self, x):
        # x: [batch, seq, hidden]
        b, s, h = x.shape
        x_flat = x.view(-1, h)

        # Route each token to its top-2 experts
        logits = self.router(x_flat)
        top_k = 2
        top_k_logits, top_k_indices = logits.topk(top_k, dim=-1)
        weights = torch.softmax(top_k_logits, dim=-1)

        # Aggregate expert outputs
        out = torch.zeros_like(x_flat)
        for i in range(self.num_experts):
            mask = (top_k_indices == i).any(dim=-1)
            if mask.any():
                expert_out = self.experts[i](x_flat[mask])
                # Weight assigned to expert i for each selected token
                w = weights[mask][top_k_indices[mask] == i].unsqueeze(-1)
                out[mask] += w * expert_out
        return out.view(b, s, h)

# Replace each FFN
for layer in dense.model.layers:
    layer.mlp = UpcycledMoE(layer.mlp, dense.config, num_experts=8)

# Next: continual pre-training on 50-100B tokens (cookbook-suggested)
# The experts will diverge and the router will learn to route
Sparse upcycling — dense → MoE conversion code
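A useful sanity check before continual pre-training: because all experts start as identical copies and the top-2 softmax weights sum to 1 per token, the upcycled layer should reproduce the original FFN's output exactly at initialization. The sketch below demonstrates this identity with a small toy FFN (a hypothetical stand-in for the Qwen MLP, so it runs without downloading the 7B checkpoint).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy FFN standing in for the Qwen MLP (hypothetical, for a quick check)
class ToyFFN(nn.Module):
    def __init__(self, hidden=16, inter=64):
        super().__init__()
        self.up = nn.Linear(hidden, inter)
        self.down = nn.Linear(inter, hidden)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

torch.manual_seed(0)
ffn = ToyFFN()
experts = [ToyFFN() for _ in range(8)]
for e in experts:
    e.load_state_dict(ffn.state_dict())  # identical copies, as in upcycling

x = torch.randn(4, 16)
router = nn.Linear(16, 8)  # random init, as after upcycling
top_v, top_i = router(x).topk(2, dim=-1)
w = torch.softmax(top_v, dim=-1)  # top-2 weights sum to 1 per token

# Top-2 mixture of identical experts should equal the dense FFN output
out_moe = torch.zeros(4, 16)
for b in range(4):
    for j in range(2):
        out_moe[b] += w[b, j] * experts[top_i[b, j]](x[b])

assert torch.allclose(out_moe, ffn(x), atol=1e-5)
print("upcycled layer matches dense FFN at init")
```

If this identity does not hold in your implementation, the router weighting or expert copying is wrong — worth catching before spending GPU-hours on continual pre-training.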
✅ Deliverables
  1. Upcycle Qwen2.5 1.5B to an 8-expert MoE.
  2. Continual pre-train on 100M tokens.
  3. Report expert balance metrics.
  4. Next lesson: 5.6 — Expert Specialization Probe.
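For the expert balance metrics in the deliverables, a minimal sketch: per-expert load fraction plus normalized routing entropy computed from top-k indices. The `expert_balance` helper and the synthetic router logits are illustrative, not part of the lab code above.

```python
import torch

def expert_balance(top_k_indices, num_experts=8):
    """Per-expert load fraction and normalized entropy of the routing
    distribution. Uniform routing -> entropy near 1.0; collapse -> near 0."""
    counts = torch.bincount(
        top_k_indices.flatten(), minlength=num_experts
    ).float()
    load = counts / counts.sum()
    entropy = -(load * (load + 1e-12).log()).sum()
    entropy /= torch.log(torch.tensor(float(num_experts)))
    return load, entropy.item()

# Synthetic example: route 1000 tokens with top-2 over random logits
torch.manual_seed(0)
logits = torch.randn(1000, 8)
_, idx = logits.topk(2, dim=-1)
load, ent = expert_balance(idx)
print("load per expert   :", [round(v, 3) for v in load.tolist()])
print("normalized entropy:", round(ent, 3))
```

Track these during continual pre-training: a falling entropy or one expert hogging most tokens signals router collapse, the usual failure mode when upcycling without a load-balancing auxiliary loss.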
