Sparse Upcycling: Converting a Dense Model to MoE — Qwen2-MoE Technique Reconstruction
Sparse Upcycling (Komatsuzaki et al. 2022) — convert a dense pre-trained model to MoE, then continue pre-training so the experts specialize. Copy the existing FFN N times, add a router, and keep training. Much cheaper than pre-training from scratch. Qwen 2.5 7B → 7B-MoE (8 experts) conversion lab on an RTX 4090.
Şükrü Yusuf KAYA
28 min read
1. The Logic of Sparse Upcycling
Pre-training from scratch is prohibitively expensive (15T tokens, millions of dollars). Sparse upcycling is the alternative:
Step 1: Take an existing dense model (e.g., Qwen 2.5 7B).
Step 2: Create 8 copies of each FFN (all identical).
Step 3: Add a randomly initialized router.
Step 4: Continual pre-training (~100B-500B tokens) — the router learns through gradients, and the experts diverge (specialization).
Step 5: Traditional FT + alignment.
Cost: 5-10% of a from-scratch pre-train. Quality: 5-15% better than the dense model (capacity has increased).
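To see why capacity grows while per-token compute stays modest, it helps to count parameters before looking at the conversion code below. A minimal back-of-the-envelope sketch, assuming Qwen 2.5 7B-like dimensions (hidden_size 3584, intermediate_size 18944, 28 layers — verify against the actual config before relying on these numbers):

```python
# Back-of-the-envelope parameter count for dense -> 8-expert upcycling.
# Assumed Qwen 2.5 7B-like dimensions -- check the real model config.
hidden, intermediate, layers = 3584, 18944, 28
num_experts, top_k = 8, 2

# SwiGLU FFN: gate_proj + up_proj + down_proj
ffn_params = 3 * hidden * intermediate              # per layer
dense_ffn_total = layers * ffn_params

moe_ffn_total = layers * num_experts * ffn_params   # all expert copies
active_ffn = layers * top_k * ffn_params            # used per token
router = layers * hidden * num_experts              # negligible

print(f"dense FFN params : {dense_ffn_total / 1e9:.1f}B")
print(f"MoE FFN params   : {moe_ffn_total / 1e9:.1f}B (total)")
print(f"active per token : {active_ffn / 1e9:.1f}B (top-{top_k})")
```

The FFN blocks grow roughly 8× in total parameters, but each token only activates the top-2 experts, so per-token FLOPs in the FFN grow only ~2×.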
```python
# === Dense Qwen 2.5 7B → 7B-MoE Upcycling ===
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# 1. Load the dense model
dense = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
)

# 2. Convert each FFN to MoE (8 expert copies)
class UpcycledMoE(nn.Module):
    def __init__(self, original_ffn, num_experts=8):
        super().__init__()
        self.num_experts = num_experts
        # 8 identical FFN copies
        self.experts = nn.ModuleList([
            type(original_ffn)(original_ffn.config) for _ in range(num_experts)
        ])
        for expert in self.experts:
            expert.load_state_dict(original_ffn.state_dict())
        # Randomly initialized router
        self.router = nn.Linear(original_ffn.config.hidden_size, num_experts)
        nn.init.normal_(self.router.weight, std=0.02)

    def forward(self, x):
        # x: [batch, seq, hidden]
        b, s, h = x.shape
        x_flat = x.view(-1, h)
        # Router: pick the top-2 experts per token
        logits = self.router(x_flat)
        top_k = 2
        top_k_logits, top_k_indices = logits.topk(top_k, dim=-1)
        weights = torch.softmax(top_k_logits, dim=-1)
        # Aggregate expert outputs
        out = torch.zeros_like(x_flat)
        for i in range(self.num_experts):
            mask = (top_k_indices == i).any(dim=-1)
            if mask.any():
                expert_out = self.experts[i](x_flat[mask])
                w = weights[mask][top_k_indices[mask] == i].unsqueeze(-1)
                out[mask] += w * expert_out
        return out.view(b, s, h)

# Replace each FFN
for layer in dense.model.layers:
    original_ffn = layer.mlp
    layer.mlp = UpcycledMoE(original_ffn, num_experts=8)

# Now: continual pre-train with 50-100B tokens (cookbook-suggested).
# The experts will diverge and the router will learn.
```
Sparse upcycling — dense → MoE conversion code
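One thing the conversion code above deliberately omits: during continual pre-training, a top-k router over identical experts can collapse onto one expert. The standard remedy is a load-balancing auxiliary loss in the style of Switch Transformers (Fedus et al. 2021). A minimal sketch, assuming you expose the router logits and top-k indices from each MoE layer (the wiring and the 0.01 coefficient are illustrative, not part of the original code):

```python
import torch

def load_balancing_loss(router_logits, top_k_indices):
    """Switch-style auxiliary loss: E * sum_i(f_i * P_i).

    f_i = fraction of routing slots assigned to expert i
    P_i = mean router probability mass on expert i
    The loss is minimized (value 1.0) when both are uniform at 1/E.
    """
    num_experts = router_logits.shape[-1]
    top_k = top_k_indices.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)          # [tokens, E]
    # f_i: how often expert i appears among the top-k choices
    one_hot = torch.zeros_like(probs)
    one_hot.scatter_(-1, top_k_indices, 1.0)
    f = one_hot.sum(dim=0) / (one_hot.shape[0] * top_k)   # sums to 1
    # P_i: average probability the router assigns to expert i
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

# Continual PT objective (a coefficient around 0.01 is a common start):
#   loss = lm_loss + 0.01 * sum(load_balancing_loss(...) per MoE layer)
```

Normalization conventions vary between implementations (Switch uses top-1; Mixtral-style code averages over all top-k slots, as here), so treat the exact scaling as a tunable detail.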
✅ Deliverables
1) Upcycle Qwen2.5 1.5B to an 8-expert MoE. 2) 100M tokens of continual pre-training. 3) Expert balance metrics (see the sketch below). 4) Next lesson: 5.6 — Expert Specialization Probe.
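For deliverable 3, a simple balance metric is each expert's share of routing slots and its deviation from uniform. A minimal sketch, assuming you collect `top_k_indices` from each `UpcycledMoE.forward` (e.g., via a forward hook; the hook wiring is left out):

```python
import torch

@torch.no_grad()
def expert_balance(top_k_indices, num_experts=8):
    """Fraction of routing slots each expert receives; uniform = 1/E."""
    counts = torch.bincount(top_k_indices.flatten(), minlength=num_experts)
    share = counts.float() / counts.sum()
    # Max/mean ratio: 1.0 = perfectly balanced, >> 1.0 = router collapse
    return share, (share.max() / share.mean()).item()

# Right after upcycling, the experts are identical, so routing is driven
# by the random router init and should be near-uniform. Track this metric
# during continual PT to catch collapse early (synthetic indices shown):
share, imbalance = expert_balance(torch.randint(0, 8, (1024, 2)))
print(share, imbalance)
```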