
Sparse Upcycling: Converting a Dense Model to MoE — Qwen2-MoE Technique Reconstruction

Sparse Upcycling (Komatsuzaki et al. 2022): convert a dense pre-trained model to MoE, then specialize it with continual pre-training. Copy the existing FFN N times, add a router, and keep training. Far cheaper than pre-training from scratch. A Qwen 2.5 7B → 7B-MoE (8-expert) conversion lab on an RTX 4090.

Şükrü Yusuf KAYA
28 minute read
Advanced

1. The Logic of Sparse Upcycling

Pre-training from scratch is extremely expensive (15T tokens, millions of dollars). Sparse upcycling is the alternative:

Step 1: Take an existing dense model (e.g. Qwen 2.5 7B)
Step 2: Create 8 copies of each FFN (all identical)
Step 3: Add a randomly initialized router
Step 4: Continual pre-train (~100B-500B tokens) - the router learns via gradients, and the experts diverge (specialization)
Step 5: Traditional FT + alignment

Cost: 5-10% of a from-scratch pre-train. Quality: 5-15% better than the dense model (capacity has increased).
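To make the capacity claim concrete, here is a back-of-the-envelope parameter count using the published Qwen 2.5 7B dimensions (hidden 3584, intermediate 18944, 28 layers; the 7.6B dense total is an approximation). Total parameters grow roughly 6x, while the parameters *active per token* under top-2 routing grow far less:

```python
# Back-of-the-envelope parameter count for dense -> 8-expert upcycling.
# Qwen 2.5 7B-like dimensions (approximate, for illustration only).
hidden, inter, layers = 3584, 18944, 28

ffn_per_layer = 3 * hidden * inter          # gate_proj + up_proj + down_proj
ffn_total = layers * ffn_per_layer          # all dense FFN params (~5.7B)
dense_total = 7.6e9                         # approx. full dense param count

num_experts, top_k = 8, 2
# Upcycling adds (num_experts - 1) extra FFN copies; router params are negligible.
moe_total = dense_total + (num_experts - 1) * ffn_total
# Per token, only top_k experts run, so active params stay close to dense.
active_per_token = dense_total + (top_k - 1) * ffn_total

print(f"dense FFN params: {ffn_total / 1e9:.1f}B")
print(f"MoE total params: {moe_total / 1e9:.1f}B")
print(f"active per token: {active_per_token / 1e9:.1f}B")
```

So an 8-expert upcycle of a 7B dense model lands around ~47B total parameters but only ~13B active per token, which is where the quality-per-FLOP win comes from.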
```python
# === Dense Qwen 2.5 7B → 7B-MoE Upcycling ===
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# 1. Load the dense model
dense = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
)

# 2. Convert each FFN to MoE (8 expert copies)
class UpcycledMoE(nn.Module):
    def __init__(self, original_ffn, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # 8 identical FFN copies
        self.experts = nn.ModuleList([
            type(original_ffn)(original_ffn.config)
            for _ in range(num_experts)
        ])
        for expert in self.experts:
            expert.load_state_dict(original_ffn.state_dict())

        # Randomly initialized router
        self.router = nn.Linear(original_ffn.config.hidden_size, num_experts)
        nn.init.normal_(self.router.weight, std=0.02)

    def forward(self, x):
        # x: [batch, seq, hidden]
        b, s, h = x.shape
        x_flat = x.view(-1, h)

        # Router: pick top-k experts per token
        logits = self.router(x_flat)
        top_k_logits, top_k_indices = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(top_k_logits, dim=-1)

        # Aggregate expert outputs
        out = torch.zeros_like(x_flat)
        for i in range(self.num_experts):
            mask = (top_k_indices == i).any(dim=-1)
            if mask.any():
                expert_out = self.experts[i](x_flat[mask])
                w = weights[mask][top_k_indices[mask] == i].unsqueeze(-1)
                out[mask] += w * expert_out
        return out.view(b, s, h)

# Replace each layer's FFN
for layer in dense.model.layers:
    original_ffn = layer.mlp
    layer.mlp = UpcycledMoE(original_ffn, num_experts=8)

# Next: continual pre-train on 50-100B tokens (cookbook suggestion)
# The experts will diverge, and the router will learn
```
Sparse upcycling — dense → MoE conversion code
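During the continual pre-training step, the router needs a load-balancing auxiliary loss (as in Switch Transformer and Qwen2-MoE), or it tends to collapse onto a few experts. A minimal sketch of the standard formulation, assuming you collect the router logits and top-k indices from each MoE layer's forward pass (`load_balancing_loss` is an illustrative helper, not part of the code above):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top_k_indices, num_experts):
    """Switch-style aux loss: (fraction of assignments per expert) x (mean router prob).

    router_logits:  [tokens, num_experts]
    top_k_indices:  [tokens, top_k]
    Perfectly balanced routing gives a value of ~1.0; imbalance pushes it up.
    """
    probs = F.softmax(router_logits, dim=-1)                     # [tokens, E]
    one_hot = F.one_hot(top_k_indices, num_experts).float()      # [tokens, k, E]
    frac_assigned = one_hot.sum(dim=(0, 1)) / one_hot.sum()      # [E]
    mean_prob = probs.mean(dim=0)                                # [E]
    return num_experts * (frac_assigned * mean_prob).sum()

# During training, this would be added to the LM loss with a small
# coefficient, e.g. total_loss = lm_loss + 0.01 * aux_loss.
```

The product form is differentiable through `mean_prob`, so the gradient nudges the router toward spreading probability mass across underused experts.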
✅ Deliverable
  1. Upcycle Qwen2.5 1.5B to an 8-expert MoE.
  2. Continual PT on 100M tokens.
  3. Expert balance metrics.
  4. Next lesson: 5.6 — Expert Specialization Probe.
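For the expert balance metrics in item 3, a simple starting point is to log the router's top-k indices over an evaluation set, then report per-expert load fractions and the normalized entropy of that distribution (`expert_balance` below is a hypothetical helper; 1.0 means perfectly balanced, values near 0 mean collapse onto one expert):

```python
import torch

def expert_balance(top_k_indices, num_experts):
    """Per-expert load fractions + normalized entropy of the load distribution.

    top_k_indices: integer tensor of routing assignments (any shape).
    Returns (load: [num_experts], normalized_entropy: float in [0, 1]).
    """
    counts = torch.bincount(top_k_indices.flatten(), minlength=num_experts).float()
    load = counts / counts.sum()
    entropy = -(load * (load + 1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(num_experts)))
    return load, (entropy / max_entropy).item()

# Example: perfectly balanced assignments over 8 experts
idx = torch.arange(8).repeat(16)
load, norm_ent = expert_balance(idx, 8)
print(load, norm_ent)  # uniform 0.125 loads, normalized entropy ~1.0
```

Tracking this per layer during continual PT makes it easy to see whether the aux loss is actually keeping the upcycled experts in use.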
