
Qwen3-MoE + Llama-4-MoE Pattern: Generic MoE FT Recipe (8×H100 Baseline)

Qwen3-MoE (30B-A3B, 235B-A22B) and Llama-4-MoE (Scout, Maverick, Behemoth) are 2025's new generation of Mixture-of-Experts models. This lesson distills a generic MoE fine-tuning pattern, applying the same discipline to any MoE: a common chat template, router-aware LoRA, and expert-targeted SFT, packaged as an 8×H100 baseline recipe.

Şükrü Yusuf KAYA
26 min read
Advanced

1. The New MoE Generation (2025-2026)

| Model | Total params | Active | Experts | Top-K | RTX 4090 Lab? |
|---|---|---|---|---|---|
| Mixtral 8×7B | 46.7B | 12.9B | 8 | 2 | QLoRA marginal (~22 GB) |
| Mixtral 8×22B | 141B | 39B | 8 | 2 | Cloud only |
| DeepSeek-V3 | 671B | 37B | 256+1 | 8 | Cloud (16×H100) |
| Qwen3 30B-A3B | 30B | 3B | 128 | 8 | QLoRA marginal (~16 GB) |
| Qwen3 235B-A22B | 235B | 22B | 128 | 8 | Cloud (8×H100) |
| Llama-4 Scout | 109B | 17B | 16 | 1 | Cloud (4×H100) |
| Llama-4 Maverick | 400B | 17B | 128 | 1 | Cloud (16×H100) |
| Llama-4 Behemoth (preview) | 2T | 288B | 16 | 1 | Cloud (64×H100+) |
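The "RTX 4090 Lab?" column can be sanity-checked with back-of-envelope arithmetic: at 4-bit quantization, the base weights alone cost roughly total_params × 0.5 bytes. A minimal sketch (the function name is mine; this is a lower bound that ignores activations, KV cache, and LoRA/optimizer state):

```python
def qlora_weight_gib(total_params_b: float, bits: float = 4.0) -> float:
    """Lower-bound weight memory for a 4-bit-quantized base model, in GiB.
    Ignores higher-precision embeddings/norms, activations, KV cache, and
    LoRA adapter/optimizer state, which add a few GiB on top."""
    return total_params_b * 1e9 * bits / 8 / 2**30

print(qlora_weight_gib(46.7))  # Mixtral 8x7B: ~21.7 GiB of weights
print(qlora_weight_gib(30.0))  # Qwen3 30B-A3B: ~14 GiB of weights
```

With a few GiB of runtime overhead on top, both land right at the edge of a 24 GB card, which is why the table calls them "marginal".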
Decision: for beginners, Mixtral 8×7B or Qwen3 30B-A3B; both are feasible with 1-2 consumer GPUs plus cloud spillover.
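The recipe's "router-aware LoRA" step can be sketched as a target-module policy: adapt the attention and expert MLP projections but leave the router linear frozen. The module names below follow the Hugging Face Qwen3-MoE implementation and are an assumption; verify them against `model.named_modules()` for your exact checkpoint.

```python
# Router-aware LoRA targeting sketch for a Qwen3-MoE checkpoint (names assumed).
ATTN_TARGETS = ["q_proj", "k_proj", "v_proj", "o_proj"]
EXPERT_TARGETS = ["gate_proj", "up_proj", "down_proj"]  # inside each expert MLP

def lora_target_modules(include_experts: bool = True) -> list:
    """Suffix-matched module names for peft.LoraConfig(target_modules=...).
    The router linear ('gate' in the sparse MoE block) is deliberately
    excluded: adapting it can destabilize load balance early in SFT."""
    targets = list(ATTN_TARGETS)
    if include_experts:
        targets += EXPERT_TARGETS
    return targets

# Usage (assumes peft is installed):
# from peft import LoraConfig
# cfg = LoraConfig(r=16, lora_alpha=32, target_modules=lora_target_modules())
```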
✅ Deliverables
1. Mini fine-tune Qwen3 30B-A3B on 1×H100 80GB.
2. Monitor the aux-loss balance metrics.
3. Next lesson: 5.5, Sparse Upcycling (Dense → MoE).
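Deliverable 2's balance monitoring can be sketched with a Switch-style auxiliary load-balancing loss over the router logits (the function name and exact formula are my assumption; the Hugging Face Qwen3-MoE implementation computes a similar quantity internally):

```python
import numpy as np

def load_balance_aux_loss(router_logits: np.ndarray, top_k: int) -> float:
    """Switch-style balance loss: num_experts * sum_i f_i * P_i, where
    f_i is the fraction of dispatched token-slots hitting expert i and
    P_i is the mean router probability of expert i.
    Perfectly balanced routing gives ~1.0; router collapse drives it
    toward num_experts."""
    num_tokens, num_experts = router_logits.shape
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)          # softmax over experts
    topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]   # top-k experts per token
    counts = np.bincount(topk_idx.ravel(), minlength=num_experts)
    f = counts / (num_tokens * top_k)                   # dispatch fractions
    P = probs.mean(axis=0)                              # mean router probability
    return num_experts * float((f * P).sum())
```

Logging this per layer during the mini-FT run makes collapse visible early: a healthy run hovers near 1.0, while a value drifting toward the expert count means a few experts are absorbing all the traffic.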
