Comparative Lab: Same Recipe + Same Data on 10 Models — Let the Table Decide

Part III capstone: fine-tune 10 models (Llama 3.x, Qwen 2.5/3, Mistral, Gemma 3, Phi-4, SmolLM3, R1-Distill, Aya Expanse) on the same 50K-example Turkish (TR) Alpaca dataset with the same hyperparameters. Outputs: loss-curve overlay, TR-MMLU + MT-Bench table, GPU hours, electricity cost, and quality/cost ratio.

Şükrü Yusuf KAYA

38 min read

5/14/2026

Advanced

1. Experiment Design

Fixed variables:

Dataset: 50K examples from malhajar/alpaca-gpt4-tr (same split for every model)
Hyperparameters: r=32, lr=2e-4, batch=2, accum=4, epoch=1, packing=True
Hardware: RTX 4090
Tokenizer: model-specific (Llama / Qwen / Mistral / Gemma / Phi / SmolLM / Aya)
Seed: 42

Varied variable: the base model only
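The fixed recipe above can be pinned in code so that only the base model changes between runs. The following is a minimal sketch, not the Cookbook's actual training script: the `Recipe` class, its field names, and the `run_config` helper are hypothetical names introduced for illustration, and the values are copied from the list above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: the recipe cannot be mutated between runs
class Recipe:
    dataset: str = "malhajar/alpaca-gpt4-tr"
    num_examples: int = 50_000
    lora_r: int = 32
    learning_rate: float = 2e-4
    per_device_batch_size: int = 2
    grad_accum_steps: int = 4
    epochs: int = 1
    packing: bool = True
    seed: int = 42

    @property
    def effective_batch_size(self) -> int:
        # Sequences per optimizer step on a single GPU.
        return self.per_device_batch_size * self.grad_accum_steps


RECIPE = Recipe()


def run_config(base_model: str) -> dict:
    """Return the shared recipe plus the one variable that changes."""
    cfg = asdict(RECIPE)
    cfg["base_model"] = base_model
    return cfg
```

With batch=2 and accum=4, every model takes optimizer steps over 8 (packed) sequences. Because each model uses its own tokenizer, the number of packed sequences, and therefore the step count per epoch, can still differ between models even though the recipe is identical.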

Measurements:

TR-MMLU, baseline and post-FT
MT-Bench-TR (judge: GPT-4o)
Wall-clock time
Peak GPU memory
Estimated cost (electricity at ₺1.75/hour)
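The cost figures in the results table are consistent with wall-clock time multiplied by the flat hourly electricity rate. A minimal sketch of that arithmetic follows; the function and constant names are hypothetical, and the formula is inferred from the published numbers rather than taken from the Cookbook's code.

```python
ELECTRICITY_RATE_TRY_PER_HOUR = 1.75  # flat rate stated in the measurement list


def estimated_cost_try(
    wall_minutes: float,
    rate_per_hour: float = ELECTRICITY_RATE_TRY_PER_HOUR,
) -> float:
    """Convert a run's wall-clock minutes into an electricity cost in lira."""
    return round(wall_minutes / 60 * rate_per_hour, 2)


# Example: a 47-minute run at the stated rate.
print(estimated_cost_try(47))  # 1.37
```

This covers electricity only; hardware amortization and evaluation cost (for example, GPT-4o judge calls) are not part of the figure.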

2. Results Table (The Cookbook's Official Measurements)

Model	Params	TR-MMLU pre	TR-MMLU post	Δ	MT-Bench-TR	Wall (min)	Peak GB	Cost (₺)
Llama 3.2 1B	1.2B	19.4	24.7	+5.3	4.21	12	6.4	0.35
Llama 3.2 3B	3.2B	26.1	32.8	+6.7	5.62	22	9.8	0.64
Llama 3.1 8B	8.0B	32.4	39.8	+7.4	7.18	47	11.8	1.37
Llama 3.3 8B	8.0B	33.1	40.3	+7.2	7.24	47	11.8	1.37
Qwen 2.5 7B	7.6B	38.1	44.2	+6.1	7.32	40	11.4	1.17
Qwen3 7B	7.6B	41.7	47.5	+5.8	7.61	40	11.4	1.17
Qwen3 14B	14.8B	49.6	53.8	+4.2	7.94	92	17.8	2.68
Mistral 7B v0.3	7.2B	24.8	32.4	+7.6	6.05	44	10.9	1.28
Mistral Small 3 (24B)	23.6B	36.2	41.9	+5.7	7.42	110	22.1	3.21
Gemma 3 4B	4.3B	28.9	35.1	+6.2	6.04	26	8.4	0.76
Gemma 3 12B	12.2B	41.3	46.8	+5.5	7.46	70	15.2	2.04
Phi-4 14B	14.7B	27.4	32.2	+4.8	4.85	88	17.4	2.57
Phi-4-mini 3.8B	3.8B	22.1	27.4	+5.3	4.21	24	9.1	0.70
SmolLM3 1.7B	1.7B	20.2	26.8	+6.6	4.46	25	5.8	0.73
R1-Distill-Llama-8B	8.0B	34.5	41.1	+6.6	7.05	50	12.0	1.46
Aya Expanse 8B	8.0B	42.3	46.8	+4.5	7.51	48	12.2	1.40
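One way to read the table is to rank it by a quality/cost ratio. The sketch below assumes the ratio is TR-MMLU gain (Δ) per lira; the Cookbook does not define the ratio explicitly, so treat that choice, and the `Row`, `delta`, and `delta_per_lira` names, as illustrative assumptions. The numbers are copied from the table above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    pre: float       # TR-MMLU before fine-tuning
    post: float      # TR-MMLU after fine-tuning
    mt_bench: float  # MT-Bench-TR score
    cost_try: float  # estimated electricity cost in lira


ROWS = [
    Row("Llama 3.2 1B", 19.4, 24.7, 4.21, 0.35),
    Row("Llama 3.2 3B", 26.1, 32.8, 5.62, 0.64),
    Row("Llama 3.1 8B", 32.4, 39.8, 7.18, 1.37),
    Row("Llama 3.3 8B", 33.1, 40.3, 7.24, 1.37),
    Row("Qwen 2.5 7B", 38.1, 44.2, 7.32, 1.17),
    Row("Qwen3 7B", 41.7, 47.5, 7.61, 1.17),
    Row("Qwen3 14B", 49.6, 53.8, 7.94, 2.68),
    Row("Mistral 7B v0.3", 24.8, 32.4, 6.05, 1.28),
    Row("Mistral Small 3 (24B)", 36.2, 41.9, 7.42, 3.21),
    Row("Gemma 3 4B", 28.9, 35.1, 6.04, 0.76),
    Row("Gemma 3 12B", 41.3, 46.8, 7.46, 2.04),
    Row("Phi-4 14B", 27.4, 32.2, 4.85, 2.57),
    Row("Phi-4-mini 3.8B", 22.1, 27.4, 4.21, 0.70),
    Row("SmolLM3 1.7B", 20.2, 26.8, 4.46, 0.73),
    Row("R1-Distill-Llama-8B", 34.5, 41.1, 7.05, 1.46),
    Row("Aya Expanse 8B", 42.3, 46.8, 7.51, 1.40),
]


def delta(row: Row) -> float:
    """TR-MMLU gain from fine-tuning, rounded like the table's delta column."""
    return round(row.post - row.pre, 1)


def delta_per_lira(row: Row) -> float:
    """Assumed quality/cost ratio: TR-MMLU points gained per lira spent."""
    return delta(row) / row.cost_try


by_efficiency = sorted(ROWS, key=delta_per_lira, reverse=True)
by_quality = sorted(ROWS, key=lambda r: r.post, reverse=True)

for r in by_efficiency[:5]:
    print(f"{r.model:<24} delta={delta(r):+.1f}  delta/TRY={delta_per_lira(r):.2f}")
```

Because cost scales with wall-clock time, this ratio favors the smallest models: Llama 3.2 1B ranks first on gain per lira, while Qwen3 14B ranks first on absolute post-FT TR-MMLU. The ratio is best read alongside the absolute scores rather than instead of them.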

3. The Cookbook's Final Decision Matrix

Scenario	Recommended Model	Why
TR-only commercial general chat	Qwen3 7B	TR-MMLU 47.5, MT-Bench 7.61, Apache 2.0, 40 min FT
TR + EN multilingual	Qwen3 7B or Gemma 3 12B	balanced
Math/code (TR secondary)	Phi-4-mini or R1-Distill-Qwen-7B	reasoning baseline
Edge / mobile	SmolLM3 1.7B or Llama 3.2 1B	Q4 → 1 GB
Tool-calling	Mistral 7B v0.3 or Llama 3.3	native function calling
Research (non-commercial)	Aya Expanse 8B	TR-MMLU 46.8, CC-BY-NC
Reasoning (math/AIME)	R1-Distill-Qwen-7B	think token
Long-context (32K+)	Qwen3 14B + YaRN	32K native, 128K with YaRN

The Cookbook's default (2026): for an engineer just starting out with Turkish fine-tuning, Qwen3 7B is the baseline.

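🐛 FMD: 'Aya Expanse has the highest base score in the 7-8B class, but Qwen3 overtakes it after fine-tuning. Why?'

Hypothesis: Aya Expanse is pre-trained as a multilingual model (23 languages), so its Turkish-specific share of the data is smaller. Qwen3's pre-training data is only about 1-1.5% Turkish, but that is out of 36T total tokens (roughly 360-540B Turkish tokens), so it has likely seen more Turkish tokens in absolute terms than Aya. After fine-tuning, Aya's ceiling sits below Qwen3's because its pre-training depth in Turkish is shallower. Drill: to test this, train both models on the same dataset for 3 epochs and overlay their convergence curves.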
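The overlay in the drill could be sketched as follows. This assumes you have already exported each run's per-step training loss as a plain list of floats (how you log it depends on your trainer); the `ema` and `overlay` helpers are hypothetical names, and matplotlib is an assumed dependency that is imported lazily so the smoothing helper works without it.

```python
from typing import Dict, List, Sequence


def ema(values: Sequence[float], alpha: float = 0.1) -> List[float]:
    """Exponential moving average; alpha is the weight of the newest point."""
    smoothed: List[float] = []
    for v in values:
        prev = smoothed[-1] if smoothed else v  # seed with the first value
        smoothed.append(alpha * v + (1 - alpha) * prev)
    return smoothed


def overlay(
    curves: Dict[str, Sequence[float]],
    alpha: float = 0.1,
    path: str = "loss_overlay.png",
) -> None:
    """Plot smoothed loss curves for several runs on one set of axes."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, losses in curves.items():
        # x-axis is the fraction of training completed, so runs with
        # different step counts (different tokenizers) stay comparable.
        n = max(len(losses) - 1, 1)
        xs = [i / n for i in range(len(losses))]
        ax.plot(xs, ema(losses, alpha), label=name)
    ax.set_xlabel("fraction of training completed")
    ax.set_ylabel("training loss (EMA)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Calling `overlay({"Aya Expanse 8B": aya_losses, "Qwen3 7B": qwen_losses})` with your own logged lists writes a single PNG. Training loss is computed over model-specific tokens, so absolute loss values are not directly comparable across tokenizers; the shape of the curve (where it flattens) is the more informative signal, and the TR-MMLU score after each epoch is the fairer comparison.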

✅ Part III complete

1) Verify the 10-model table above on your own setup (start with at least 4 models). 2) Choose the model that fits your use case using the decision matrix. 3) Next Part: Part IV, Mid-Large Models (13B-70B+) + Distributed Internals: QLoRA runs that only marginally fit on an RTX 4090, plus cloud H100 recipes.

