Round-trip Eval: Pre/Post Quant Table - TR-MMLU + MT-Bench + Niche Benchmark

The Part X capstone of the cookbook: take the same model in bf16 (reference), AWQ int4, GPTQ int4, EXL2 4.5bpw, GGUF Q4_K_M, and FP8, then compare the variants. Benchmarks: TR-MMLU, MT-Bench-TR, and a niche custom benchmark (Turkish call-center samples). Decision matrix: which quant fits your use case?

Şükrü Yusuf KAYA

30-minute read

14.05.2026

Advanced


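1. Llama 3.1 8B-Instruct Comprehensive Quant Table

Measured on an RTX 4090; quantized variants were calibrated on 256 Turkish samples. Values in parentheses are deltas against the bf16 reference.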

Quant	Size (GB)	TR-MMLU	MT-Bench-TR	WikiText PPL	Tok/s (batch=1)	Tok/s (batch=16)
bf16 (reference)	16.0	32.4	6.42	5.93	95	540
AWQ int4	4.4	32.0 (-0.4)	6.30 (-0.12)	5.99	175	920
GPTQ int4	4.5	31.8	6.25	6.04	165	870
EXL2 4.5bpw	4.6	32.1	6.32	5.97	245	140
GGUF Q4_K_M	4.6	31.6	6.18	6.04	75 (CPU 22)	n/a
GGUF Q5_K_M	5.4	32.2	6.36	5.96	70 (CPU 18)	n/a
FP8	8.0	32.3 (-0.1)	6.38 (-0.04)	5.95	155	1080
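The table prints deltas only for AWQ and FP8. The remaining deltas, along with the size and speed ratios against bf16, follow from simple arithmetic on the same rows. The sketch below copies the table values verbatim; the names `TABLE` and `deltas_vs_reference` are illustrative and not part of the cookbook.

```python
# Rows copied from the table above:
# (size GB, TR-MMLU, MT-Bench-TR, WikiText PPL, tok/s at batch=1)
TABLE = {
    "bf16": (16.0, 32.4, 6.42, 5.93, 95),
    "AWQ int4": (4.4, 32.0, 6.30, 5.99, 175),
    "GPTQ int4": (4.5, 31.8, 6.25, 6.04, 165),
    "EXL2 4.5bpw": (4.6, 32.1, 6.32, 5.97, 245),
    "GGUF Q4_K_M": (4.6, 31.6, 6.18, 6.04, 75),
    "GGUF Q5_K_M": (5.4, 32.2, 6.36, 5.96, 70),
    "FP8": (8.0, 32.3, 6.38, 5.95, 155),
}


def deltas_vs_reference(table, reference="bf16"):
    """Return per-quant deltas and ratios relative to the reference row."""
    ref_size, ref_mmlu, ref_mt, ref_ppl, ref_tps = table[reference]
    out = {}
    for name, (size, mmlu, mt, ppl, tps) in table.items():
        if name == reference:
            continue
        out[name] = {
            "mmlu_delta": round(mmlu - ref_mmlu, 2),    # negative = quality loss
            "mt_bench_delta": round(mt - ref_mt, 2),    # negative = quality loss
            "ppl_delta": round(ppl - ref_ppl, 2),       # positive = worse perplexity
            "compression": round(ref_size / size, 2),   # how many times smaller
            "speedup_b1": round(tps / ref_tps, 2),      # batch=1 decode speed ratio
        }
    return out


deltas = deltas_vs_reference(TABLE)
for name, row in deltas.items():
    print(f"{name:<12} {row}")
```

Rounding to two decimals keeps the output comparable with the precision used in the table.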

Decision matrix:

Use case	Recommended	Why
High quality + production serving	FP8	minimal quality loss + best batch throughput
Single-user local chat	EXL2 4.5bpw	fastest at batch=1
Multi-user budget API	AWQ int4	smallest footprint + good batch throughput
Mobile / CPU / edge	GGUF Q4_K_M	runs fast on local devices
Test / dev	bf16	reference, fast turnaround
Production, maximum quality	bf16 or FP8	when a 1-point TR-MMLU difference matters
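If the matrix needs to live in a deployment script, it can be encoded as a plain lookup. This is a minimal sketch: the use-case slugs and the `recommend` helper are hypothetical names chosen for illustration, while the recommendations themselves mirror the matrix rows.

```python
# Decision matrix from the rows above, keyed by hypothetical use-case slugs.
DECISION_MATRIX = {
    "high_quality_production_serving": ("FP8", "minimal quality loss + batch throughput"),
    "single_user_local_chat": ("EXL2 4.5bpw", "fastest at batch=1"),
    "multi_user_budget_api": ("AWQ int4", "smallest + good batch throughput"),
    "mobile_cpu_edge": ("GGUF Q4_K_M", "fast on local devices"),
    "test_dev": ("bf16", "reference, fast turnaround"),
    "production_max_quality": ("bf16 or FP8", "a 1-point TR-MMLU difference matters"),
}


def recommend(use_case: str) -> tuple[str, str]:
    """Return (recommended quant, reason) for a known use-case slug."""
    try:
        return DECISION_MATRIX[use_case]
    except KeyError:
        known = ", ".join(sorted(DECISION_MATRIX))
        raise KeyError(f"unknown use case {use_case!r}; expected one of: {known}") from None


quant, reason = recommend("multi_user_budget_api")
print(f"{quant}: {reason}")
```

Failing loudly on an unknown slug is deliberate: a silent default would hide typos in configuration.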

python

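# === Round-trip eval: script provided by the cookbook ===
import gc

import torch
from lm_eval import simple_evaluate

models_to_test = {
    "bf16": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "awq_int4": "llama-3.1-8b-int4-awq",
    "gptq_int4": "llama-3.1-8b-int4-gptq",
    "fp8": "llama-3.1-8b-fp8",
}

# Task names as used in this cookbook. They are assumed to be registered with
# lm-eval (for example as custom task configs); adjust them to your setup.
tasks = ["mmlu_tr", "mt_bench_tr", "wikitext_perplexity"]


def headline_metric(task_result):
    """Return the first numeric, non-stderr metric reported for a task."""
    for key, value in task_result.items():
        if isinstance(value, (int, float)) and "stderr" not in key:
            return float(value)
    return float("nan")


results = {}
for name, path in models_to_test.items():
    print(f"Evaluating {name}...")
    result = simple_evaluate(
        model="hf",
        model_args=f"pretrained={path},dtype=auto",
        tasks=tasks,
        device="cuda",
        batch_size="auto",
    )
    # simple_evaluate nests per-task metrics under the "results" key.
    results[name] = result["results"]
    # Release cached GPU memory before loading the next model.
    gc.collect()
    torch.cuda.empty_cache()

# Print comparison table
print(f"{'Model':<15} {'TR-MMLU':<10} {'MT-Bench-TR':<15} {'WikiText PPL':<15}")
for name, per_task in results.items():
    scores = [headline_metric(per_task[task]) for task in tasks]
    print(f"{name:<15} {scores[0]:<10.2f} {scores[1]:<15.2f} {scores[2]:<15.2f}")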

comprehensive quantization eval
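The niche benchmark (Turkish call-center samples) is not a standard task, so it needs its own scorer. The following is a minimal sketch, assuming the benchmark is a single-label intent task scored by exact match. The samples, the intent labels, and `stub_generate` are hypothetical placeholders; in practice `generate` would wrap each quantized model in turn.

```python
from typing import Callable, Iterable, Tuple


def normalize_tr(text: str) -> str:
    """Lower-case with Turkish dotted/dotless I handling and collapse whitespace."""
    # Python's default lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
    # which is wrong for Turkish, so map both letters explicitly first.
    text = text.replace("İ", "i").replace("I", "ı")
    return " ".join(text.lower().split())


def exact_match_accuracy(
    generate: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of prompts whose normalized output equals the expected label."""
    samples = list(samples)
    if not samples:
        raise ValueError("samples must not be empty")
    hits = sum(
        normalize_tr(generate(prompt)) == normalize_tr(expected)
        for prompt, expected in samples
    )
    return hits / len(samples)


# Hypothetical call-center prompts with hypothetical intent labels.
SAMPLES = [
    ("Faturam neden bu kadar yüksek geldi?", "fatura"),
    ("İnternetim iki gündür çalışmıyor.", "arıza"),
    ("Aboneliğimi iptal etmek istiyorum.", "iptal"),
]


def stub_generate(prompt: str) -> str:
    """Stand-in for a real model call; the third answer is deliberately wrong."""
    canned = {
        SAMPLES[0][0]: "Fatura",
        SAMPLES[1][0]: "ARIZA",
        SAMPLES[2][0]: "fatura",
    }
    return canned[prompt]


accuracy = exact_match_accuracy(stub_generate, SAMPLES)
print(f"niche benchmark accuracy: {accuracy:.2f}")
```

Running the same scorer against every quantized variant with an identical sample set gives a column that is directly comparable with the TR-MMLU and MT-Bench-TR numbers.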

✅ Part X complete

1) Convert your own fine-tuned model into 3-4 different quant formats. 2) Run the same eval. 3) Apply the decision matrix to your own use case. 4) Next Part: Part XI, Alignment & Preference Optimization (DPO, ORPO, KTO, SimPO, GRPO): the math of modern alignment plus RTX 4090 recipes.