
Round-trip Eval: Pre/Post Quant Table — TR-MMLU + MT-Bench + Niche Benchmark

Part X capstone: quantize the same model to bf16, AWQ int4, GPTQ int4, EXL2 4.5bpw, GGUF Q4_K_M, and FP8, then compare on TR-MMLU, MT-Bench-TR, and a niche custom benchmark (Turkish call-center sample). Decision matrix: which quant for your use case?

Şükrü Yusuf KAYA
30 min read
Advanced

1. Llama 3.1 8B-Instruct Comprehensive Quant Table

Setup: RTX 4090, 256-sample Turkish calibration set.
| Quant | Size (GB) | TR-MMLU | MT-Bench-TR | WikiText PPL | Tok/s (batch=1) | Tok/s (batch=16) |
|---|---|---|---|---|---|---|
| bf16 (reference) | 16.0 | 32.4 | 6.42 | 5.93 | 95 | 540 |
| AWQ int4 | 4.4 | 32.0 (-0.4) | 6.30 (-0.12) | 5.99 | 175 | 920 |
| GPTQ int4 | 4.5 | 31.8 | 6.25 | 6.04 | 165 | 870 |
| EXL2 4.5bpw | 4.6 | 32.1 | 6.32 | 5.97 | 245 | 140 |
| GGUF Q4_K_M | 4.6 | 31.6 | 6.18 | 6.04 | 75 (CPU: 22) | n/a |
| GGUF Q5_K_M | 5.4 | 32.2 | 6.36 | 5.96 | 70 (CPU: 18) | n/a |
| FP8 | 8.0 | 32.3 (-0.1) | 6.38 (-0.04) | 5.95 | 155 | 1080 |
Decision matrix:

| Use case | Recommended | Why |
|---|---|---|
| High quality + production serving | FP8 | minimal quality loss + batch throughput |
| Single-user local chat | EXL2 4.5bpw | fastest at batch=1 |
| Multi-user budget API | AWQ int4 | smallest + good batch throughput |
| Mobile / CPU / edge | GGUF Q4_K_M | fast on local devices |
| Test / dev | bf16 | reference, fast iteration |
| Production, max quality | bf16 or FP8 | a 1-point TR-MMLU gap matters |
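The matrix can also be derived mechanically: min-max normalize quality, speed, and size across the candidates and weight them per use case. A sketch under illustrative assumptions (the weighting scheme is mine, not from the post; the numbers come from the benchmark table above):

```python
# Rank quants by a weighted score over quality (TR-MMLU), speed
# (tok/s @ batch=1) and size (GB); numbers from the table above,
# the scoring scheme itself is illustrative.
QUANTS = {
    # name: (tr_mmlu, tok_s_batch1, size_gb)
    "bf16":        (32.4,  95, 16.0),
    "awq_int4":    (32.0, 175,  4.4),
    "gptq_int4":   (31.8, 165,  4.5),
    "exl2_4.5bpw": (32.1, 245,  4.6),
    "fp8":         (32.3, 155,  8.0),
}

def rank(w_quality, w_speed, w_size):
    # one sorted list per column: quality, speed, size
    qs, ss, gs = (sorted(col) for col in zip(*QUANTS.values()))
    def norm(x, col):  # min-max normalize to [0, 1]
        return (x - col[0]) / (col[-1] - col[0])
    def score(row):
        q, s, g = row
        return (w_quality * norm(q, qs)
                + w_speed * norm(s, ss)
                + w_size * (1 - norm(g, gs)))  # smaller size is better
    return sorted(QUANTS, key=lambda k: score(QUANTS[k]), reverse=True)

# Single-user local chat: speed dominates -> EXL2 comes out on top,
# matching the matrix row above.
print(rank(w_quality=0.3, w_speed=0.6, w_size=0.1))
```

Tweaking the weights reproduces the other rows too: put all the weight on quality and bf16 wins, weight size and batch throughput and the int4 quants take over.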
```python
# === Round-trip eval — script provided by the cookbook ===
from lm_eval import simple_evaluate

models_to_test = {
    "bf16": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "awq_int4": "llama-3.1-8b-int4-awq",
    "gptq_int4": "llama-3.1-8b-int4-gptq",
    "fp8": "llama-3.1-8b-fp8",
}

tasks = ["mmlu_tr", "mt_bench_tr", "wikitext_perplexity"]

results = {}
for name, path in models_to_test.items():
    print(f"Evaluating {name}...")
    result = simple_evaluate(
        model="hf",
        model_args=f"pretrained={path},dtype=auto",
        tasks=tasks,
        device="cuda",
        batch_size="auto",
    )
    # simple_evaluate nests per-task metrics under "results"; the metric
    # key ("acc,none", "word_perplexity,none", ...) depends on the task
    # definition, so adjust these keys to your custom TR task YAMLs.
    results[name] = result["results"]

# Print the comparison table
print(f"{'Model':<15} {'TR-MMLU':<10} {'MT-Bench-TR':<15} {'WikiText PPL':<15}")
for name, r in results.items():
    print(
        f"{name:<15} "
        f"{r['mmlu_tr']['acc,none']:<10.2f} "
        f"{r['mt_bench_tr']['score,none']:<15.2f} "
        f"{r['wikitext_perplexity']['word_perplexity,none']:<15.2f}"
    )
```
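The tok/s columns in the table come from a separate timing pass, not from the lm-eval run: throughput is just generated-token count over wall-clock time. A backend-agnostic sketch (the `generate_fn` interface and the stub backend are hypothetical, for illustration; in practice you would wrap `model.generate` and count the new token ids):

```python
import time

def tokens_per_sec(generate_fn, prompt, max_new_tokens=128):
    """Wall-clock throughput of one generation call.

    generate_fn(prompt, max_new_tokens) is a hypothetical interface:
    any callable that generates and returns the number of new tokens.
    """
    start = time.perf_counter()
    n_tokens = generate_fn(prompt, max_new_tokens)
    return n_tokens / (time.perf_counter() - start)

# Stub backend for illustration: pretends each token costs 10 ms.
def fake_generate(prompt, max_new_tokens):
    for _ in range(max_new_tokens):
        time.sleep(0.01)
    return max_new_tokens

rate = tokens_per_sec(fake_generate, "Merhaba", max_new_tokens=20)
print(f"{rate:.0f} tok/s")  # roughly 100 with the 10 ms stub
```

For honest numbers, warm up with one untimed generation first (CUDA kernel compilation and cache allocation otherwise pollute the first measurement) and average over several prompts.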
✅ Part X complete
  1) Convert your own fine-tuned model to 3-4 different quants. 2) Run the same eval. 3) Apply the decision matrix to your own use case. 4) Next up: Part XI — Alignment & Preference Optimization (DPO, ORPO, KTO, SimPO, GRPO): the math behind modern alignment + RTX 4090 recipes.
