EXL2 (ExLlamaV2): Variable Bitrate Quantization — Which Layer at Which Bit?
EXL2 is ExLlamaV2's native format. It uses a different bit-width per layer: sensitive layers get more bits. Layer sensitivity is measured via calibration, and bits are then allocated optimally within a target budget. The result is the fastest single-user LLM inference on an RTX 4090 (1.5-2x vs vLLM at batch=1).
Şükrü Yusuf KAYA
28 min read
## 1. The EXL2 Idea
GPTQ/AWQ: every layer gets the same bit-width (e.g., 4-bit).

EXL2: the bit-width differs per layer; sensitive layers get more bits, tolerant layers get fewer.

The pipeline:

1. Set a target average, e.g. 4.0 bits/weight.
2. Measure each layer's sensitivity on calibration data.
3. Pick the allocation that minimizes quantization noise, i.e. maximizes SQNR (signal-to-quantization-noise ratio), within the budget. For example:
   - q_proj: 5.5 bits (sensitive)
   - down_proj: 3.5 bits (tolerant)
   - lm_head: 6.0 bits (critical)

Result: better quality at the same disk size, or the same quality from a smaller model.
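To make the allocation step concrete, here is a toy greedy allocator. This is not ExLlamaV2's actual optimizer: the layer sensitivities and the 4^-bits error model are illustrative assumptions, and the helpers `err` and `allocate` are ours. Each layer starts at a floor bit-width, and each 0.25-bit increment goes to whichever layer's modeled error would drop the most, until the average hits the budget:

```python
import heapq

# Illustrative per-layer sensitivities (in practice, measured on calibration data).
sensitivity = {"q_proj": 8.0, "k_proj": 4.0, "v_proj": 6.0,
               "o_proj": 3.0, "up_proj": 2.0, "down_proj": 1.5, "lm_head": 12.0}

def err(layer: str, bits: float) -> float:
    # Toy error model: quantization noise falls ~6 dB (4x) per extra bit.
    return sensitivity[layer] * 4.0 ** (-bits)

def allocate(avg_budget: float, lo: float = 2.0, hi: float = 8.0,
             step: float = 0.25) -> dict:
    bits = {name: lo for name in sensitivity}
    budget_steps = round((avg_budget - lo) / step) * len(bits)  # increments to hand out
    # Max-heap keyed on the error reduction from giving a layer one more increment.
    heap = [(-(err(n, lo) - err(n, lo + step)), n) for n in bits]
    heapq.heapify(heap)
    for _ in range(budget_steps):
        if not heap:
            break
        _, name = heapq.heappop(heap)
        bits[name] += step
        b = bits[name]
        if b + step <= hi:  # layer can still take another increment
            heapq.heappush(heap, (-(err(name, b) - err(name, b + step)), name))
    return bits

print(allocate(4.0))  # sensitive layers (lm_head, q_proj) end up well above 4 bits
```

Because the toy error model is convex and decreasing in bits, this greedy marginal-gain rule is optimal for the separable objective; the real measurement phase replaces the toy model with calibration-based error for each candidate quantization of each layer.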
```bash
# === EXL2 conversion + inference with ExLlamaV2 ===
# Clone exllamav2
git clone https://github.com/turboderp/exllamav2
cd exllamav2

# Convert HF -> EXL2
#   -b  : target average bits per weight
#   -hb : bits for the output (lm_head) layer
#   -c  : calibration dataset (parquet)
#   -ml : measurement length, -mr : measurement rows (see doc/convert.md)
python convert.py \
    -i /path/to/llama-3.1-8b-merged \
    -o /path/to/llama-3.1-8b-exl2-4.5bpw \
    -c /path/to/calibration_data.parquet \
    -b 4.5 \
    -hb 6 \
    -ml 4096 \
    -mr 5

# Test inference
python test_inference.py \
    -m /path/to/llama-3.1-8b-exl2-4.5bpw \
    -p "What is the capital of Turkey?"
```
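For scripted use, the converted model can also be loaded through ExLlamaV2's Python API. A minimal sketch following the patterns in the repo's examples; class names and the lazy-cache/autosplit loading pattern may shift between versions, and the model path is illustrative:

```python
# Minimal single-user generation sketch with the ExLlamaV2 Python API.
# Verify class names against your installed exllamav2 version.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/llama-3.1-8b-exl2-4.5bpw"  # illustrative path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate KV cache as layers load
model.load_autosplit(cache)               # split across available GPUs if needed
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

print(generator.generate_simple("What is the capital of Turkey?", settings, 128))
```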
## 2. RTX 4090 Inference Throughput
| Format | Batch=1 tok/s | Batch=4 tok/s | Notes |
|---|---|---|---|
| bf16 (Transformers) | 35 | 85 | baseline |
| bf16 (vLLM) | 95 | 180 | lower scheduling overhead |
| GGUF Q4_K_M (llama.cpp) | 75 | n/a (mostly single-user) | CPU-friendly |
| GPTQ int4 (vLLM) | 165 | 320 | shared kernels |
| AWQ int4 (vLLM) | 175 | 340 | most common in production |
| EXL2 4.5bpw (ExLlamaV2) | 245 | 140 | fastest at batch=1 |
Pattern: EXL2 is the fastest at batch=1 (single user) because its kernels are tuned for RTX 40x0 (Ada). For batched, multi-user production, vLLM with AWQ is more efficient.
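To reproduce batch=1 numbers like these, it is enough to time a fixed-length generation. A minimal sketch: the helper `measure_tps` and the token counts are our own, and the commented usage line assumes the `generator` built in the earlier Python sketch:

```python
import time
from typing import Callable

def measure_tps(generate: Callable[[int], None], new_tokens: int = 256) -> float:
    """Time one batch=1 run of generate(n) and return tokens per second.

    Keep the prompt short so prompt processing does not distort the
    decode-throughput figure.
    """
    generate(8)  # warmup run: excludes one-time kernel/allocation setup cost
    t0 = time.perf_counter()
    generate(new_tokens)
    return new_tokens / (time.perf_counter() - t0)

# Usage with the ExLlamaV2 generator from the earlier sketch (hypothetical):
# tps = measure_tps(lambda n: generator.generate_simple("Hi", settings, n))
# print(f"{tps:.1f} tok/s")
```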
The cookbook's rule of thumb:
- Single-user local chat → ExLlamaV2 + EXL2
- Multi-user / production API → vLLM + AWQ
- CPU/edge → llama.cpp + GGUF Q4_K_M
✅ Deliverables
1. Convert Llama 3.1 8B to EXL2 4.5bpw.
2. Measure single-user tok/s with ExLlamaV2.
3. Next lesson: 10.6, FP8 Training (native on the H100; premature on the RTX 4090).