Calibration Dataset Engineering: Domain-Aware Quantization — Ideal Set for Your Domain
GPTQ/AWQ quality depends heavily on calibration data. WikiText-2 is the default, but the ideal set varies by production use case: calibrating with TR data for TR production yields up to 30% better TR-MMLU post-quantization. For the code domain, use GitHub Python; for math, GSM8K. The calibration-size sweet spot is 128-512 samples.
Şükrü Yusuf KAYA
24 min read
1. Calibration Dataset Comparison
Llama 3.1 8B, AWQ, 256 calibration samples; post-quantization TR-MMLU and WikiText-2 perplexity:
| Calibration | TR-MMLU | WikiText-2 PPL | GSM8K | Domain match |
|---|---|---|---|---|
| WikiText-2 (default, EN) | 32.0 | 5.95 | 84.1 | EN web |
| C4 multilingual | 32.3 | 6.05 | 84.0 | multi-lang |
| OASST-TR (TR chat) | 33.4 | 6.30 | 80.5 | TR chat |
| GSM8K (math) | 31.0 | 7.20 | 86.8 | math |
| GitHub Python | 30.5 | 6.50 | 83.0 | code |
| Production prompts (in-domain) | 34.1 | 5.98 | 85.0 | match |
Takeaway: use a calibration set from the domain whose metric matters most to you.
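For context, the WikiText-2 baseline row is typically built by concatenating raw corpus text and slicing it into fixed-length token windows. A minimal sketch of that chunking step (the helper name and window size are illustrative, not a library API):

```python
def make_calibration_windows(token_ids, n_samples=256, seqlen=2048):
    """Slice a long token stream into fixed-length calibration windows,
    as is common for WikiText-2-style calibration sets. Stops early if
    the stream runs out of full windows."""
    windows = []
    for i in range(n_samples):
        start = i * seqlen
        chunk = token_ids[start:start + seqlen]
        if len(chunk) < seqlen:
            break
        windows.append(chunk)
    return windows
```

In-domain calibration (next section) replaces this generic chunking with whole user prompts, which is why it matches production behavior better.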
The cookbook's rule: in production, sample 200-500 examples from real user prompts (anonymized) and build an in-domain calibration set from them.
```python
# === Preparing an in-domain calibration set ===
import torch
from transformers import AutoTokenizer

# 1. Collect samples from production logs (user prompts, anonymized)
production_prompts = load_production_logs(n=500, anonymize=True)

# 2. Filter: quality + length + language
filtered = [p for p in production_prompts if 50 < len(p) < 1000 and is_turkish(p)]

# 3. Tokenize and format for calibration
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
calibration = []
for p in filtered[:256]:
    inputs = tok(p, return_tensors="pt", max_length=2048, truncation=True)
    calibration.append({
        "input_ids": inputs["input_ids"][0],
        "attention_mask": inputs["attention_mask"][0],
    })

# 4. Pass this set to AWQ or GPTQ
model.quantize(tok, calibration_data=calibration)
```

In-domain calibration set from production logs.
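The `is_turkish(p)` filter above is left abstract; a minimal stand-in is a character-frequency heuristic (purely illustrative; a real pipeline would use a language-ID model such as fastText):

```python
# Turkish-specific letters; their presence is a cheap language signal.
TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")

def is_turkish(text: str, min_ratio: float = 0.01) -> bool:
    """Heuristic: treat text as Turkish if Turkish-specific letters
    make up at least min_ratio of its characters."""
    if not text:
        return False
    hits = sum(1 for ch in text if ch in TURKISH_CHARS)
    return hits / len(text) >= min_ratio
```

This will miss short Turkish strings that happen to use only ASCII letters, which is acceptable for filtering a large prompt pool.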
2. Calibration Size Sweet Spot
| N samples | Quantization time (8B AWQ) | Quality (TR-MMLU) |
|---|---|---|
| 32 | 4 min | 31.8 |
| 64 | 5 min | 32.0 |
| 128 | 7 min | 32.2 |
| 256 | 9 min | 32.3 |
| 512 | 13 min | 32.3 |
| 1024 | 22 min | 32.3 |
| 2048 | 40 min | 32.3 |
Quality gains become marginal beyond 256 samples. Cookbook default: 256 samples.
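To reproduce a sweep like the table above, draw each size as a deterministic subsample of one prompt pool so runs stay comparable. A small sketch (the function name is ours, not a library API):

```python
import random

def subsample_calibration(pool, n, seed=42):
    """Draw n calibration prompts from a fixed pool, deterministically,
    so that 32/64/.../2048-sample quantization runs are reproducible."""
    rng = random.Random(seed)
    if n >= len(pool):
        return list(pool)
    return rng.sample(pool, n)

# Hypothetical sweep driver (quantize/eval calls omitted):
# for n in [32, 64, 128, 256, 512, 1024, 2048]:
#     calib = subsample_calibration(all_prompts, n)
#     ... quantize with calib, then eval TR-MMLU ...
```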
✅ Deliverables
1) Prepare a 256-sample TR in-domain calibration set. 2) Quantize the same model with both WikiText-2 and the TR calibration set, then compare on TR-MMLU. 3) Next lesson: 10.10 — Round-trip Eval: Pre/Post Quant Table.