GGUF K-Quants Block Structure: Q2_K → Q8_K + llama-quantize Perplexity Table

GGUF is llama.cpp's native format and is widely used for CPU and edge inference. This lesson covers the K-quants block structure (Q2_K → Q8_K), the separate struct behind each bit width, conversion with llama-quantize, and the perplexity-vs-size curve. On an RTX 4090 the bf16 → Q4_K_M conversion takes about 5 minutes, and the resulting 4.6 GB Q4 GGUF can be deployed to a CPU, a Raspberry Pi, or an iPhone.

Şükrü Yusuf KAYA

32-minute read

14.05.2026

Advanced

1. K-Quants Table

Quant	Bits/weight	Sub-block size (weights)	Llama 8B size	PPL delta (WikiText-2)	Use case
Q2_K	2.5	16	3.2 GB	+8-12%	mobile, edge
Q3_K_S	3.4	16	4.0 GB	+4-6%	edge
Q3_K_M	3.9	16	4.1 GB	+3-5%	balanced
Q3_K_L	4.3	16	4.3 GB	+2-4%	quality
Q4_K_S	4.5	32	4.5 GB	+1-3%	balanced
Q4_K_M	4.85	32	4.6 GB	+0.8-2%	cookbook default
Q5_K_S	5.5	32	5.3 GB	+0.4-1%	high quality
Q5_K_M	5.7	32	5.4 GB	+0.3-0.7%	very high quality
Q6_K	6.6	16	6.2 GB	+0.1-0.3%	near-lossless
Q8_K	8.5	16	8.5 GB	<0.1%	~bf16 equivalent

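Cookbook rule: Q4_K_M is the sweet spot for mobile/edge. Choose Q5_K_M when quality is critical, and Q8_K for test/dev. Note that in current llama.cpp, Q8_K is an internal type used for intermediate dot products; the 8-bit type that llama-quantize accepts is Q8_0, which is what the 8.5 bits/weight row corresponds to.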
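The rule above can be turned into a small lookup. The following is a minimal Python sketch, not part of llama.cpp: it copies the Llama 8B file sizes from the table and picks the largest quant that fits a memory budget. The function name `pick_quant` and the `headroom_gb` parameter (an assumed 1 GB default reserved for KV cache and runtime buffers) are hypothetical.

```python
# File sizes in GB for Llama 8B, copied from the table above.
QUANT_SIZES_GB = {
    "Q2_K": 3.2, "Q3_K_S": 4.0, "Q3_K_M": 4.1, "Q3_K_L": 4.3,
    "Q4_K_S": 4.5, "Q4_K_M": 4.6, "Q5_K_S": 5.3, "Q5_K_M": 5.4,
    "Q6_K": 6.2, "Q8_K": 8.5,
}


def pick_quant(memory_gb, headroom_gb=1.0):
    """Return the largest quant whose file fits in memory_gb minus headroom.

    headroom_gb is an assumed allowance for KV cache and runtime buffers;
    tune it for your context length. Returns None if nothing fits.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items()
               if size <= budget]
    return max(fitting)[1] if fitting else None
```

With the default headroom, `pick_quant(6.0)` returns `"Q4_K_M"` and `pick_quant(4.0)` returns `None`, which is a signal to look at a smaller model rather than a lower quant.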

2. K-Quants Block Structure

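Q4_K_M

A Q4_K_M file stores most of its tensors in Q4_K super-blocks (256 weights in 144 bytes, i.e. 4.5 bits per weight):

// Simplified from llama.cpp's block_q4_K; QK_K = 256
typedef struct {
    ggml_half d;                  // fp16 super-block scale for the quantized sub-block scales
    ggml_half dmin;               // fp16 super-block scale for the quantized sub-block mins
    uint8_t scales[12];           // 12 bytes: 8 scales + 8 mins, 6 bits each (16 × 6 = 96 bits)
    uint8_t qs[QK_K / 2];         // 256/2 = 128 bytes: 4-bit quants, two per byte
} block_q4_K;

Super-block: 256 weights
Within a super-block: 8 sub-blocks × 32 weights
Per sub-block: a 6-bit scale and a 6-bit min (8 scales + 8 mins = 16 × 6 bits = 96 bits = 12 bytes)
Per weight: a 4-bit value, dequantized with its sub-block's scale and min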
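The byte count of the struct can be checked mechanically. The sketch below mirrors the Q4_K layout above with Python's `ctypes`; the class name `BlockQ4K` is hypothetical, and the two fp16 fields are held as `uint16` because only their size matters here.

```python
import ctypes

QK_K = 256  # weights per super-block


class BlockQ4K(ctypes.Structure):
    """Byte-layout mirror of the Q4_K struct (fp16 fields held as uint16)."""
    _fields_ = [
        ("d", ctypes.c_uint16),                # fp16 super-block scale
        ("dmin", ctypes.c_uint16),             # fp16 super-block scale for mins
        ("scales", ctypes.c_uint8 * 12),       # 8 scales + 8 mins, 6 bits each
        ("qs", ctypes.c_uint8 * (QK_K // 2)),  # 4-bit quants, two per byte
    ]


block_bytes = ctypes.sizeof(BlockQ4K)       # 2 + 2 + 12 + 128
bits_per_weight = block_bytes * 8 / QK_K
print(block_bytes, bits_per_weight)         # 144 4.5
```

The 4.85 bits/weight listed for Q4_K_M in the table is higher than 4.5 because the _M mix keeps some tensors at a higher-precision type.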

Anatomy: GGUF K-quants are hierarchical: a super-block delta, then sub-block scales, then the weights themselves. Because of this hierarchy the scale is set per 256 weights (and refined per sub-block) rather than per tensor, which preserves dynamic range.
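To make the two-level hierarchy concrete, here is a simplified round-trip in pure Python. It is a sketch under stated assumptions, not llama.cpp's implementation: it uses a plain min/max fit instead of llama.cpp's error-minimizing search, keeps `d` and `dmin` as Python floats rather than fp16, and leaves the 6-bit and 4-bit codes unpacked. The function names are hypothetical. The reconstruction rule is w ≈ d · scale_j · q − dmin · min_j for a weight in sub-block j.

```python
import math

QK_K = 256  # weights per super-block
SUB = 32    # weights per sub-block in Q4_K


def quantize_q4k_like(weights):
    """Quantize 256 floats into (d, dmin, scales, mins, quants)."""
    assert len(weights) == QK_K
    sub_scales, sub_mins = [], []
    for i in range(0, QK_K, SUB):
        block = weights[i:i + SUB]
        lo = min(min(block), 0.0)            # offset stored as a non-negative "min"
        hi = max(block)
        sub_scales.append((hi - lo) / 15.0)  # 4-bit codes span 0..15
        sub_mins.append(-lo)
    # Second level: 6-bit codes (0..63) for the 8 scales and the 8 mins.
    d = max(sub_scales) / 63.0 or 1.0
    dmin = max(sub_mins) / 63.0 or 1.0
    scales = [round(s / d) for s in sub_scales]
    mins = [round(m / dmin) for m in sub_mins]
    quants = []
    for i, w in enumerate(weights):
        j = i // SUB
        step = d * scales[j]                 # effective sub-block scale
        q = round((w + dmin * mins[j]) / step) if step > 0 else 0
        quants.append(max(0, min(15, q)))    # clamp to the 4-bit range
    return d, dmin, scales, mins, quants


def dequantize_q4k_like(d, dmin, scales, mins, quants):
    """Reconstruct floats: w ~ d * scale_j * q - dmin * min_j."""
    return [d * scales[i // SUB] * q - dmin * mins[i // SUB]
            for i, q in enumerate(quants)]
```

A quick check is to quantize a smooth test signal such as `[math.sin(0.37 * i) for i in range(256)]`, dequantize it, and compare the maximum absolute error with half of one sub-block step.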

bash

# === bf16 → GGUF → Q4_K_M conversion ===
# 1. Merge the LoRA adapter (only if the model was fine-tuned with one)
python -c "
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_id = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype='bfloat16')
m = PeftModel.from_pretrained(base, 'llama-3.1-8b-tr-instruct/final')
m.merge_and_unload().save_pretrained('llama-3.1-8b-tr-merged')
# The converter also needs tokenizer files in the merged directory
# (assumes the fine-tune did not change the tokenizer)
AutoTokenizer.from_pretrained(base_id).save_pretrained('llama-3.1-8b-tr-merged')
"

# 2. HF → GGUF (llama.cpp)
cd llama.cpp
python convert_hf_to_gguf.py ../llama-3.1-8b-tr-merged \
    --outtype f16 --outfile llama-8b-tr.fp16.gguf

# 3. Quantize to Q4_K_M
# (CMake builds place the binaries under build/bin/; adjust the paths if needed)
./llama-quantize llama-8b-tr.fp16.gguf llama-8b-tr.Q4_K_M.gguf Q4_K_M
# Output: 4.6 GB (-71% vs bf16)

# 4. Test
./llama-cli -m llama-8b-tr.Q4_K_M.gguf -p "İstanbul'un nüfusu nedir?" -n 200

# 5. Measure perplexity (WikiText-2)
./llama-perplexity -m llama-8b-tr.Q4_K_M.gguf -f wikitext-2.test.txt -c 2048

# Typical results (Llama 3.1 8B-Instruct):
# bf16 baseline: PPL = 5.93
# Q4_K_M:        PPL = 6.04  (+1.9%)
# Q5_K_M:        PPL = 5.96  (+0.5%)
# Q8_K:          PPL = 5.93  (~0%)

bf16 → GGUF Q4_K_M conversion pipeline
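The percentages quoted in the pipeline comments follow from simple relative-change arithmetic. The short Python sketch below reproduces them from the reported PPL values; the helper name `pct_change` is hypothetical, and the 16 GB bf16 size is an assumption (roughly 8B parameters at 2 bytes each).

```python
def pct_change(new, base):
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


BASE_PPL = 5.93  # bf16 baseline reported above
for name, ppl in [("Q4_K_M", 6.04), ("Q5_K_M", 5.96), ("Q8_K", 5.93)]:
    print(f"{name}: {pct_change(ppl, BASE_PPL):+.1f}%")

# Assumed bf16 size: about 8B parameters x 2 bytes = 16 GB.
print(f"size: {pct_change(4.6, 16.0):+.0f}%")
```

This prints +1.9%, +0.5%, and +0.0% for the three quants and -71% for the size change, matching the figures in the comments.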

3. CPU Inference Bench

Llama 3.1 8B Q4_K_M (4.6 GB):

Device	tok/s
Apple M2 Pro (10-core)	28-35
Apple M3 Max (40-core GPU)	65-75
Ryzen 7950X (16-core, 64GB DDR5)	18-22
iPhone 15 Pro (A17)	11-14
Pixel 8 Pro	7-10
Raspberry Pi 5 (8GB)	2-3

Verdict: Q4_K_M is the sweet spot for CPU inference. Going lower costs quality; going higher costs speed.
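To see what these throughput numbers mean in practice, the sketch below converts them into wall-clock time for a 200-token reply, the same length as the `-n 200` test run. It is illustrative only: the ranges are copied from the table, prompt processing time is ignored, and the function name `generation_seconds` is hypothetical.

```python
# Decode throughput ranges in tok/s, copied from the table above.
TOK_PER_S = {
    "Apple M2 Pro": (28, 35),
    "Apple M3 Max": (65, 75),
    "Ryzen 7950X": (18, 22),
    "iPhone 15 Pro": (11, 14),
    "Pixel 8 Pro": (7, 10),
    "Raspberry Pi 5": (2, 3),
}


def generation_seconds(device, n_tokens=200):
    """Return (best, worst) seconds to generate n_tokens on the device."""
    slow, fast = TOK_PER_S[device]
    return n_tokens / fast, n_tokens / slow
```

By this estimate a Raspberry Pi 5 needs roughly 67 to 100 seconds for the reply, while an M3 Max needs about 3 seconds.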

✅ Deliverable

1) Convert your own fine-tuned model to Q4_K_M. 2) Measure PPL with llama-perplexity. 3) Test it with llama-cli on your own CPU/laptop. 4) Next lesson: 10.5, EXL2 Variable Bitrate.
