Int4 QLoRA NF4 Internals: Double Quantization + Paged Optimizer + Bitsandbytes Source Tour

NF4 (4-bit NormalFloat) is the core of QLoRA: a 4-bit quantization data type designed for normally distributed weights. Double quantization (quantizing the scale tensor as well) saves roughly a further 0.37 bits per parameter. Paged AdamW lets optimizer state overflow into CPU RAM. The lesson closes with a tour of the bitsandbytes source code.

Şükrü Yusuf KAYA

30-minute read

14.05.2026

Advanced


1. NF4: Why 'NormalFloat'?

LLM weights are approximately zero-mean and normally distributed, N(0, σ²). Int4 with uniform spacing over [-7, +7] fits that distribution poorly: it puts too few levels in the dense region around zero and too many in the sparse tails.

The NF4 idea: choose the quantization levels to be quantiles of the normal distribution:

NF4 levels (16 values):
[-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
  0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

Dense in the middle (Gaussian peak)
Sparse toward the ends (Gaussian tails)
Information-theoretically optimal if the weights, once normalized, follow N(0, 1)
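The level table can be rebuilt from standard-normal quantiles. The sketch below is a minimal reconstruction, not the library's code: the helper name `normal_quantile_levels`, the offset `0.9677083`, and the asymmetric 8-positive / 7-negative split are assumptions about the recipe, chosen because they reproduce the 16 values listed above to about three decimals.

```python
from statistics import NormalDist


def normal_quantile_levels(offset=0.9677083):
    """Build 16 quantization levels from standard-normal quantiles.

    `offset` is the largest probability fed to the inverse CDF; it is an
    assumed constant, not something stated in this lesson.
    """
    inv_cdf = NormalDist().inv_cdf

    def linspace(start, stop, num):
        step = (stop - start) / (num - 1)
        return [start + i * step for i in range(num)]

    # 8 positive levels: quantiles at evenly spaced probabilities, dropping p = 0.5
    positive = [inv_cdf(p) for p in linspace(offset, 0.5, 9)[:-1]]
    # 7 negative levels: same construction with one point fewer
    negative = [-inv_cdf(p) for p in linspace(offset, 0.5, 8)[:-1]]
    # Zero gets its own exact code, which gives 8 + 1 + 7 = 16 levels
    levels = sorted(positive + [0.0] + negative)
    top = max(levels)
    return [v / top for v in levels]  # rescale so the range is [-1, 1]


levels = normal_quantile_levels()
print([round(v, 4) for v in levels])
```

Giving zero an exact code is what makes the table asymmetric: one side has to give up a level so that the total stays at 16.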

Result: at the same 4 bits, roughly 4-6% lower perplexity than uniform Int4.
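A quick way to see the effect of level placement is to compare reconstruction error on synthetic Gaussian data. The sketch below is illustrative only: it measures mean squared error rather than perplexity, it uses random N(0, 1) samples in place of real weights, and the helper `block_mse` and the 15-level symmetric Int4 grid are assumptions made for the comparison.

```python
import random

NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
# Symmetric uniform grid: integers -7..7 rescaled to [-1, 1]
INT4 = [i / 7 for i in range(-7, 8)]


def block_mse(levels, blocks):
    """Mean squared error of absmax block quantization with the given levels."""
    total, count = 0.0, 0
    for block in blocks:
        absmax = max(abs(w) for w in block)
        for w in block:
            x = w / absmax                                  # normalize to [-1, 1]
            q = min(levels, key=lambda lv: abs(lv - x))     # nearest level
            total += ((q - x) * absmax) ** 2                # error in original units
            count += 1
    return total / count


rng = random.Random(0)
blocks = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(500)]

mse_nf4 = block_mse(NF4, blocks)
mse_int4 = block_mse(INT4, blocks)
print(f"NF4 MSE:  {mse_nf4:.5f}")
print(f"Int4 MSE: {mse_int4:.5f}")
```

On data like this the quantile-based grid should come out ahead because most samples fall near zero, where its levels are closest together.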

2. Double Quantization

NF4 stores one fp32 scale (the block's absolute maximum) for every 64 weights. For an 8B model:

8B params / 64 group = 125M scale params
125M × 4 byte (fp32) = 500 MB extra

Solution: double quantization. Quantize the scale tensor too!

Quantize the scales block-wise to 8 bits, 256 scales per block
Keep a separate "scale of scales" (fp32) for each block
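The two steps above can be sketched in a few lines of PyTorch. This is a simplified illustration under stated assumptions: it uses plain linear int8 rounding for the second level, which is a swapped-in stand-in and not the 8-bit format bitsandbytes uses, and centering the scales on their mean first is a choice made here. The function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def double_quantize_scales(scales, block_size=256):
    """Quantize fp32 block scales to int8 with one fp32 scale per 256 entries."""
    offset = scales.mean()                 # scales are all positive, so center them
    centered = scales - offset
    n = centered.numel()
    pad = (-n) % block_size                # pad to a whole number of blocks
    blocks = F.pad(centered, (0, pad)).reshape(-1, block_size)
    scale2 = blocks.abs().amax(dim=-1).clamp_min(1e-12)   # "scale of scales", fp32
    q = torch.round(blocks / scale2.unsqueeze(-1) * 127).to(torch.int8)
    return q, scale2, offset, n


def dequantize_scales(q, scale2, offset, n):
    """Recover approximate fp32 scales from the int8 codes."""
    blocks = q.to(torch.float32) / 127 * scale2.unsqueeze(-1)
    return blocks.flatten()[:n] + offset


torch.manual_seed(0)
scales = torch.rand(1000) * 0.1 + 0.01     # stand-in for per-block absmax values
q, scale2, offset, n = double_quantize_scales(scales)
restored = dequantize_scales(q, scale2, offset, n)
print("max abs error:", (restored - scales).abs().max().item())
```

Each restored scale is off by at most half an int8 step of its block, which is why the extra level of quantization costs so little accuracy.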

Total bits per weight (4-bit code plus scale overhead):
  Without double-quant: 4 + 32/64 = 4.5 bit/weight
  With double-quant:    4 + 8/64 + 32/(64×256) ≈ 4.127 bit/weight

Savings: about 0.37 bit/weight, roughly 8% of the quantized model size (about 373 MB for the 8B example), with no observable effect on quality.
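The arithmetic is easy to check in code. The helpers below are hypothetical names introduced for this illustration; the block sizes and bit widths are the ones used in the text.

```python
def bits_per_weight(block_size=64, scale_bits=32, double_quant=False,
                    dq_block_size=256, dq_scale_bits=8, dq_outer_bits=32):
    """Storage cost per weight: 4-bit code plus amortized scale overhead."""
    code_bits = 4.0
    if not double_quant:
        # One fp32 scale shared by every block of 64 weights
        return code_bits + scale_bits / block_size
    # 8-bit scales, plus one fp32 "scale of scales" per 256 scales
    return (code_bits
            + dq_scale_bits / block_size
            + dq_outer_bits / (block_size * dq_block_size))


def model_size_gb(n_params, bpw):
    """Model size in GB (10^9 bytes) at the given bits per weight."""
    return n_params * bpw / 8 / 1e9


plain = bits_per_weight()
double = bits_per_weight(double_quant=True)
print(f"without double-quant: {plain:.3f} bit/weight, {model_size_gb(8e9, plain):.2f} GB")
print(f"with double-quant:    {double:.3f} bit/weight, {model_size_gb(8e9, double):.2f} GB")
```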

python

# === bitsandbytes NF4 internals, step by step ===
import torch
import bitsandbytes as bnb

# 1. NF4 levels (rounded to 4 decimals)
NF4_LEVELS = torch.tensor([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0
], dtype=torch.float32)

# 2. Forward quantize
def nf4_quantize(W, block_size=64):
    """W: bf16 tensor → packed 4-bit NF4 codes (uint8) + one fp32 scale per block."""
    W_flat = W.flatten().float()
    n = W_flat.numel()
    assert n % block_size == 0, "numel must be a multiple of block_size"
    n_blocks = n // block_size

    # Block-wise absolute max (clamped so an all-zero block does not divide by zero)
    W_blocks = W_flat.reshape(n_blocks, block_size)
    scales = W_blocks.abs().max(dim=-1).values.clamp_min(1e-12)   # [n_blocks]

    # Normalize → [-1, 1]
    W_normalized = W_blocks / scales.unsqueeze(-1)

    # Find closest NF4 level for each weight
    levels = NF4_LEVELS.to(W_flat.device)
    distances = torch.abs(W_normalized.unsqueeze(-1) - levels)    # [n_blocks, block_size, 16]
    indices = distances.argmin(dim=-1)                 # [n_blocks, block_size]

    # Pack 2 indices per byte: even positions in the high nibble, odd in the low
    packed = ((indices[..., ::2] << 4) | indices[..., 1::2]).to(torch.uint8)    # [n_blocks, block_size//2]

    return packed, scales

# 3. Dequantize (inference)
def nf4_dequantize(indices_4bit, scales):
    codes = indices_4bit.long()                        # widen first: uint8 tensors index as masks
    high = (codes >> 4) & 0xF
    low = codes & 0xF
    indices = torch.stack([high, low], dim=-1).flatten(-2)
    values = NF4_LEVELS.to(scales.device)[indices]     # [n_blocks, block_size]
    W = values * scales.unsqueeze(-1)
    return W.flatten()                                 # flat; reshape to the original shape at the call site

# 4. bitsandbytes equivalent
linear_4bit = bnb.nn.Linear4bit(
    input_features=4096,
    output_features=4096,
    bias=False,
    compress_statistics=True,                           # double quantization of the scales
    quant_type="nf4",                                  # vs "fp4"
    compute_dtype=torch.bfloat16,                       # forward bf16
    quant_storage=torch.uint8,                          # storage
)
# Internally it does the same thing, only more optimized and with CUDA kernels

NF4 quantize + dequantize internal

3. Paged AdamW: CPU RAM Overflow

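With QLoRA on 70B+ models, optimizer state can still be large enough to exhaust GPU memory. The solution is a paged optimizer:

Optimizer state is allocated in CUDA unified memory, so it can live in either GPU memory or CPU RAM
When the GPU runs out of memory, pages are evicted to CPU RAM; they are paged back in when the optimizer step needs them
Prefetching hides part of the PCIe transfer overhead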
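To judge whether paging is likely to matter for a given run, it helps to estimate the optimizer state size first. The sketch below is back-of-the-envelope arithmetic only: `adam_state_bytes` is a hypothetical helper, it counts just the two Adam moment tensors, and it ignores the small per-block scale overhead that 8-bit optimizers add. The 40M trainable-parameter figure is an illustrative assumption.

```python
def adam_state_bytes(n_trainable, state_bits=32):
    """Adam keeps two tensors (first and second moment) per trainable parameter."""
    return 2 * n_trainable * state_bits // 8


n_trainable = 40_000_000   # assumed adapter size, for illustration only
for bits in (32, 8):
    mb = adam_state_bytes(n_trainable, bits) / 1e6
    print(f"{bits:>2}-bit Adam state: {mb:.0f} MB")
```

Only parameters with `requires_grad=True` carry optimizer state, so the estimate should use the trainable count, not the size of the frozen base model.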

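from bitsandbytes.optim import PagedAdamW8bit

# `model` is assumed to be an already-loaded QLoRA model;
# only the trainable (LoRA) parameters need optimizer state
optimizer = PagedAdamW8bit(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-4,
    weight_decay=0.0,
)
# Automatic: 8-bit state, paged out to CPU RAM when GPU memory runs short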

The Cookbook's rule of thumb (RTX 4090):

1B-8B QLoRA → paged_adamw_8bit by default
14B-32B QLoRA → paged_adamw_8bit is mandatory (otherwise GPU OOM)
70B+ QLoRA → all optimizer state on the CPU (~80 GB RAM required)

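✅ Deliverable

1) Implement the NF4 quantize/dequantize code above by hand. 2) Compare it with bnb.nn.Linear4bit: the dequantized weights should match closely, and can only be bit-exact when double quantization is off (compress_statistics=False) and the level table and rounding match. 3) Next lesson: 10.8, FP8 Inference (vLLM SmoothQuant).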
