LoRA + QLoRA: The Parameter-Efficient Fine-Tuning Revolution — From Hu 2021 to Dettmers 2023
LoRA (Hu 2021): low-rank decomposition fine-tuning — base weights frozen, only a small adapter is trained. ~1% of the parameters, 95%+ quality preservation. QLoRA (Dettmers 2023): 4-bit base + LoRA, fine-tune a 70B model on a consumer GPU. NF4 quantization, paged optimizer. Turkish practice: a production Turkish Llama-3 70B for ~$500.
Şükrü Yusuf KAYA
75-minute read
Advanced · 🎨 LoRA — The Democratization of Fine-Tuning
A full fine-tune of Llama-3-70B needs 8× H100 (~$25K of compute); with QLoRA it drops to a single GPU and ~$500. Turkish fine-tuning, democratized. After 75 minutes you will have a grasp of LoRA's mathematical anatomy, QLoRA's quantization tricks, and a production fine-tuning setup.
Lesson Map (10 Sections)#
- Full fine-tune cost — why LoRA is needed
- LoRA math — rank decomposition
- Adapter matrices A, B — mathematical anatomy
- Hu 2021 paper — empirical findings
- PEFT library — HuggingFace implementation
- QLoRA (Dettmers 2023) — 4-bit quantization
- NF4 + double quant + paged optim — the tricks
- Production setup — Llama-3 + LoRA training
- Inference — LoRA merge or hot-swap
- Turkish practice — a production Turkish model for $500
2-5. LoRA Math#
2.1 Insight#
In fine-tuning, the weight update is ΔW = W_new − W_pretrained. Hu 2021's hypothesis: ΔW can be low-rank — the task-specific information lives in a low-dimensional subspace.
Formally: ΔW ≈ B A, where:
- B: [d_out, r]
- A: [r, d_in]
- r << min(d_in, d_out)
ΔW has shape [d_out, d_in] → d_out × d_in params. B and A together: d_out × r + r × d_in = r × (d_in + d_out) params — far fewer when r is small.
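The compression is easy to verify with a few lines of arithmetic (a sketch; 4096 matches Llama-3's hidden size, everything else is generic):

```python
def lora_param_ratio(d_out: int, d_in: int, r: int) -> float:
    """Ratio of LoRA adapter params (B: [d_out, r], A: [r, d_in])
    to the full delta-W matrix [d_out, d_in]."""
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return lora / full

# A square 4096x4096 projection (Llama-3 hidden size):
for r in (4, 16, 64):
    print(r, f"{lora_param_ratio(4096, 4096, r):.2%}")
# -> 4 0.20%
# -> 16 0.78%
# -> 64 3.12%
```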
2.2 Forward pass#
LoRA-augmented forward:
output = W_pretrained × x + (B A) × x # a scaling factor α is added in 2.4
W_pretrained frozen (no gradient). B, A trainable.
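The forward pass above as a dependency-free sketch (toy 4×4 identity weight; real LoRA initializes A with small Gaussian noise and B with zeros, so training starts exactly at the pretrained model):

```python
def matmul(M, N):
    """Naive matrix product of nested lists."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def matvec(M, v):
    """Naive matrix-vector product."""
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]

d_in, d_out, r = 4, 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen base
A = [[0.1, 0.2, 0.3, 0.4]]            # trainable, [r, d_in]
B = [[0.0] for _ in range(d_out)]     # trainable, zero init => B A = 0 at start
x = [1.0, 2.0, 3.0, 4.0]

delta = matmul(B, A)                  # [d_out, d_in] low-rank update
out = [wx + dx for wx, dx in zip(matvec(W, x), matvec(delta, x))]
assert out == matvec(W, x)            # B = 0 => model starts at pretrained weights
```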
2.3 Rank choice#
- r = 4: very compact, ~0.1% of params, OK quality
- r = 16-32: typical, 0.5-1% of params, great quality
- r = 64+: larger, 2-5% of params, near-full-FT quality
2.4 α (alpha)#
LoRA scaling factor:
output = W_pretrained × x + (α/r) × B A × x
α/r normalizes magnitudes across different ranks. Typical α = 16 or 32.
2.5 LoRA params count#
Llama-3-8B linear layers ~6B params (excluding embeddings):
- Full fine-tune: 6B params trainable
- LoRA r=16: ~25M params (~0.4%)
- LoRA r=64: ~100M params (~1.7%)
2.6 Hu 2021 empirical#
GPT-3 175B fine-tune:
- Full FT: 175B params, 24× A100 days
- LoRA r=8: 17.5M params, 1/10 compute
- Quality: comparable (downstream tasks within 1%)
Later work: r=16 typical, quality near full-FT.
2.7 PyTorch + PEFT#
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 25M | all params: 8B | trainable%: 0.31
```
2.8 LoRA + transformer#
Which layers should LoRA be applied to? Common choices:
- Attention Q, K, V projections (most impact)
- Attention output projection
- FFN gates (sometimes)
- NOT applied: embeddings, layernorms, biases
Flexible: the target layers are chosen via the target_modules parameter.
6-9. QLoRA + Production#
6.1 Dettmers 2023 QLoRA#
May 2023: 'QLoRA: Efficient Finetuning of Quantized LLMs'.
The combination:
- Base model: 4-bit quantized (frozen, no gradients)
- LoRA adapter: 16-bit (trainable)
- Dramatic memory savings
6.2 NF4 (4-bit NormalFloat)#
The 4-bit format Dettmers designed: 16 values in [-1, 1], optimized for a normal distribution.
Vs INT4: uniform INT4 spaces its levels evenly and wastes resolution in the tails; NF4 places its levels at quantiles of a normal distribution, which matches how neural network weights are actually distributed.
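The idea can be illustrated with the standard library's NormalDist (a rough sketch of quantile-based levels; not the exact bitsandbytes construction, which is asymmetric so that zero is represented exactly):

```python
from statistics import NormalDist

def normal_float_levels(bits: int = 4) -> list:
    """Quantization levels at equally probable quantiles of N(0, 1),
    rescaled to [-1, 1] -- dense near 0, sparse in the tails."""
    n = 2 ** bits
    nd = NormalDist()
    # half-step offset avoids the infinite 0th/100th percentiles
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

levels = normal_float_levels()
assert len(levels) == 16
# spacing is tighter near zero than at the extremes, unlike uniform INT4
assert (levels[8] - levels[7]) < (levels[1] - levels[0])
```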
6.3 Double quantization#
An additional optimization: the per-block quantization constants are themselves quantized (to FP8, with a second level of FP32 constants).
Saves roughly 0.37 bits per parameter on average — about 3 GB on a 70B model.
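Using the block sizes reported in the QLoRA paper (64 weights per quantization block, 256 constants per second-level block), the per-parameter saving is straightforward arithmetic:

```python
BLOCK = 64          # params per first-level quantization block
BLOCK2 = 256        # first-level constants per second-level block

# without double quantization: one FP32 constant per block
plain = 32 / BLOCK                              # 0.5 bits/param overhead

# with double quantization: FP8 constants + FP32 second-level constants
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)      # ~0.127 bits/param

saved_bits = plain - double
print(f"{saved_bits:.3f} bits/param saved")     # -> 0.373 bits/param saved
print(f"{saved_bits * 70e9 / 8 / 1e9:.1f} GB on a 70B model")  # -> 3.3 GB ...
```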
6.4 Paged optimizer#
Optimizer memory during LLM training spikes unpredictably (e.g. on long sequences). Solution: paged optimizer states backed by NVIDIA unified memory — pages are evicted from GPU to CPU RAM when GPU memory runs out and paged back on demand. Prevents OOM crashes.
6.5 Memory math#
Llama-3-70B:
- Full FT: 140 GB params (bf16) + 140 GB grads + 840 GB optimizer state (fp32 master copy + Adam m, v) ≈ 1.1 TB → 16× H100
- QLoRA: 35 GB (4-bit params) + ~0.2 GB LoRA params + ~1.4 GB LoRA grads/optimizer ≈ 37 GB → fits in a 48 GB GPU!
A 70B model on a single A6000 (48 GB) or H100.
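The memory accounting as a sketch (assumed: bf16 weights and gradients, fp32 Adam moments plus a master copy — 16 bytes per trainable parameter):

```python
def gb(n_bytes: float) -> float:
    """Bytes -> gigabytes."""
    return n_bytes / 1e9

P, P_LORA = 70e9, 100e6   # base / adapter parameter counts (approximate)

# Full fine-tune: every param is trainable -> 2 (bf16 weight) + 2 (bf16 grad)
# + 12 (fp32 master + Adam m + Adam v) bytes per parameter
full_ft = gb(P * (2 + 2 + 12))

# QLoRA: frozen 4-bit base (0.5 byte/param), only the adapter is trainable
qlora = gb(P * 0.5 + P_LORA * (2 + 2 + 12))

print(f"full FT ~{full_ft:.0f} GB, QLoRA ~{qlora:.0f} GB")
# -> full FT ~1120 GB, QLoRA ~37 GB
```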
6.6 Quality preservation#
Dettmers' benchmarks: QLoRA matches 16-bit fine-tuning quality on:
- MMLU
- Vicuna eval
- Sample generation quality
Quantization quality loss is negligible: the trainable LoRA adapters compensate for the small errors the 4-bit base introduces.
6.7 Production Turkish QLoRA recipe#
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Train (SFTTrainer)
# ... same as full SFT but with QLoRA
```
Cost: 1× H100 (80 GB) for 1 week → ~$500. A Turkish Llama-3-70B-Instruct, ready.
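Sanity-checking the price tag, assuming roughly $3/hour for a rented H100 80GB (an assumption — cloud prices vary widely):

```python
hourly_rate = 3.0           # assumed USD/hour for a rented H100 80GB
hours = 7 * 24              # one week of training
print(f"${hourly_rate * hours:.0f}")  # -> $504
```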
6.8 Inference#
Two approaches:
- Merge: fold the LoRA weights into the base → a standalone model

```python
model = model.merge_and_unload()
model.save_pretrained("./merged-model")
```

Production: clean serving, no PEFT runtime dependency.
- Hot-swap: keep LoRA separate; swap between multiple LoRAs at runtime

```python
model.set_adapter("turkish-lora")  # or any other loaded adapter
```

Multi-tenant SaaS: same base model, a different LoRA per customer.
✅ Lesson 14.2 Summary — LoRA + QLoRA
LoRA (Hu 2021): rank-decomposition fine-tuning, ΔW ≈ BA, 0.4-1.7% of the parameters, 95%+ quality. Adapter matrices target the attention/FFN projections. QLoRA (Dettmers 2023): 4-bit NF4 base + LoRA adapter; a 70B model fine-tuned on a single GPU (48 GB). NF4 + double quant + paged optimizer tricks. Production Turkish: QLoRA on Llama-3-70B → ~$500 for a week. Inference: merge (production) or hot-swap (multi-tenant). Lesson 14.3 is the capstone — fine-tune your own Turkish Llama-3.
Next Lesson: Turkish Llama-3 Fine-Tune Capstone#
Lesson 14.3 (Module 14 capstone): Llama-3-8B + Turkish SFT + QLoRA = a production-quality Turkish model. Dataset curation + training + evaluation + publishing to HuggingFace.
Frequently Asked Questions
Can LoRA fully match full fine-tuning quality? For most tasks, yes (within 1-3%). In specific domains (math reasoning, code), LoRA can still lag full FT by 5-10%. In production, LoRA is generally preferred (cost/quality trade-off).