
LoRA + QLoRA: Parameter-Efficient Fine-Tuning Revolution — From Hu 2021 to Dettmers 2023

LoRA (Hu 2021): low-rank decomposition fine-tuning: base weights frozen, only small adapter matrices trained. 1% of parameters, 95%+ quality preservation. QLoRA (Dettmers 2023): 4-bit base + LoRA, fine-tune a 70B model on a single 48 GB GPU. NF4 quantization, paged optimizer. Turkish practical angle: a production Turkish Llama-3 70B for roughly $500 of compute.

Şükrü Yusuf KAYA
75 min read
Advanced
🎨 LoRA: the democratization of fine-tuning
Llama-3-70B full fine-tune: 8× H100 needed (~$25K of compute). Out of reach for most developers. But in 2021, Hu et al. at Microsoft published the **LoRA** paper: 'Low-Rank Adaptation'. **Freeze** the base model and train only small 'adapter' matrices. **1% of the parameters**, **95%+ of the quality**. In 2023, Dettmers' team introduced **QLoRA**: quantize the base to 4 bits and add LoRA on top. Fine-tune Llama-3-70B on a single 48 GB GPU. $25K → $500. Turkish fine-tuning, democratized. 75 minutes from now you will have grasped LoRA's mathematical anatomy, QLoRA's quantization tricks, and a production fine-tune setup.

Lesson Map (10 Sections)#

  1. Full fine-tune cost — why LoRA is needed
  2. LoRA math — rank decomposition
  3. Adapter matrices A, B — math anatomi
  4. Hu 2021 paper — empirical findings
  5. PEFT library — HuggingFace implementation
  6. QLoRA (Dettmers 2023) — 4-bit quantization
  7. NF4 + double quant + paged optim — tricks
  8. Production setup — Llama-3 + LoRA training
  9. Inference — LoRA merge or hot-swap
  10. Turkish practice — a production Turkish model for $500

2-5. LoRA Math#

2.1 Insight#

During fine-tuning, the weight update is ΔW = W_new − W_pretrained. The Hu 2021 hypothesis: ΔW can be low-rank; in other words, the task-specific information lives in a low-dimensional subspace.
Formally: ΔW ≈ B A, where:
  • B: [d_out, r]
  • A: [r, d_in]
  • r << min(d_in, d_out)
ΔW shape: [d_out, d_in] → d_out × d_in params. B and A together: d_out × r + r × d_in = r × (d_in + d_out) params. Far fewer.
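A quick sanity check for a single 4096×4096 projection (the hidden size in Llama-3-8B) with r = 16, purely illustrative:

d_in, d_out, r = 4096, 4096, 16
full_update = d_out * d_in          # 16,777,216 params in ΔW
lora_update = r * (d_in + d_out)    # 131,072 params in B and A
print(full_update // lora_update)   # 128× fewer trainable params for this layer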

2.2 Forward pass#

LoRA-augmented forward:
output = W_pretrained × x + ΔW × x ≈ W_pretrained × x + α × (B A) × x # α: scale factor
W_pretrained is frozen (no gradients). B and A are trainable.
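To make the forward pass concrete, here is a minimal sketch of a LoRA-wrapped linear layer. It is simplified relative to the PEFT implementation (no dropout, dtype handling, or multi-adapter support), and it uses the α/r scaling explained in section 2.4 below.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)               # W_pretrained is frozen
        d_out, d_in = base_linear.out_features, base_linear.in_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # trainable, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # trainable, zero init so ΔW = 0 at start
        self.scaling = alpha / r                               # α/r scaling (see section 2.4)

    def forward(self, x):
        # output = W_pretrained·x + (α/r)·(B A)·x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)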

2.3 Rank choice#

  • r = 4: very compact, ~0.1% of params, OK quality
  • r = 16-32: typical, 0.5-1% of params, great quality
  • r = 64+: larger, 2-5% of params, near-full-FT quality

2.4 α (alpha)#

LoRA scaling factor:
output = W_pretrained × x + (α/r) × B A × x
α/r normalizes magnitudes across different ranks. Typical α = 16 or 32.
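For example, with α = 32 and r = 16 the B A output is scaled by 32/16 = 2; doubling the rank to r = 32 halves the per-rank scale, so the overall update magnitude stays comparable as r is swept.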

2.5 LoRA params count#

Llama-3-8B linear layers ~6B params (excluding embeddings):
  • Full fine-tune: 6B params trainable
  • LoRA r=16: ~25M params (0.4%)
  • LoRA r=64: ~100M params (1.7%)

2.6 Hu 2021 empirical#

GPT-3 175B fine-tune:
  • Full FT: 175B params, 24× A100 days
  • LoRA r=8: 17.5M params, 1/10 compute
  • Quality: comparable (downstream tasks within 1%)
Later work: r=16 typical, quality near full-FT.

2.7 PyTorch + PEFT#

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 25M | all params: 8B | trainable%: 0.31

2.8 LoRA + transformer#

Which layers should LoRA be applied to? Common choices:
  • Attention Q, K, V projections (most impact)
  • Attention output projection
  • FFN gates (sometimes)
  • NOT applied: embeddings, layernorms, biases
The layer selection is controlled via the target_modules parameter.

6-9. QLoRA + Production#

6.1 Dettmers 2023 QLoRA#

May 2023: 'QLoRA: Efficient Finetuning of Quantized LLMs'.
The combination:
  • Base model: 4-bit quantized (frozen, no gradient)
  • LoRA adapter: 16-bit (trainable)
  • Dramatic memory savings

6.2 NF4 (4-bit NormalFloat)#

A 4-bit format designed by Dettmers: 16 values in the [-1, 1] range, optimized for a normal distribution.
Versus INT4: lower quantization error for neural network weights, which are approximately normally distributed.
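A conceptual sketch of the NF4 idea follows. This is not the exact bitsandbytes implementation (which also reserves an exact zero and uses a slightly different quantile construction); it only illustrates the principle: place the 16 levels at quantiles of a standard normal so typical weight values use the levels roughly evenly, and rescale each weight block by its absolute maximum.

import torch
from torch.distributions import Normal

normal = Normal(0.0, 1.0)
probs = torch.linspace(0.02, 0.98, 16)     # evenly spaced probabilities (tails clipped)
levels = normal.icdf(probs)                # quantiles of N(0, 1)
levels = levels / levels.abs().max()       # normalize to [-1, 1]

def nf4_quantize(w_block):
    """Quantize one weight block: scale by absmax, snap to the nearest NF4 level."""
    absmax = w_block.abs().max()
    normed = w_block / absmax                                    # now in [-1, 1]
    idx = (normed.unsqueeze(-1) - levels).abs().argmin(dim=-1)   # nearest level index
    return idx.to(torch.uint8), absmax                           # 4-bit index + per-block constant

def nf4_dequantize(idx, absmax):
    return levels[idx.long()] * absmax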

6.3 Double quantization#

Additional optimization: the quantization constants themselves are quantized to 8-bit floats, saving roughly 0.37 bits per parameter (about 3 GB on a 70B model).
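As a rough worked example, assuming the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 constants per second-level block): one 32-bit constant per 64 weights costs 32/64 = 0.5 bits per parameter, whereas after double quantization the cost drops to 8/64 + 32/(64 × 256) ≈ 0.13 bits per parameter, i.e. roughly 0.37 bits per parameter saved.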

6.4 Paged optimizer#

LLM training occasionally spikes optimizer memory (e.g. when a batch contains unusually long sequences). Solution: paged optimizer states. When GPU memory runs out, optimizer-state pages are evicted to CPU RAM (via NVIDIA unified memory) and paged back when needed, preventing OOM crashes.
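In practice, one way to enable this (assuming the bitsandbytes-backed optimizers exposed through transformers) is the optim field of TrainingArguments:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./qlora-run",
    optim="paged_adamw_8bit",  # optimizer states can be paged out to CPU RAM on memory spikes
)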

6.5 Memory math#

Llama-3-70B:
  • Full FT: 140 GB params + 280 GB grad + 560 GB optim = ~1 TB → 16× H100
  • QLoRA: 35 GB (4-bit params) + ~200 MB (100M LoRA params) + ~200 MB (optimizer states) ≈ 36 GB → fits on a 48 GB GPU!
70B model on single A6000/H100 GPU.
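A back-of-the-envelope version of this estimate (it ignores activations, quantization constants, and framework overhead, and assumes an 8-bit Adam optimizer for the ~100M adapter params):

n_params      = 70e9
base_4bit_gb  = n_params * 0.5 / 1e9      # ≈ 35 GB of 4-bit weights
lora_params   = 100e6                      # ~100M adapter params
lora_bf16_gb  = lora_params * 2 / 1e9      # ≈ 0.2 GB of bf16 adapters
optim_8bit_gb = lora_params * 2 / 1e9      # ≈ 0.2 GB of 8-bit Adam moments
print(base_4bit_gb + lora_bf16_gb + optim_8bit_gb)   # ≈ 35-36 GB once constants are added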

6.6 Quality preservation#

Dettmers' benchmarks: QLoRA matches full 16-bit fine-tuning quality on:
  • MMLU
  • Vicuna eval
  • Sample generation quality
Quantization-induced quality loss is negligible, because the LoRA adapters (trained in 16-bit) compensate for it.

6.7 Production Türkçe QLoRA recipe#

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Train (SFTTrainer)
# ... same as full SFT but with QLoRA
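The training step itself can be sketched with TRL's SFTTrainer; exact arguments vary across trl versions, and the dataset file name below is a placeholder:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="turkish_sft.jsonl", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model=model,                              # the QLoRA-wrapped model from above
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama3-70b-turkish-qlora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        optim="paged_adamw_8bit",             # paged optimizer (section 6.4)
        logging_steps=10,
    ),
)
trainer.train()
model.save_pretrained("./llama3-70b-turkish-qlora")   # saves only the LoRA adapter weights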
Cost: 1× H100 (80 GB) for about one week → ~$500. A Turkish Llama-3-70B-Instruct is ready.

6.8 Inference#

Two approaches:
  1. Merge: merge the LoRA weights into the base → a standalone model
model = model.merge_and_unload()
model.save_pretrained("./merged-model")
Production: clean serving, no PEFT runtime dependency.
  2. Hot-swap: keep the LoRA separate and swap between multiple LoRAs at runtime
model.set_adapter("turkish-lora") # or other adapter
Multi-tenant SaaS: same base model, a different LoRA per customer (a sketch follows below).
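A minimal hot-swap sketch with PEFT; the adapter paths and names below are placeholders:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", device_map="auto")
model = PeftModel.from_pretrained(base, "./loras/turkish-lora", adapter_name="turkish-lora")
model.load_adapter("./loras/legal-lora", adapter_name="legal-lora")

model.set_adapter("turkish-lora")   # route this request through the Turkish adapter
# ... generate ...
model.set_adapter("legal-lora")     # next request: another customer's adapter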
✅ Lesson 14.2 Summary — LoRA + QLoRA
LoRA (Hu 2021): rank-decomposition fine-tuning, ΔW ≈ BA, 0.4-1.7% of parameters, 95%+ quality. Adapter matrices target the attention/FFN projections. QLoRA (Dettmers 2023): 4-bit NF4 base + LoRA adapter. Fine-tune a 70B model on a single GPU (48 GB). NF4 + double quantization + paged optimizer tricks. Production Turkish: QLoRA Llama-3-70B → about $500 for a week of compute. Inference: merge (production) or hot-swap (multi-tenant). Lesson 14.3 is the capstone: fine-tune your own Turkish Llama-3.

Next Lesson: Turkish Llama-3 Fine-Tune Capstone#

Lesson 14.3 (Module 14 capstone): Llama-3-8B + Turkish SFT + QLoRA = a production-quality Turkish model. Dataset curation + training + evaluation + publishing to HuggingFace.

Frequently Asked Questions

Q: Does LoRA match full fine-tuning quality?
A: For most tasks, yes (within 1-3%). In specific domains (math reasoning, code), LoRA can still trail full FT by 5-10%. For production use, LoRA is generally preferred on the cost/quality trade-off.

