LoRA + QLoRA: The Parameter-Efficient Fine-Tuning Revolution — From Hu 2021 to Dettmers 2023
LoRA (Hu 2021): low-rank decomposition fine-tuning — base weights frozen, only a small adapter is trained. ~1% of the parameters, 95%+ quality preservation. QLoRA (Dettmers 2023): 4-bit base + LoRA, fine-tune a 70B model on a consumer GPU. NF4 quantization, paged optimizer. Turkish practice: a production Turkish Llama-3 70B for ~$500.
Şükrü Yusuf KAYA
75-minute read
Advanced · 🎨 LoRA — The Democratization of Fine-Tuning
A full fine-tune of Llama-3-70B needs 8× H100 (~$25K of compute); with QLoRA it drops to a single GPU and ~$500. Turkish fine-tuning, democratized. After 75 minutes you will have a grasp of LoRA's mathematical anatomy, QLoRA's quantization tricks, and a production fine-tuning setup.
Lesson Map (10 Sections)#
- Full fine-tune cost — why LoRA is needed
- LoRA math — rank decomposition
- Adapter matrices A, B — mathematical anatomy
- Hu 2021 paper — empirical findings
- PEFT library — HuggingFace implementation
- QLoRA (Dettmers 2023) — 4-bit quantization
- NF4 + double quant + paged optim — the tricks
- Production setup — Llama-3 + LoRA training
- Inference — LoRA merge or hot-swap
- Turkish practice — a production Turkish model for $500
2-5. LoRA Math#
2.1 Insight#
In fine-tuning, the weight update is ΔW = W_new − W_pretrained. Hu 2021's hypothesis: ΔW can be low-rank — the task-specific information lives in a low-dimensional subspace.
Formally: ΔW ≈ B A, where:
- B: [d_out, r]
- A: [r, d_in]
- r << min(d_in, d_out)
ΔW has shape [d_out, d_in] → d_out × d_in params. B and A together: d_out × r + r × d_in = r × (d_in + d_out) params — far fewer when r is small.
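The compression is easy to verify with a few lines of arithmetic (a sketch; 4096 matches Llama-3's hidden size, everything else is generic):

```python
def lora_param_ratio(d_out: int, d_in: int, r: int) -> float:
    """Ratio of LoRA adapter params (B: [d_out, r], A: [r, d_in])
    to the full delta-W matrix [d_out, d_in]."""
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return lora / full

# A square 4096x4096 projection (Llama-3 hidden size):
for r in (4, 16, 64):
    print(r, f"{lora_param_ratio(4096, 4096, r):.2%}")
# -> 4 0.20%
# -> 16 0.78%
# -> 64 3.12%
```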
2.2 Forward pass#
LoRA-augmented forward:
output = W_pretrained × x + (B A) × x # a scaling factor α is added in 2.4
W_pretrained frozen (no gradient). B, A trainable.
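The forward pass above as a dependency-free sketch (toy 4×4 identity weight; real LoRA initializes A with small Gaussian noise and B with zeros, so training starts exactly at the pretrained model):

```python
def matmul(M, N):
    """Naive matrix product of nested lists."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def matvec(M, v):
    """Naive matrix-vector product."""
    return [sum(M[i][k] * v[k] for k in range(len(v))) for i in range(len(M))]

d_in, d_out, r = 4, 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen base
A = [[0.1, 0.2, 0.3, 0.4]]            # trainable, [r, d_in]
B = [[0.0] for _ in range(d_out)]     # trainable, zero init => B A = 0 at start
x = [1.0, 2.0, 3.0, 4.0]

delta = matmul(B, A)                  # [d_out, d_in] low-rank update
out = [wx + dx for wx, dx in zip(matvec(W, x), matvec(delta, x))]
assert out == matvec(W, x)            # B = 0 => model starts at pretrained weights
```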
2.3 Rank choice#
- r = 4: very compact, ~0.1% of params, OK quality
- r = 16-32: typical, 0.5-1% of params, great quality
- r = 64+: larger, 2-5% of params, near-full-FT quality
2.4 α (alpha)#
LoRA scaling factor:
output = W_pretrained × x + (α/r) × B A × x
α/r normalizes magnitudes across different ranks. Typical α = 16 or 32.
2.5 LoRA params count#
Llama-3-8B linear layers ~6B params (excluding embeddings):
- Full fine-tune: 6B params trainable
- LoRA r=16: ~25M params (~0.4%)
- LoRA r=64: ~100M params (~1.7%)
2.6 Hu 2021 empirical#
GPT-3 175B fine-tune:
- Full FT: 175B params, 24× A100 days
- LoRA r=8: 17.5M params, 1/10 compute
- Quality: comparable (downstream tasks within 1%)
Later work: r=16 typical, quality near full-FT.
2.7 PyTorch + PEFT#
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 25M | all params: 8B | trainable%: 0.31
```
2.8 LoRA + transformer#
Which layers should LoRA be applied to? Common choices:
- Attention Q, K, V projections (most impact)
- Attention output projection
- FFN gates (sometimes)
- NOT applied: embeddings, layernorms, biases
Flexible: the target layers are chosen via the target_modules parameter.
6-9. QLoRA + Production#
6.1 Dettmers 2023 QLoRA#
May 2023: 'QLoRA: Efficient Finetuning of Quantized LLMs'.
The combination:
- Base model: 4-bit quantized (frozen, no gradients)
- LoRA adapter: 16-bit (trainable)
- Dramatic memory savings
6.2 NF4 (4-bit NormalFloat)#
The 4-bit format Dettmers designed: 16 values in [-1, 1], optimized for a normal distribution.
Vs INT4: uniform INT4 spaces its levels evenly and wastes resolution in the tails; NF4 places its levels at quantiles of a normal distribution, which matches how neural network weights are actually distributed.
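The idea can be illustrated with the standard library's NormalDist (a rough sketch of quantile-based levels; not the exact bitsandbytes construction, which is asymmetric so that zero is represented exactly):

```python
from statistics import NormalDist

def normal_float_levels(bits: int = 4) -> list:
    """Quantization levels at equally probable quantiles of N(0, 1),
    rescaled to [-1, 1] -- dense near 0, sparse in the tails."""
    n = 2 ** bits
    nd = NormalDist()
    # half-step offset avoids the infinite 0th/100th percentiles
    qs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

levels = normal_float_levels()
assert len(levels) == 16
# spacing is tighter near zero than at the extremes, unlike uniform INT4
assert (levels[8] - levels[7]) < (levels[1] - levels[0])
```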
6.3 Double quantization#
An additional optimization: the per-block quantization constants are themselves quantized (to FP8, with a second level of FP32 constants).
Saves roughly 0.37 bits per parameter on average — about 3 GB on a 70B model.
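Using the block sizes reported in the QLoRA paper (64 weights per quantization block, 256 constants per second-level block), the per-parameter saving is straightforward arithmetic:

```python
BLOCK = 64          # params per first-level quantization block
BLOCK2 = 256        # first-level constants per second-level block

# without double quantization: one FP32 constant per block
plain = 32 / BLOCK                              # 0.5 bits/param overhead

# with double quantization: FP8 constants + FP32 second-level constants
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)      # ~0.127 bits/param

saved_bits = plain - double
print(f"{saved_bits:.3f} bits/param saved")     # -> 0.373 bits/param saved
print(f"{saved_bits * 70e9 / 8 / 1e9:.1f} GB on a 70B model")  # -> 3.3 GB ...
```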
6.4 Paged optimizer#
Optimizer memory during LLM training spikes unpredictably (e.g. on long sequences). Solution: paged optimizer states backed by NVIDIA unified memory — pages are evicted from GPU to CPU RAM when GPU memory runs out and paged back on demand. Prevents OOM crashes.
6.5 Memory math#
Llama-3-70B:
- Full FT: 140 GB params (bf16) + 140 GB grads + 840 GB optimizer state (fp32 master copy + Adam m, v) ≈ 1.1 TB → 16× H100
- QLoRA: 35 GB (4-bit params) + ~0.2 GB LoRA params + ~1.4 GB LoRA grads/optimizer ≈ 37 GB → fits in a 48 GB GPU!
A 70B model on a single A6000 (48 GB) or H100.
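The memory accounting as a sketch (assumed: bf16 weights and gradients, fp32 Adam moments plus a master copy — 16 bytes per trainable parameter):

```python
def gb(n_bytes: float) -> float:
    """Bytes -> gigabytes."""
    return n_bytes / 1e9

P, P_LORA = 70e9, 100e6   # base / adapter parameter counts (approximate)

# Full fine-tune: every param is trainable -> 2 (bf16 weight) + 2 (bf16 grad)
# + 12 (fp32 master + Adam m + Adam v) bytes per parameter
full_ft = gb(P * (2 + 2 + 12))

# QLoRA: frozen 4-bit base (0.5 byte/param), only the adapter is trainable
qlora = gb(P * 0.5 + P_LORA * (2 + 2 + 12))

print(f"full FT ~{full_ft:.0f} GB, QLoRA ~{qlora:.0f} GB")
# -> full FT ~1120 GB, QLoRA ~37 GB
```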
6.6 Quality preservation#
Dettmers' benchmarks: QLoRA matches 16-bit fine-tuning quality on:
- MMLU
- Vicuna eval
- Sample generation quality
Quantization quality loss is negligible: the trainable LoRA adapters compensate for the small errors the 4-bit base introduces.
6.7 Production Turkish QLoRA recipe#
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Train (SFTTrainer)
# ... same as full SFT but with QLoRA
```
Cost: 1× H100 (80 GB) for 1 week → ~$500. A Turkish Llama-3-70B-Instruct, ready.
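Sanity-checking the price tag, assuming roughly $3/hour for a rented H100 80GB (an assumption — cloud prices vary widely):

```python
hourly_rate = 3.0           # assumed USD/hour for a rented H100 80GB
hours = 7 * 24              # one week of training
print(f"${hourly_rate * hours:.0f}")  # -> $504
```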
6.8 Inference#
Two approaches:
- Merge: fold the LoRA weights into the base → a standalone model

```python
model = model.merge_and_unload()
model.save_pretrained("./merged-model")
```

Production: clean serving, no PEFT runtime dependency.
- Hot-swap: keep LoRA separate; swap between multiple LoRAs at runtime

```python
model.set_adapter("turkish-lora")  # or any other loaded adapter
```

Multi-tenant SaaS: same base model, a different LoRA per customer.
✅ Lesson 14.2 Summary — LoRA + QLoRA
LoRA (Hu 2021): rank-decomposition fine-tuning, ΔW ≈ BA, 0.4-1.7% of the parameters, 95%+ quality. Adapter matrices target the attention/FFN projections. QLoRA (Dettmers 2023): 4-bit NF4 base + LoRA adapter; a 70B model fine-tuned on a single GPU (48 GB). NF4 + double quant + paged optimizer tricks. Production Turkish: QLoRA on Llama-3-70B → ~$500 for a week. Inference: merge (production) or hot-swap (multi-tenant). Lesson 14.3 is the capstone — fine-tune your own Turkish Llama-3.
Next Lesson: Turkish Llama-3 Fine-Tune Capstone#
Lesson 14.3 (Module 14 capstone): Llama-3-8B + Turkish SFT + QLoRA = a production-quality Turkish model. Dataset curation + training + evaluation + publishing to HuggingFace.
Frequently Asked Questions
Can LoRA fully match full fine-tuning quality? For most tasks, yes (within 1-3%). In specific domains (math reasoning, code), LoRA can still lag full FT by 5-10%. In production, LoRA is generally preferred (cost/quality trade-off).