DPO: Direct Preference Optimization — Rafailov 2023, RLHF Reborn Cheaper
DPO (Rafailov 2023): a mathematical reformulation of RLHF — no reward model, no RL. A direct preference loss. The RLHF replacement used for Llama-3. Math derivation, an implementation simpler than PPO, comparable quality. Turkish DPO in practice: aligning an 8B model for about $1K.
Şükrü Yusuf KAYA
70-minute read
Advanced 💎 DPO — saying goodbye to RLHF's 'mysterious complexity'
Rafailov, Sharma, Mitchell, Ermon, and Finn from Stanford. May 2023: 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model'. The revolutionary insight: collapse RLHF's 3-stage pipeline into a single loss function. No reward model, no PPO, no RL. Just a supervised loss. Same objective, far simpler. Within 6 months, the first nail had been driven into RLHF's coffin. Llama-3, Mistral, Zephyr, Qwen — most modern instruct models are DPO-aligned. After 70 minutes you will have grasped the mathematical insight behind DPO, why the simpler recipe preserves quality, and how Llama-3 uses it in production.
Lesson Map (10 Sections)#
- RLHF's 3-stage complexity — why it needed simplifying
- The DPO insight — the reward model is implicit
- Math derivation — Bradley-Terry + KL = direct loss
- DPO loss formula — final form
- β parameter — implicit KL weight
- Reference model — why π_ref matters
- PyTorch implementation — TRL DPOTrainer
- DPO vs RLHF empirically — Rafailov's findings
- Llama-3 DPO in production — how Meta uses it
- Turkish DPO — preference dataset + training for $1K
2-5. DPO Math#
2.1 RLHF objective recap#
Maximize: E[r(s,a)] - β × KL(π || π_ref)
Reward model r(s,a) learned separately. Optimization via PPO.
2.2 DPO insight#
DPO derivation: the closed-form solution to the optimization above (under the Bradley-Terry assumption):
r*(s,a) = β × log[π*(a|s) / π_ref(a|s)] + Z(s)
Reward = log-ratio of optimal policy to reference, plus state-only term Z(s).
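For readers who want the intermediate step, here is a sketch of where that log-ratio comes from, in the notation above; the state-only term Z(s) is β times the log of the partition function:

```latex
% Optimum of  max_pi  E[r(s,a)] - beta * KL(pi || pi_ref)  for a fixed reward r:
\pi^*(a \mid s) \;=\; \frac{1}{\tilde{Z}(s)}\,\pi_{\mathrm{ref}}(a \mid s)\,
    \exp\!\left(\frac{r(s,a)}{\beta}\right),
\qquad
\tilde{Z}(s) \;=\; \sum_{a}\,\pi_{\mathrm{ref}}(a \mid s)\,
    \exp\!\left(\frac{r(s,a)}{\beta}\right)

% Take logs and solve for the reward:
r(s,a) \;=\; \beta \log \frac{\pi^*(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
    \;+\; \underbrace{\beta \log \tilde{Z}(s)}_{Z(s)\ \text{above}}
```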
2.3 Plug into Bradley-Terry#
P(A > B) = σ(r*(A) - r*(B)) = σ(β × log[π*(A)/π_ref(A)] - β × log[π*(B)/π_ref(B)])
Z(s) cancels — both responses condition on the same prompt s, so the state-only terms subtract out.
2.4 DPO loss#
Maximum likelihood under Bradley-Terry:
L_DPO = -E[log σ(β × log[π(A)/π_ref(A)] - β × log[π(B)/π_ref(B)])]
Directly optimizable! No reward model, no PPO.
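A minimal PyTorch sketch of this loss, assuming you already have summed per-sequence log-probs for the chosen (A) and rejected (B) responses under both the current policy and the frozen reference (the `sequence_logprob` helper in the comments is hypothetical):

```python
# Minimal DPO loss sketch (illustrative names, not the TRL implementation)
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit rewards: beta * log(pi / pi_ref) for each response
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # L_DPO = -log sigma(reward_chosen - reward_rejected), averaged over the batch
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# The reference log-probs come from pi_ref with gradients disabled (it is frozen):
#   with torch.no_grad():
#       ref_logp_chosen = sequence_logprob(ref_model, prompt, chosen)      # hypothetical helper
#       ref_logp_rejected = sequence_logprob(ref_model, prompt, rejected)  # hypothetical helper
```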
2.5 Intuition#
- preferred response: increase log-prob (relative to ref)
- rejected response: decrease log-prob (relative to ref)
- β: how strictly to respect reference (similar to KL weight in RLHF)
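The loss gradient reported in the paper makes these bullets concrete (rewritten here in the lesson's notation, with implicit reward r̂ = β log[π/π_ref]):

```latex
\nabla_\theta L_{\mathrm{DPO}}
  \;=\; -\,\beta\,\mathbb{E}\!\left[
     \underbrace{\sigma\!\big(\hat{r}(B) - \hat{r}(A)\big)}_{\text{larger when the implicit reward ranks them wrongly}}
     \Big( \nabla_\theta \log \pi(A \mid s) \;-\; \nabla_\theta \log \pi(B \mid s) \Big)
  \right],
\qquad
\hat{r}(\cdot) \;=\; \beta \log \frac{\pi(\cdot \mid s)}{\pi_{\mathrm{ref}}(\cdot \mid s)}
```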
2.6 β choice#
- β = 0.1: standard
- β > 0.5: very conservative, stays close to ref
- β < 0.05: aggressive, may degrade
2.7 Reference model#
π_ref typically = SFT model (frozen). DPO learns deviation from SFT toward preferences.
2.8 Memory#
DPO training: 2x memory of SFT (π current + π_ref both needed).
QLoRA-DPO mitigates: 4-bit ref.
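A minimal loading sketch for that QLoRA-DPO setup (model id taken from the full example that follows; the point is that the frozen 4-bit base weights can double as π_ref when only LoRA adapters are trained):

```python
# QLoRA-DPO loading sketch: 4-bit base + LoRA adapters
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit (NF4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "sukruyusufkaya/llama-3-8b-tr-instruct",  # same SFT checkpoint as the full example
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Pass `model` + `lora_config` to DPOTrainer as in the full example below.
# With ref_model=None and a PEFT config, only the adapters receive gradients,
# so the frozen 4-bit base can serve as the reference -- no second full copy.
```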
```python
# DPO with HuggingFace TRL
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
import torch

# 1. Load SFT model (will be policy + reference)
model = AutoModelForCausalLM.from_pretrained(
    "sukruyusufkaya/llama-3-8b-tr-instruct",  # Module 14 capstone output
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sukruyusufkaya/llama-3-8b-tr-instruct")

# 2. Turkish preference dataset
# Format: {'prompt': '...', 'chosen': '...', 'rejected': '...'}
dataset = load_dataset("sukruyusufkaya/turkish-preferences-10k", split="train")

# 3. LoRA config (for parameter-efficient DPO)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# 4. DPO config
dpo_config = DPOConfig(
    output_dir="./llama-3-8b-tr-dpo",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,  # even lower than SFT
    warmup_steps=50,
    lr_scheduler_type="cosine",
    bf16=True,
    beta=0.1,  # DPO beta
    max_length=2048,
    max_prompt_length=1024,
    logging_steps=10,
    save_steps=200,
)

# 5. DPO trainer
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL derives the reference from the base model
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,
)

# 6. Train
trainer.train()
trainer.save_model("./llama-3-8b-tr-dpo/final")

# Cost: ~$200-500, single H100, 1-2 days
```
DPO Turkish alignment — TRL DPOTrainer
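A quick smoke test of the trained model might look like the sketch below; it assumes the LoRA adapter was saved to the path above and that the tokenizer ships a chat template:

```python
# Smoke-test the DPO'd model (hypothetical prompt; paths from the training script)
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "./llama-3-8b-tr-dpo/final",  # loads base model + LoRA adapter
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sukruyusufkaya/llama-3-8b-tr-instruct")

# "Write a short paragraph about Istanbul."
messages = [{"role": "user", "content": "İstanbul hakkında kısa bir paragraf yaz."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```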
🎉 Module 15 Completed — RLHF + DPO
Across 2 lessons: RLHF (Ouyang 2022, 3-stage SFT → RM → PPO, ChatGPT's secret sauce) and DPO (Rafailov 2023, no RL, no RM, a direct preference loss). Modern preference tuning: DPO (Llama-3, Mistral, Qwen). Turkish DPO: $200-500 cost, accessible. Module 15 inventory: 2 lessons, 145 min. Overall curriculum: 16 modules, 85 lessons, ~82 hours. Next up: Module 16 — Production Deployment (vLLM, TGI, quantization, monitoring) — the final module!
Module 15 Inventory (Completed)#
| # | Lesson | Duration |
|---|---|---|
| 15.1 | RLHF — InstructGPT, Ouyang 2022 | 75 min |
| 15.2 | DPO — Rafailov 2023 | 70 min |
| Total | 2 lessons | 145 min (~2.4 hours) |
Frequently Asked Questions
Can DPO fully replace RLHF? In most cases, yes. RLHF's remaining advantages: online learning (DPO is offline) and iterative refinement. OpenAI's ChatGPT probably uses RLHF plus additional tricks. In the open ecosystem, DPO has been dominant since 2024.