
DPO: Direct Preference Optimization — Rafailov 2023, Cheaper Rebirth of RLHF

DPO (Rafailov 2023): a mathematical reformulation of RLHF with no reward model and no RL, just a direct preference loss; the RLHF replacement used for Llama-3. Covers the math derivation, an implementation simpler than PPO with comparable quality, and a practical Turkish DPO recipe: ~$1K to align an 8B model.

Şükrü Yusuf KAYA
70 min read
Advanced
💎 DPO: saying goodbye to RLHF's 'mysterious complexity'
From Stanford: Rafailov, Sharma, Mitchell, Ermon, Finn. May 2023: 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model'. The revolutionary finding: RLHF's 3-stage pipeline collapses into a single loss function. No reward model, no PPO, no RL, just a supervised loss. Same objective, far simpler. Within six months, the first nail had been driven into RLHF's coffin. Llama-3, Mistral, Zephyr, Qwen: most modern instruct models are DPO-aligned. After 70 minutes you will understand the mathematical discovery behind DPO, why the simpler recipe preserves quality, and how Llama-3 uses it in production.

Lesson Map (10 Sections)#

  1. RLHF's 3-stage complexity — why it needed simplifying
  2. The DPO insight — the reward model is implicit
  3. Math derivation — Bradley-Terry + KL = direct loss
  4. The DPO loss formula — final form
  5. The β parameter — implicit KL weight
  6. The reference model — why π_ref matters
  7. PyTorch implementation — TRL DPOTrainer
  8. Empirical DPO vs RLHF — Rafailov's findings
  9. Llama-3 DPO in production — how Meta uses it
  10. Turkish DPO — preference dataset + training for ~$1K

2-5. DPO Math#

2.1 RLHF objective recap#

Maximize: E[r(s,a)] - β × KL(π || π_ref)
Reward model r(s,a) learned separately. Optimization via PPO.
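Written out in full (prompts s drawn from the prompt distribution, responses a sampled from the current policy), the same objective reads:

```latex
% KL-regularized RLHF objective (identical to the plain-text form above)
\max_{\pi}\;
\mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi(\cdot \mid s)}\!\left[ r(s,a) \right]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi(\cdot \mid s) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \right]
```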

2.2 DPO insight#

The DPO derivation starts from the closed-form solution of the optimization above (this holds for any reward function; the Bradley-Terry assumption only enters in the next step):
r*(s,a) = β × log[π*(a|s) / π_ref(a|s)] + Z(s)
Reward = log-ratio of optimal policy to reference, plus a term Z(s) that depends only on the state (prompt).
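One intermediate step makes this less magical: the KL-regularized objective above has a known closed-form optimum, and the reward expression follows by taking logs and rearranging (the prompt-only term β log Z(s) is what the text calls Z(s)):

```latex
% Optimal policy of the KL-regularized objective, for any reward r
\pi^{*}(a \mid s) \;=\; \frac{1}{Z(s)}\, \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(s,a) \Big),
\qquad
Z(s) \;=\; \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\Big( \tfrac{1}{\beta}\, r(s,a) \Big)

% Take logs and solve for r
r(s,a) \;=\; \beta \log \frac{\pi^{*}(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} \;+\; \beta \log Z(s)
```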

2.3 Plug into Bradley-Terry#

P(A > B) = σ(r*(A) - r*(B)) = σ(β × log[π*(A)/π_ref(A)] - β × log[π*(B)/π_ref(B)])
Z(s) cancels (same state)!

2.4 DPO loss#

Maximum likelihood under Bradley-Terry:
L_DPO = -E[log σ(β × log[π(A)/π_ref(A)] - β × log[π(B)/π_ref(B)])]
Directly optimizable! No reward model, no PPO.
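In code, the loss is only a few lines once you have the summed log-probabilities of each response under the policy and the frozen reference. A minimal sketch (the function name and input shapes are illustrative, not TRL's internal API):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input has shape (batch,): the sum of response-token log-probs
    (prompt tokens excluded) under the policy or the frozen reference."""
    # Implicit rewards: beta * log[pi(y|s) / pi_ref(y|s)]
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximum likelihood under Bradley-Terry: -log sigmoid(chosen - rejected)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Tiny smoke test with made-up log-probs
loss = dpo_loss(torch.tensor([-20.0]), torch.tensor([-25.0]),
                torch.tensor([-21.0]), torch.tensor([-24.0]))
print(loss)  # small positive scalar
```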

2.5 Intuition#

  • preferred response: increase its log-prob (relative to the reference)
  • rejected response: decrease its log-prob (relative to the reference)
  • β: how strictly to respect the reference (playing the role of the KL weight in RLHF); the gradient below makes this weighting precise
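Concretely, the gradient of L_DPO (as derived in Rafailov et al. 2023; A = preferred, B = rejected) is:

```latex
\nabla_{\theta} L_{\mathrm{DPO}}
= -\,\beta\, \mathbb{E}\!\left[
    \sigma\!\big( \hat{r}_{\theta}(s,B) - \hat{r}_{\theta}(s,A) \big)
    \Big( \nabla_{\theta} \log \pi_{\theta}(A \mid s)
        - \nabla_{\theta} \log \pi_{\theta}(B \mid s) \Big)
  \right],
\qquad
\hat{r}_{\theta}(s,y) = \beta \log \frac{\pi_{\theta}(y \mid s)}{\pi_{\mathrm{ref}}(y \mid s)}
```

So the preferred response's log-prob is pushed up and the rejected one's down, with the update weighted most heavily on pairs the implicit reward currently ranks in the wrong order; β scales how strongly all of this deviates from π_ref.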

2.6 β choice#

  • β = 0.1: standard
  • β > 0.5: very conservative, stays close to ref
  • β < 0.05: aggressive, may degrade

2.7 Reference model#

π_ref typically = SFT model (frozen). DPO learns deviation from SFT toward preferences.
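A minimal sketch of setting up π_ref by hand (the variable name `sft_model` is illustrative; in practice TRL can create and manage the reference model for you, as in the script below):

```python
import copy

# Freeze a copy of the SFT model to serve as pi_ref.
# `sft_model` is assumed to be an already-loaded AutoModelForCausalLM.
ref_model = copy.deepcopy(sft_model)
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad_(False)  # pi_ref never receives gradient updates
```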

2.8 Memory#

DPO training needs roughly 2x the memory of SFT, since both the current policy π and the frozen π_ref are held at once. QLoRA-DPO mitigates this with a 4-bit base model (see the sketch after the full training script below).
```python
# DPO with HuggingFace TRL
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
import torch

# 1. Load SFT model (serves as the policy; the reference is derived from it)
model = AutoModelForCausalLM.from_pretrained(
    "sukruyusufkaya/llama-3-8b-tr-instruct",  # Module 14 capstone output
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sukruyusufkaya/llama-3-8b-tr-instruct")

# 2. Turkish preference dataset
# Format: {'prompt': '...', 'chosen': '...', 'rejected': '...'}
dataset = load_dataset("sukruyusufkaya/turkish-preferences-10k", split="train")

# 3. LoRA config (for parameter-efficient DPO)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# 4. DPO config
dpo_config = DPOConfig(
    output_dir="./llama-3-8b-tr-dpo",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,  # even lower than SFT
    warmup_steps=50,
    lr_scheduler_type="cosine",
    bf16=True,
    beta=0.1,  # DPO beta
    max_length=2048,
    max_prompt_length=1024,
    logging_steps=10,
    save_steps=200,
)

# 5. DPO trainer
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT config, TRL uses the base weights (adapters disabled) as the reference
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL versions name this argument processing_class
    peft_config=lora_config,
)

# 6. Train
trainer.train()
trainer.save_model("./llama-3-8b-tr-dpo/final")

# Cost: ~$200-500, single H100, 1-2 days
```
Turkish DPO alignment with TRL's DPOTrainer
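To attack the 2x memory cost from section 2.8, the same recipe can load the base model in 4-bit and train only LoRA adapters. A hedged sketch (the quantization settings are common NF4 defaults, not values from the lesson):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# QLoRA-DPO: 4-bit base weights, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "sukruyusufkaya/llama-3-8b-tr-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Pass this model plus the same LoraConfig to DPOTrainer as above.
# With a PEFT adapter, TRL runs the reference forward pass with adapters
# disabled, so no second full-precision copy of the model is kept in memory.
```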
🎉 Module 15 Complete: RLHF + DPO
Across 2 lessons: RLHF (Ouyang 2022; the 3-stage SFT → RM → PPO pipeline, ChatGPT's secret sauce) and DPO (Rafailov 2023; no RL, no reward model, a direct preference loss). Modern preference tuning is DPO (Llama-3, Mistral, Qwen). Turkish DPO: $200-500 in cost, genuinely accessible. Module 15 inventory: 2 lessons, 145 min. Overall curriculum: 16 modules, 85 lessons, ~82 hours. Next: Module 16, Production Deployment (vLLM, TGI, quantization, monitoring), the final module!

Module 15 Inventory (Completed)#

| # | Lesson | Duration |
|---|--------|----------|
| 15.1 | RLHF: InstructGPT (Ouyang 2022) | 75 min |
| 15.2 | DPO (Rafailov 2023) | 70 min |
| Total | 2 lessons | 145 min (~2.4 hours) |

Frequently Asked Questions

Can DPO fully replace RLHF? In most cases, yes. RLHF's remaining advantages are online learning (DPO is offline) and iterative refinement. OpenAI's ChatGPT is probably still RLHF plus additional tricks. In the open ecosystem, DPO has been dominant since 2024.
