
DPO: Direct Preference Optimization — Rafailov 2023, Cheaper Rebirth of RLHF

DPO (Rafailov 2023): a mathematical reformulation of RLHF with no reward model and no RL, just a direct preference loss; the RLHF replacement used for Llama-3. Covers the math derivation, an implementation simpler than PPO with comparable quality, and a practical Turkish DPO recipe: ~$1K to align an 8B model.

Şükrü Yusuf KAYA
70 min read
Advanced
💎 DPO: saying goodbye to RLHF's 'mysterious complexity'
From Stanford: Rafailov, Sharma, Mitchell, Ermon, Finn. May 2023: 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model'. The revolutionary finding: RLHF's 3-stage pipeline collapses into a single loss function. No reward model, no PPO, no RL, just a supervised loss. Same objective, far simpler. Within six months, the first nail had been driven into RLHF's coffin. Llama-3, Mistral, Zephyr, Qwen: most modern instruct models are DPO-aligned. After 70 minutes you will understand the mathematical discovery behind DPO, why the simpler recipe preserves quality, and how Llama-3 uses it in production.

Lesson Map (10 Sections)#

  1. RLHF's 3-stage complexity — why it needed simplifying
  2. The DPO insight — the reward model is implicit
  3. Math derivation — Bradley-Terry + KL = direct loss
  4. The DPO loss formula — final form
  5. The β parameter — implicit KL weight
  6. The reference model — why π_ref matters
  7. PyTorch implementation — TRL DPOTrainer
  8. Empirical DPO vs RLHF — Rafailov's findings
  9. Llama-3 DPO in production — how Meta uses it
  10. Turkish DPO — preference dataset + training for ~$1K

2-5. DPO Math#

2.1 RLHF objective recap#

Maximize: E[r(s,a)] - β × KL(π || π_ref)
Reward model r(s,a) learned separately. Optimization via PPO.
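Written out in full (prompts s drawn from the prompt distribution, responses a sampled from the current policy), the same objective reads:

```latex
% KL-regularized RLHF objective (identical to the plain-text form above)
\max_{\pi}\;
\mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi(\cdot \mid s)}\!\left[ r(s,a) \right]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi(\cdot \mid s) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \right]
```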

2.2 DPO insight#

The DPO derivation starts from the closed-form solution of the optimization above (this holds for any reward function; the Bradley-Terry assumption only enters in the next step):
r*(s,a) = β × log[π*(a|s) / π_ref(a|s)] + Z(s)
Reward = log-ratio of optimal policy to reference, plus a term Z(s) that depends only on the state (prompt).
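One intermediate step makes this less magical: the KL-regularized objective above has a known closed-form optimum, and the reward expression follows by taking logs and rearranging (the prompt-only term β log Z(s) is what the text calls Z(s)):

```latex
% Optimal policy of the KL-regularized objective, for any reward r
\pi^{*}(a \mid s) \;=\; \frac{1}{Z(s)}\, \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(s,a) \Big),
\qquad
Z(s) \;=\; \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\Big( \tfrac{1}{\beta}\, r(s,a) \Big)

% Take logs and solve for r
r(s,a) \;=\; \beta \log \frac{\pi^{*}(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} \;+\; \beta \log Z(s)
```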

2.3 Plug into Bradley-Terry#

P(A > B) = σ(r*(A) - r*(B)) = σ(β × log[π*(A)/π_ref(A)] - β × log[π*(B)/π_ref(B)])
Z(s) cancels (same state)!

2.4 DPO loss#

Maximum likelihood under Bradley-Terry:
L_DPO = -E[log σ(β × log[π(A)/π_ref(A)] - β × log[π(B)/π_ref(B)])]
Directly optimizable! No reward model, no PPO.
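In code, the loss is only a few lines once you have the summed log-probabilities of each response under the policy and the frozen reference. A minimal sketch (the function name and input shapes are illustrative, not TRL's internal API):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input has shape (batch,): the sum of response-token log-probs
    (prompt tokens excluded) under the policy or the frozen reference."""
    # Implicit rewards: beta * log[pi(y|s) / pi_ref(y|s)]
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximum likelihood under Bradley-Terry: -log sigmoid(chosen - rejected)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Tiny smoke test with made-up log-probs
loss = dpo_loss(torch.tensor([-20.0]), torch.tensor([-25.0]),
                torch.tensor([-21.0]), torch.tensor([-24.0]))
print(loss)  # small positive scalar
```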

2.5 Intuition#

  • preferred response: increase its log-prob (relative to the reference)
  • rejected response: decrease its log-prob (relative to the reference)
  • β: how strictly to respect the reference (playing the role of the KL weight in RLHF); the gradient below makes this weighting precise
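Concretely, the gradient of L_DPO (as derived in Rafailov et al. 2023; A = preferred, B = rejected) is:

```latex
\nabla_{\theta} L_{\mathrm{DPO}}
= -\,\beta\, \mathbb{E}\!\left[
    \sigma\!\big( \hat{r}_{\theta}(s,B) - \hat{r}_{\theta}(s,A) \big)
    \Big( \nabla_{\theta} \log \pi_{\theta}(A \mid s)
        - \nabla_{\theta} \log \pi_{\theta}(B \mid s) \Big)
  \right],
\qquad
\hat{r}_{\theta}(s,y) = \beta \log \frac{\pi_{\theta}(y \mid s)}{\pi_{\mathrm{ref}}(y \mid s)}
```

So the preferred response's log-prob is pushed up and the rejected one's down, with the update weighted most heavily on pairs the implicit reward currently ranks in the wrong order; β scales how strongly all of this deviates from π_ref.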

2.6 β choice#

  • β = 0.1: standard
  • β > 0.5: very conservative, stays close to ref
  • β < 0.05: aggressive, may degrade

2.7 Reference model#

π_ref typically = SFT model (frozen). DPO learns deviation from SFT toward preferences.
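A minimal sketch of setting up π_ref by hand (the variable name `sft_model` is illustrative; in practice TRL can create and manage the reference model for you, as in the script below):

```python
import copy

# Freeze a copy of the SFT model to serve as pi_ref.
# `sft_model` is assumed to be an already-loaded AutoModelForCausalLM.
ref_model = copy.deepcopy(sft_model)
ref_model.eval()
for param in ref_model.parameters():
    param.requires_grad_(False)  # pi_ref never receives gradient updates
```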

2.8 Memory#

DPO training needs roughly 2x the memory of SFT, since both the current policy π and the frozen π_ref are held at once. QLoRA-DPO mitigates this with a 4-bit base model (see the sketch after the full training script below).
```python
# DPO with HuggingFace TRL
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
import torch

# 1. Load SFT model (serves as the policy; the reference is derived from it)
model = AutoModelForCausalLM.from_pretrained(
    "sukruyusufkaya/llama-3-8b-tr-instruct",  # Module 14 capstone output
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("sukruyusufkaya/llama-3-8b-tr-instruct")

# 2. Turkish preference dataset
# Format: {'prompt': '...', 'chosen': '...', 'rejected': '...'}
dataset = load_dataset("sukruyusufkaya/turkish-preferences-10k", split="train")

# 3. LoRA config (for parameter-efficient DPO)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# 4. DPO config
dpo_config = DPOConfig(
    output_dir="./llama-3-8b-tr-dpo",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,  # even lower than SFT
    warmup_steps=50,
    lr_scheduler_type="cosine",
    bf16=True,
    beta=0.1,  # DPO beta
    max_length=2048,
    max_prompt_length=1024,
    logging_steps=10,
    save_steps=200,
)

# 5. DPO trainer
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT config, TRL uses the base weights (adapters disabled) as the reference
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL versions name this argument processing_class
    peft_config=lora_config,
)

# 6. Train
trainer.train()
trainer.save_model("./llama-3-8b-tr-dpo/final")

# Cost: ~$200-500, single H100, 1-2 days
```
Turkish DPO alignment with TRL's DPOTrainer
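To attack the 2x memory cost from section 2.8, the same recipe can load the base model in 4-bit and train only LoRA adapters. A hedged sketch (the quantization settings are common NF4 defaults, not values from the lesson):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# QLoRA-DPO: 4-bit base weights, bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "sukruyusufkaya/llama-3-8b-tr-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Pass this model plus the same LoraConfig to DPOTrainer as above.
# With a PEFT adapter, TRL runs the reference forward pass with adapters
# disabled, so no second full-precision copy of the model is kept in memory.
```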
🎉 Module 15 Complete: RLHF + DPO
Across 2 lessons: RLHF (Ouyang 2022; the 3-stage SFT → RM → PPO pipeline, ChatGPT's secret sauce) and DPO (Rafailov 2023; no RL, no reward model, a direct preference loss). Modern preference tuning is DPO (Llama-3, Mistral, Qwen). Turkish DPO: $200-500 in cost, genuinely accessible. Module 15 inventory: 2 lessons, 145 min. Overall curriculum: 16 modules, 85 lessons, ~82 hours. Next: Module 16, Production Deployment (vLLM, TGI, quantization, monitoring), the final module!

Module 15 Inventory (Completed)#

| # | Lesson | Duration |
|---|--------|----------|
| 15.1 | RLHF: InstructGPT (Ouyang 2022) | 75 min |
| 15.2 | DPO (Rafailov 2023) | 70 min |
| Total | 2 lessons | 145 min (~2.4 hours) |

Frequently Asked Questions

Can DPO fully replace RLHF? In most cases, yes. RLHF's remaining advantages are online learning (DPO is offline) and iterative refinement. OpenAI's ChatGPT is probably still RLHF plus additional tricks. In the open ecosystem, DPO has been dominant since 2024.
