DPO Family: SimPO + IPO + CPO + RPO + APO — Decision Matrix of 5 Variants
The DPO family expanded in 2023-2024: SimPO (Meng et al.) adds length normalization and drops the reference model, IPO (Azar et al.) targets overfitting, CPO (Xu et al.) trains SFT and preference objectives jointly, RPO (Pang et al.) is online and iterative, and APO is anchored. This lesson covers the loss formula for each, when to use which, and a quick RTX 4090 comparison.
Şükrü Yusuf KAYA
30 min read
Advanced
1. Table of the 5 Variants
| Method | Key Innovation | Loss Form |
|---|---|---|
| DPO (Rafailov 2023) | Implicit reward via π/π_ref | -log σ(β·(log_ratio_w - log_ratio_l)) |
| IPO (Azar 2023) | Squared loss, no overfit | (log_ratio_w - log_ratio_l - 1/(2β))² |
| SimPO (Meng 2024) | Length-normalized, no ref | -log σ(β/\|y_w\| · log π(y_w) - β/\|y_l\| · log π(y_l) - γ) |
| CPO (Xu 2024) | SFT + DPO joint | L_SFT + L_DPO_simplified |
| RPO (Pang 2024) | Online iterative SPIN-like | DPO + new pref pairs from current policy |
| APO (D'Oosterlinck 2024) | Anchored to gold | DPO + L2 anchor on chosen |
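As a rough illustration of the loss forms in the table, the sketch below computes the per-pair DPO, IPO, and SimPO losses in plain Python. The function names and argument layout are illustrative assumptions, not TRL's API. Inputs are summed sequence log-probabilities log π(y) under the policy and, where needed, under the reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """-log sigma(beta * (log_ratio_w - log_ratio_l))."""
    log_ratio_w = policy_w - ref_w  # log pi(y_w) - log pi_ref(y_w)
    log_ratio_l = policy_l - ref_l  # log pi(y_l) - log pi_ref(y_l)
    return -log_sigmoid(beta * (log_ratio_w - log_ratio_l))


def ipo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """(log_ratio_w - log_ratio_l - 1/(2*beta))^2."""
    h = (policy_w - ref_w) - (policy_l - ref_l)
    return (h - 1.0 / (2.0 * beta)) ** 2


def simpo_loss(policy_w, policy_l, len_w, len_l, beta=2.0, gamma=1.0):
    """-log sigma(beta/|y_w| * log pi(y_w) - beta/|y_l| * log pi(y_l) - gamma).

    No reference model is involved; len_w and len_l are token counts.
    """
    margin = beta * policy_w / len_w - beta * policy_l / len_l - gamma
    return -log_sigmoid(margin)
```

With policy log-probs of -10 (chosen) and -12 (rejected), reference log-probs of -11 for both, and β = 0.1, the log-ratio gap is 2, so the IPO loss is (2 - 5)² = 9, while the DPO loss is -log σ(0.2). The squared form keeps pulling the gap toward the finite target 1/(2β) instead of rewarding an ever larger gap, which is the intuition behind IPO's resistance to overfitting.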
Cookbook recommendation:
- Simple production setup → DPO (the most common and most robust choice)
- Worried about overfitting → IPO
- Want to drop the reference model → SimPO (no ref model = memory savings)
- SFT + alignment in a single stage → CPO or ORPO
- Long-running online RL → RPO
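The recommendation list can be encoded as a small lookup, sketched below. The function name, the flag names, and the priority order used when several needs apply at once are assumptions made for illustration; the list itself does not rank them.

```python
def recommend_method(
    worried_about_overfitting: bool = False,
    drop_reference_model: bool = False,
    single_stage_sft_and_alignment: bool = False,
    long_running_online_rl: bool = False,
) -> str:
    """Return the cookbook's suggested method for the first matching need.

    The check order below is an assumed tie-break, not part of the cookbook.
    """
    if long_running_online_rl:
        return "RPO"
    if single_stage_sft_and_alignment:
        return "CPO or ORPO"
    if drop_reference_model:
        return "SimPO"
    if worried_about_overfitting:
        return "IPO"
    # Default: simple production setup.
    return "DPO"
```

Calling `recommend_method()` with no flags falls through to DPO, matching the "simple production" default.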
2. SimPO in Detail
SimPO (Meng et al. 2024):
L_SimPO = -log σ(β/|y_w| · log π(y_w) - β/|y_l| · log π(y_l) - γ)
- Length normalization (the /|y_w| and /|y_l| terms): removes the bias toward long answers
- No reference model: roughly 40% memory savings
- γ is a target margin: the minimum gap required between the chosen and rejected rewards
- β = 2.0-2.5 is the sweet spot (much larger than the β ≈ 0.1 typical for DPO)
- γ = 0.5-1.4
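To make the length-normalization and margin points concrete, here is a minimal sketch with made-up per-token log-probabilities. The helper names and the numbers are illustrative assumptions. It compares a long chosen answer whose tokens are individually more likely against a short rejected answer.

```python
def summed_logprob(token_logprobs):
    """Unnormalized sequence log-probability: log pi(y)."""
    return sum(token_logprobs)


def simpo_reward(token_logprobs, beta=2.0):
    """SimPO's implicit reward: beta / |y| * log pi(y)."""
    return beta * sum(token_logprobs) / len(token_logprobs)


# Hypothetical data: 40 chosen tokens at -0.4 each, 10 rejected tokens at -1.0 each.
chosen = [-0.4] * 40
rejected = [-1.0] * 10

# Summed log-probs scale with token count, so the longer answer looks worse.
raw_gap = summed_logprob(chosen) - summed_logprob(rejected)

# Per-token averaging removes the length effect: about -0.8 vs -2.0.
norm_gap = simpo_reward(chosen) - simpo_reward(rejected)

gamma = 1.0
margin_met = norm_gap > gamma  # True once the reward gap exceeds the target margin
```

Here the raw gap is negative (about -16 versus -10), while the normalized reward gap is about 1.2, which clears a target margin of γ = 1.0.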
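SimPO is available directly in TRL with loss_type="simpo". TRL exposes this option and simpo_gamma on CPOConfig (used with CPOTrainer) rather than on DPOConfig; setting cpo_alpha=0.0 drops CPO's SFT term and leaves the plain SimPO loss:
cfg = CPOConfig(loss_type="simpo", cpo_alpha=0.0, beta=2.5, simpo_gamma=1.0, ...)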
Result: SimPO beats DPO in some AlpacaEval 2.0 settings, but DPO still holds up well in others.
✅ Deliverable
1) Run DPO, IPO, and SimPO on the same dataset.
2) Compare them on MT-Bench-TR.
3) Next lesson: 11.7, GRPO (the DeepSeek-R1 Recipe).