# DPO Revolution: Rafailov 2023's Mathematical Discovery — Compressing RLHF into a Single Loss Function

> Source: https://sukruyusufkaya.com/en/learn/llm-muhendisligi/dpo-devrim-rafailov-2023-matematik-kesfi
> Updated: 2026-05-13T13:00:29.930Z
> Category: LLM Engineering
> Module: Module 15: Preference Alignment — RLHF, PPO, DPO, GRPO
**TLDR:** Direct Preference Optimization (Rafailov et al., 2023): a full derivation of the mathematical result that compresses RLHF's three stages into a single supervised loss. Covers the reward model's hidden reparameterization, the optimal solution of the Bradley-Terry model under a KL constraint, why DPO claims that "every LLM is already a reward model," and the mathematical meaning of the closed-form solution. Includes a numerical comparison with PPO, modern DPO variants (IPO, KTO, SimPO), and a Turkish DPO production pipeline.
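To make the "single supervised loss" concrete before the derivation, here is a minimal sketch of the per-example DPO loss using plain Python floats. The function name, argument names, and the `beta=0.1` default are illustrative assumptions, not the paper's reference implementation; in practice the log-probabilities come from a policy model and a frozen reference model, batched in a tensor library.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities.

    pi_chosen / pi_rejected  : log-prob of the preferred / dispreferred
                               response under the current policy
    ref_chosen / ref_rejected: same log-probs under the frozen reference model
    beta                     : strength of the implicit KL constraint
                               (0.1 is an illustrative default)
    """
    # Implicit reward margin: beta * difference of policy-vs-reference log-ratios
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin): binary cross-entropy on the Bradley-Terry preference
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

With identical log-probabilities the margin is zero and the loss is `log 2`; as the policy raises the chosen response's log-ratio relative to the rejected one, the loss falls toward zero, with `beta` controlling how far the policy may drift from the reference.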

