Direct Preference Optimization

A simpler alignment approach that learns directly from preference pairs.

Direct Preference Optimization (DPO) aligns a language model with preferences more directly than the classical RLHF pipeline of first fitting a reward model and then running reinforcement learning against it. Instead, human or system preferences are expressed as pairwise comparisons (a preferred and a dispreferred response to the same prompt), and the model is trained with a simple classification-style loss that raises the likelihood of preferred responses relative to a frozen reference model. Because no separate reward model or RL loop is required, the resulting training process tends to be more stable and easier to optimize in practice.
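
Concretely, DPO minimizes a logistic loss on the difference of implicit rewards, where each completion's implicit reward is β times the log-ratio of its policy probability to its reference-model probability. Below is a minimal PyTorch sketch of that loss under stated assumptions: the function name and β default are illustrative, and the log-probabilities are assumed to be pre-computed sums over completion tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from per-sequence log-probabilities, shape (batch,).

    `beta` controls how far the policy may drift from the reference.
    """
    # Implicit reward for each completion: how much more (or less)
    # the policy prefers it than the frozen reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Logistic loss on the margin: push the chosen completion's
    # implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the chosen and rejected completions for each prompt are scored in a single forward pass through the policy and (with gradients disabled) the reference model, and the loss above is the only training objective.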