# Classical RLHF: Reward Model + PPO + KL Constraint — Why Industry Abandoned It

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-rlhf-classical-reward-model-ppo-kl
> Updated: 2026-05-14T14:42:57.815Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part XI — Alignment & Preference Optimization

**TLDR:** Classical RLHF (Christiano et al., 2017; InstructGPT, 2022) is the foundation of modern alignment. It runs in three stages: supervised fine-tuning (SFT) of a base model, reward model training on human preference pairs, and PPO with a KL penalty against the SFT policy. Why did it largely vanish from industry? PPO instability, the maintenance burden of the value head and extra model copies, and DPO's practical superiority. Includes a mini-RLHF demo with TRL on an RTX 4090.
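
For reference, the stage-3 objective that "PPO with KL constraint" refers to is the KL-penalized reward from the InstructGPT setup. The notation below is the conventional one, not taken from this article: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen SFT policy, $r_\phi$ the learned reward model, and $\beta$ the KL coefficient:

$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
$$

The KL term keeps the policy close to the SFT model so it cannot drift into reward hacking; $\beta$ trades off reward maximization against that constraint.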

