Classical RLHF: Reward Model + PPO + KL Constraint, Full Walkthrough

Classical RLHF: Reward Model + PPO + KL Constraint — Why Industry Abandoned It

RLHF (Christiano et al. 2017; InstructGPT 2022) is the foundation of modern alignment. It has 3 stages: an SFT base, reward model training, and PPO with a KL constraint. Why did it largely vanish from industry? PPO instability, the burden of maintaining a value head, and DPO's practical superiority. Mini-RLHF demo with TRL on an RTX 4090.

Şükrü Yusuf KAYA

32 min read

6/22/2026

Advanced

1. RLHF: The 3-Stage Pipeline

Stage 1: SFT
  base model → SFT on instruction dataset → π_SFT

Stage 2: Reward model
  prompts × {chosen, rejected} → train RM (pairwise ranking on preference pairs) → R_φ
  Loss: -log σ(R_φ(chosen) - R_φ(rejected))

Stage 3: PPO
  Initialize π_θ from the weights of π_SFT
  each step:
    - Sample response from π_θ
    - Compute reward: r = R_φ(response) - β · KL(π_θ || π_SFT)
    - PPO update: maximize clipped objective
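
The reward-model loss and the clipped PPO objective listed above can be sketched in a few lines. This is a minimal illustration in plain Python on scalar inputs, not the TRL implementation: the function names `pairwise_rm_loss` and `ppo_clipped_objective` are hypothetical, `clip_eps=0.2` is an assumed (commonly used) value, and the advantages are passed in directly, whereas a real pipeline would derive them from the value head.

```python
import math


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Loss for one preference pair: -log sigma(R(chosen) - R(rejected)).

    Uses the identity -log sigma(d) = log(1 + exp(-d)) in a form that
    avoids overflow for large |d|.
    """
    d = r_chosen - r_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    logp_new / logp_old: log-probs of the sampled actions under the
    current policy and the policy that generated the samples.
    advantages: assumed to be precomputed (hypothetical input here).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # RM ranks the chosen answer higher -> small loss; reversed -> large loss.
    print(pairwise_rm_loss(2.0, 0.0), pairwise_rm_loss(0.0, 2.0))
    # A large ratio with positive advantage is capped at (1 + eps) * A.
    print(ppo_clipped_objective([1.0], [0.0], [1.0]))
```

The clipping is what keeps a single update from moving the policy too far: once the probability ratio leaves the `[1 - eps, 1 + eps]` band in the direction the advantage favors, the objective stops rewarding further movement.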

2. KL Constraint: Why Does It Matter?

While PPO optimizes the policy for reward, the policy should stay close to the SFT model, because:

The reward model is imperfect (it makes mistakes)
The model may resort to "reward gaming" (over-optimization)
Generation diversity should be preserved

KL constraint:

r_total = r_RM - β · KL(π_θ || π_SFT)

β = 0.01-0.1 is typical
Large KL → policy is far from SFT → large penalty
Small KL → policy stays close to SFT → small penalty
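
The penalized reward above can be made concrete with a small sketch. It assumes the KL term is estimated from a single sampled response as the sum of per-token log-probability differences between π_θ and π_SFT, which is one common estimator but not the only one; a single-sample estimate is noisy and can even be negative. The function names and the default `beta=0.05` are hypothetical, chosen from within the typical range quoted above.

```python
def sequence_kl_estimate(logp_policy, logp_sft):
    """Single-sample estimate of KL(pi_theta || pi_SFT) for one response.

    logp_policy / logp_sft: per-token log-probs of the *sampled* tokens
    under the current policy and the frozen SFT reference.
    """
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_sft))


def kl_shaped_reward(r_rm, logp_policy, logp_sft, beta=0.05):
    """r_total = r_RM - beta * KL(pi_theta || pi_SFT)."""
    return r_rm - beta * sequence_kl_estimate(logp_policy, logp_sft)


if __name__ == "__main__":
    # Policy identical to SFT on these tokens: no penalty.
    print(kl_shaped_reward(1.0, [-1.0, -0.5], [-1.0, -0.5]))
    # Policy assigns higher probability than SFT: reward is reduced.
    print(kl_shaped_reward(1.0, [-1.0, -0.5], [-1.5, -1.0]))
```

Raising `beta` pulls the policy back toward π_SFT at the cost of reward; lowering it allows more drift and more room for reward gaming.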

3. Why Did Industry Abandon RLHF?

Problem	Impact
3-stage pipeline complexity	Large codebase, hard to debug
PPO instability	Hyperparameter-sensitive, risk of divergence
Value head maintenance	Extra 1B params, high training cost
Reward hacking	Model exploits the RM, quality drops
GPU memory	4 models at once: actor, critic, RM, ref → 4×W (W = weight memory of one model)
DPO alternative	Comparable quality, 1 stage, more stable
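
The 4×W memory row can be checked with back-of-the-envelope arithmetic. The sketch below counts weights only and assumes all four models are the same size and stored at 2 bytes per parameter (e.g. bf16); the 7B parameter count is a hypothetical example, and gradients, optimizer states, and activations for the trainable models would add to this figure.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight-only memory of one model in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def ppo_weight_footprint_gb(n_params: float, bytes_per_param: int = 2,
                            n_models: int = 4) -> float:
    """Weights for actor + critic + RM + ref, assuming equal sizes."""
    return n_models * weight_memory_gb(n_params, bytes_per_param)


if __name__ == "__main__":
    w = weight_memory_gb(7e9)             # one hypothetical 7B model
    total = ppo_weight_footprint_gb(7e9)  # four copies of W
    print(f"W = {w:.1f} GB, 4xW = {total:.1f} GB")
```

Under these assumptions a single 7B model already needs 14 GB for weights, so four of them come to 56 GB before any training state is counted.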

Result: after 2023, industry shifted to DPO/ORPO/KTO. RLHF/PPO is still used in research (especially reasoning RL) but is rare in production.

✅ Deliverables

1) Understand the RLHF pipeline conceptually; it is the "before" picture of modern alignment. 2) Read the TRL PPOTrainer documentation. 3) Next lesson: 11.2, DPO Math (Bradley-Terry Derivation).
