Classical RLHF: Reward Model + PPO + KL Constraint — Why Industry Abandoned It
RLHF (Christiano et al. 2017; InstructGPT 2022) is the foundation of modern alignment. It has three stages: an SFT base model, reward model training, and PPO with a KL constraint. Why has it largely vanished from industry? PPO instability, the burden of maintaining a value head, and DPO's practical advantages. Mini-RLHF demo with TRL on an RTX 4090.
Şükrü Yusuf KAYA
32 min read
Advanced
1. RLHF: The 3-Stage Pipeline
Stage 1: SFT
  base model → SFT on instruction dataset → π_SFT

Stage 2: Reward model
  prompts × {chosen, rejected} → train RM (scalar preference score) → R_φ
  Loss: -log σ(R_φ(chosen) - R_φ(rejected))

Stage 3: PPO
  initialize π_θ from π_SFT's weights
  at each step:
    - Sample a response from π_θ
    - Compute reward: r = R_φ(response) - β · KL(π_θ || π_SFT)
    - PPO update: maximize the clipped objective
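The Stage 2 loss and the Stage 3 clipped objective can be sketched in a few lines of plain Python. This is a minimal illustration, not the TRL implementation: the function names are hypothetical, rewards are treated as single floats rather than batched tensors, and the clip range `eps = 0.2` is an assumed default that the pipeline above does not specify.

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2 loss: -log sigmoid(r_chosen - r_rejected), computed stably."""
    d = r_chosen - r_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp.
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Stage 3 per-sample objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).

    ratio is pi_theta(a|s) / pi_old(a|s); eps = 0.2 is an assumed clip range.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# The RM loss shrinks as the chosen response is scored further above the rejected one.
loss_good = pairwise_rm_loss(r_chosen=2.0, r_rejected=0.0)
loss_bad = pairwise_rm_loss(r_chosen=0.0, r_rejected=2.0)
```

With a positive advantage, a ratio above `1 + eps` earns no extra objective, which is what keeps a single PPO update from moving the policy too far.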
2. KL Constraint: Why Does It Matter?
While PPO optimizes the policy for reward, the policy should stay close to the SFT model, because:
- The reward model is imperfect (it makes mistakes)
- The policy may engage in "reward gaming" (over-optimization)
- Generation diversity should be preserved
KL constraint:
r_total = r_RM - β · KL(π_θ || π_SFT)
- Typical β = 0.01-0.1
- Large KL → policy is far from SFT → larger penalty
- Small KL → policy stays close to SFT
3. Why Did Production Stacks Abandon RLHF?
| Problem | Impact |
|---|---|
| 3-stage pipeline complexity | Large codebase, hard to debug |
| PPO instability | Hyperparameter-sensitive, risk of divergence |
| Value head maintenance | Extra 1B params, high training cost |
| Reward hacking | Model exploits the RM, quality drops |
| GPU memory | 4 models at once: actor, critic, RM, ref → 4×W (W = weights of one model) |
| DPO equivalent | Comparable quality, 1 stage, more stable |
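The 4×W memory row can be made concrete with rough arithmetic. The sketch below counts weights only (no gradients, optimizer states, or activations) and assumes a hypothetical 7B-parameter model stored in 16-bit precision, with all four PPO models the same size. DPO is counted as two models (policy and reference).

```python
def weight_memory_gb(n_params: float, n_models: int, bytes_per_param: int = 2) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes); ignores grads and optimizer."""
    return n_params * bytes_per_param * n_models / 1e9

# Hypothetical 7B model in 16-bit precision (2 bytes per parameter).
ppo_gb = weight_memory_gb(7e9, n_models=4)  # actor, critic, RM, ref
dpo_gb = weight_memory_gb(7e9, n_models=2)  # policy, ref

print(f"PPO weights: {ppo_gb:.0f} GB, DPO weights: {dpo_gb:.0f} GB")
# prints "PPO weights: 56 GB, DPO weights: 28 GB"
```

Under these assumptions the PPO setup needs twice the weight memory of DPO before any training state is counted.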
Conclusion: after 2023 the industry shifted to DPO/ORPO/KTO. RLHF/PPO is still used in research (especially reasoning RL) but is rare in production.
✅ Deliverables
1) Understand the RLHF pipeline conceptually: it is the "before" picture of modern alignment.
2) Read the documentation for TRL's PPOTrainer.
3) Next lesson: 11.2, DPO Math (Bradley-Terry Derivation).