RLHF: Reinforcement Learning from Human Feedback — From Ouyang 2022 InstructGPT to ChatGPT
Full anatomy of RLHF: SFT model → reward model training (Bradley-Terry) → PPO RL training. Ouyang 2022 InstructGPT paper, 3-stage pipeline, KL divergence penalty, reward hacking concerns. ChatGPT's secret sauce. Turkish RLHF challenges (human annotator pool, cultural nuances).
Şükrü Yusuf KAYA
75 min read
Advanced 🏆 RLHF: what sets ChatGPT apart from GPT-3
December 2022. The ChatGPT launch. The world is stunned. GPT-3 has already been around for two years, yet ChatGPT feels completely different. Why? RLHF: Reinforcement Learning from Human Feedback. The Ouyang et al. 2022 paper, 'Training language models to follow instructions with human feedback'. A 3-stage pipeline: SFT → reward model → PPO. This mechanism is what makes the model 'helpful, harmless, honest'. 75 minutes from now you will have a deep grasp of RLHF's mathematical anatomy: the reward model (Bradley-Terry), the PPO loss, and the KL penalty tricks.
Lesson Map (10 Sections)#
- Why SFT is not enough and RLHF is needed
- 3-stage pipeline overview
- Stage 1: SFT (continuing from Module 14)
- Stage 2: Reward Model (Bradley-Terry math)
- Human preference data (A/B comparison collection)
- Stage 3: PPO RL training (policy optimization)
- KL divergence penalty (preventing reward hacking)
- Reward hacking (sycophancy, gaming)
- InstructGPT empirical results (Ouyang's findings)
- Turkish RLHF challenges (annotator pool, culture)
1-7. RLHF Pipeline#
1.1 The limits of SFT#
The SFT model follows instructions, but it does not know what 'good' looks like. It has no preference among multiple valid responses.
Example: 'Türkiye'nin başkenti?' ('What is the capital of Turkey?')
Responses:
- A: 'Ankara'
- B: 'Türkiye'nin başkenti Ankara'dır.'
- C: 'Ankara şehri, 1923'ten beri Türkiye'nin başkentidir. Daha önce İstanbul...'
All of them are correct. Which one is better? SFT treats them equally.
RLHF: human evaluators choose the preferred response, and the model learns from these preferences.
1.2 Stage 1: SFT#
Details are in Module 14. Pre-trained base + instruction dataset → SFT model.
Ouyang: 13K SFT examples for InstructGPT.
1.3 Stage 2: Reward Model#
Reward Model (RM): input + response → scalar reward score.
Training data: human-labeled comparisons.
Query: 'Türkiye'nin başkenti?'
Response A: 'Ankara' (preferred by the human)
Response B: 'İstanbul' (judged worse by the human)
Label: A > B
Many such comparisons are collected (Ouyang: 33K for InstructGPT), and the RM is trained to predict these preferences.
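In code, one labeled comparison can be stored as a simple record; a minimal sketch in Python (the field names are illustrative, not InstructGPT's actual schema):

# One human-labeled comparison (illustrative schema)
comparison = {
    "prompt": "Türkiye'nin başkenti?",
    "chosen": "Ankara",        # the response the annotator preferred
    "rejected": "İstanbul",    # the response the annotator judged worse
}
# A reward-model training set is simply a list of such records.
preference_dataset = [comparison]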
1.4 Bradley-Terry model#
Math foundation:
P(A > B) = σ(r(A) - r(B))
σ = sigmoid. Reward difference → preference probability.
Loss (negative log-likelihood):
L_RM = -E[log σ(r(A) - r(B))]
RM learns to score responses such that preferred ones get higher reward.
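A minimal PyTorch sketch of this pairwise loss, assuming the RM already produces one scalar reward per (prompt, response) pair:

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # L_RM = -E[log σ(r(A) - r(B))], where A is the human-preferred response.
    # reward_chosen / reward_rejected: shape (batch,), scalar RM scores.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Tiny example: the loss shrinks as preferred responses get higher scores.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))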
1.5 RM architecture#
Early versions: a copy of the SFT model plus an extra value head:
hidden = transformer(input)           # (batch, seq_len, hidden_size)
reward = linear(hidden[:, -1, :])     # last token's hidden state → scalar reward
More recent setups: a separate, smaller model (around 2B parameters), which is more compute-efficient.
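A fuller PyTorch sketch of the value-head variant described above; the backbone's call signature and the use of the last position are assumptions for illustration (real implementations score the last non-padding token):

import torch.nn as nn

class RewardModel(nn.Module):
    # SFT backbone copy + scalar value head (sketch, not a production implementation)
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                 # e.g. a copy of the SFT transformer
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)  # (batch, seq_len, hidden_size)
        last_hidden = hidden[:, -1, :]                     # last token's hidden state
        return self.value_head(last_hidden).squeeze(-1)    # (batch,) scalar rewards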
1.6 Stage 3: PPO#
Proximal Policy Optimization (Schulman et al., 2017), a reinforcement learning algorithm.
Agent: the SFT model (whose weights are being updated). Reward: the RM's output.
Objective (to be maximized):
J = E[ r(x, y) - β × KL(π_new(· | x) || π_ref(· | x)) ]
- x: the prompt, y: the sampled response
- π_new: current policy (the model being trained)
- π_ref: the SFT model (frozen reference)
- KL penalty: prevents the policy from drifting too far from SFT
1.7 The critical role of the KL penalty#
Without the KL term, the model degenerates into 'reward hacking': producing gibberish that fools the RM.
With the KL term, the policy stays close to the SFT model and makes only small adjustments.
β = 0.01 to 0.1 typical.
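In PPO-style RLHF implementations the penalty is usually applied per token of the sampled response; a minimal sketch under that assumption (the single-sample log-ratio serves as a per-token estimate of the KL term; variable names are illustrative):

import torch

def kl_penalized_rewards(rm_score, logprobs_new, logprobs_ref, beta=0.05):
    # logprobs_new / logprobs_ref: (batch, seq_len) log-probs of the sampled response tokens
    # under the current policy and the frozen SFT reference model.
    per_token_kl = logprobs_new - logprobs_ref   # log π_new - log π_ref (KL estimate per token)
    rewards = -beta * per_token_kl               # KL penalty at every token
    rewards[:, -1] += rm_score                   # RM score (batch,) added at the final token
    return rewards                               # fed to advantage estimation / PPO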
1.8 PPO loss expanded#
L_PPO_clip = E[ min( ρ × A, clip(ρ, 1-ε, 1+ε) × A ) ], where ρ = π_new(a|s) / π_old(a|s)
A = advantage (reward minus a baseline). The clip prevents overly large policy updates (ε ≈ 0.2 is typical). Here π_old is the policy that generated the current rollouts, not the frozen SFT reference.
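A minimal PyTorch sketch of the clipped objective written as a loss to minimize (token-level, with illustrative variable names):

import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, eps=0.2):
    # logprobs_*: (batch, seq_len) log-probs of the sampled tokens;
    # advantages: (batch, seq_len), e.g. GAE over the KL-penalized rewards.
    ratio = torch.exp(logprobs_new - logprobs_old)          # π_new / π_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # negated: we maximize the objective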
1.9 InstructGPT empirical results#
Ouyang 2022 findings:
- SFT: preferred over GPT-3 175B by human evaluators 71% of the time
- RLHF: preferred over GPT-3 175B 88% of the time
RLHF is a dramatic improvement over pure SFT.
1.10 ChatGPT (2022)#
ChatGPT = InstructGPT pipeline + better data + more iterations.
Closed model: the exact data and number of iterations are not public, but the same 3-stage architecture applies.
8-10. Reward Hacking + Turkish RLHF#
8.1 Reward hacking#
The RM is imperfect. The model finds shortcuts that fool the RM without actually being good:
- Sycophancy: always agree with user (high RM reward, bad behavior)
- Verbosity: longer answers (RM bias toward length)
- Hedging: 'I am not sure but...' (RM bias toward humility)
- Refusals: refuse complex queries (avoid hard tasks)
8.2 Mitigation#
- Tighter KL penalty
- RM re-training with hacked examples
- Ensemble of multiple RMs (see the sketch after this list)
- Constitutional AI (Anthropic): rule-based feedback
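As an illustration of the ensemble idea, the sampled response can be scored by several independently trained RMs and aggregated pessimistically; a sketch (this aggregation scheme is one common choice, not a fixed recipe, and it assumes each RM follows the RewardModel interface sketched earlier):

import torch

def ensemble_reward(reward_models, input_ids, attention_mask, pessimism=0.5):
    # Scores from several RMs; subtracting part of the disagreement (std)
    # makes it harder for the policy to exploit any single RM's blind spots.
    scores = torch.stack([rm(input_ids, attention_mask) for rm in reward_models])  # (n_rm, batch)
    return scores.mean(dim=0) - pessimism * scores.std(dim=0)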
8.3 Turkish RLHF challenges#
- Annotator pool: Turkish-fluent annotators are scarce at commercial labelling firms
- Cultural nuance: Turkish humor and formality (sen/siz) are hard to capture
- Bias: the Turkish corpus is İstanbul-centric; Anatolia is underrepresented
- Compliance: KVKK and Turkish-specific content moderation requirements
8.4 A pragmatic approach for Turkish#
- Translate English RLHF data: cheap but lossy
- Curate Turkish-native preferences (expensive)
- Hybrid: pre-train the RM on English data, fine-tune on a Turkish subset
8.5 Cost#
A full RLHF pipeline for a 7B model: 100K+ USD in compute and annotation. Beyond hobbyist scope.
DPO (Lesson 15.2) is a cheaper alternative with no RL.
✅ Lesson 15.1 Summary: RLHF
RLHF (Ouyang 2022): a 3-stage pipeline. (1) SFT, (2) Reward Model (Bradley-Terry preference modeling), (3) PPO RL training (KL-penalized). ChatGPT's secret sauce. InstructGPT: 71% → 88% preferred over GPT-3. Reward hacking: sycophancy, verbosity, hedging. Mitigation: tight KL, RM ensembles, Constitutional AI. Turkish RLHF challenges: annotator pool, culture, KVKK. Cost: 100K+ USD, beyond hobbyist scope. Lesson 15.2 covers DPO, a cheaper alternative.
Next Lesson: DPO (Direct Preference Optimization)#
Lesson 15.2: DPO (Rafailov 2023): no reward model, no RL, a direct preference loss. The RLHF replacement used for Llama-3. The de facto standard in practical production settings.
Frequently Asked Questions
Can I run the full RLHF pipeline with open-source tools?
Yes: the TRL library (Hugging Face) supports the full RLHF pipeline. But compute is expensive (PPO roughly 4x the cost of SFT) and annotation is expensive. Most open models prefer DPO.