RLHF: Reinforcement Learning from Human Feedback — From Ouyang 2022's InstructGPT to ChatGPT
The full anatomy of RLHF: SFT model → reward model training (Bradley-Terry) → PPO RL training. The Ouyang 2022 InstructGPT paper, the 3-stage pipeline, the KL divergence penalty, and reward hacking concerns. ChatGPT's secret sauce. Plus the challenges of Turkish RLHF (human annotator pool, cultural nuances).
Şükrü Yusuf KAYA
75-minute read
Advanced 🏆 RLHF — What Sets ChatGPT Apart from GPT-3
December 2022. The ChatGPT launch. The world is stunned. GPT-3 had already been around for two years, yet ChatGPT feels completely different. Why? RLHF: Reinforcement Learning from Human Feedback. The Ouyang et al. 2022 paper, 'Training language models to follow instructions with human feedback'. A 3-stage pipeline: SFT → reward model → PPO. This mechanism is what makes the model 'helpful, harmless, honest'. 75 minutes from now, you will have a deep grasp of RLHF's mathematical anatomy: the reward model (Bradley-Terry), the PPO loss, and the KL penalty tricks.
Lesson Map (10 Sections)#
- SFT falls short — why RLHF is needed
- 3-stage pipeline overview
- Stage 1: SFT — continuing from Module 14
- Stage 2: Reward Model — Bradley-Terry math
- Human preference data — A/B comparison collection
- Stage 3: PPO RL training — policy optimization
- KL divergence penalty — preventing reward hacking
- Reward hacking — sycophancy, gaming
- InstructGPT empirical results — Ouyang's findings
- Turkish RLHF challenges — annotator pool, culture
1-7. RLHF Pipeline#
1.1 The limits of SFT#
An SFT model follows instructions, but it has no notion of which response is 'good': among multiple valid responses, it has no preference.
Example: 'Türkiye'nin başkenti?' ('What is the capital of Türkiye?')
Responses:
- A: 'Ankara'
- B: 'Türkiye'nin başkenti Ankara'dır.' ('The capital of Türkiye is Ankara.')
- C: 'Ankara şehri, 1923'ten beri Türkiye'nin başkentidir. Daha önce İstanbul...' ('The city of Ankara has been the capital of Türkiye since 1923. Previously, İstanbul...')
All three are correct. Which is better? SFT treats them as equal.
RLHF: human evaluators choose the preferred response, and the model learns these preferences.
1.2 Stage 1: SFT#
Covered in detail in Module 14. Pre-trained base + instruction dataset → SFT model.
Ouyang: 13K SFT examples for InstructGPT.
1.3 Stage 2: Reward Model#
Reward Model (RM): input + response → scalar reward score.
Training data: human-labeled comparisons.
Query: 'Türkiye'nin başkenti?'
Response A: 'Ankara' (preferred by the human)
Response B: 'İstanbul' (dispreferred by the human)
Label: A > B
Many such comparisons are collected (Ouyang: 33K for InstructGPT), and the RM is trained to predict the human preferences.
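A pairwise comparison is often stored as a simple record; a minimal sketch (the field names `prompt`/`chosen`/`rejected` are illustrative, not from the paper):

```python
# One pairwise comparison record for RM training (field names are illustrative)
comparison = {
    "prompt": "Türkiye'nin başkenti?",
    "chosen": "Ankara",       # response preferred by the human annotator
    "rejected": "İstanbul",   # dispreferred response
}
print(comparison["chosen"])   # → Ankara
```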
1.4 Bradley-Terry model#
Math foundation:
P(A > B) = σ(r(A) - r(B))
σ = sigmoid. Reward difference → preference probability.
Loss (negative log-likelihood):
L_RM = -E[log σ(r(A) - r(B))]
RM learns to score responses such that preferred ones get higher reward.
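The loss can be sketched numerically; a minimal NumPy version, assuming `r_preferred` and `r_rejected` are scalar RM scores for the two responses:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bradley_terry_loss(r_preferred, r_rejected):
    """Negative log-likelihood that the preferred response wins:
    L = -log σ(r_A - r_B). Small when r_preferred >> r_rejected."""
    return -np.log(sigmoid(r_preferred - r_rejected))

# RM scores the preferred response higher → small loss
print(bradley_terry_loss(2.0, 0.5))   # ≈ 0.201
# RM scores it lower → large loss, gradient pushes rewards apart
print(bradley_terry_loss(0.5, 2.0))   # ≈ 1.701
```

Minimizing this loss over many comparisons is exactly what drives the RM to assign higher scores to preferred responses.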
1.5 RM architecture#
Early versions: a copy of the SFT model plus an extra value head:
hidden = transformer(input)          # (batch, seq, d_model)
reward = linear(hidden[:, -1, :])    # last token's hidden state → scalar reward
Modern practice: a separate, smaller model (~2B params), which is compute-efficient.
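A shape-level NumPy sketch of the value-head idea (here `hidden` stands in for the transformer's output, and the linear weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

batch, seq_len, d_model = 2, 8, 16
# Stand-in for the transformer's output over (prompt + response) tokens
hidden = rng.normal(size=(batch, seq_len, d_model))

# Value head: a single linear layer mapping the last token's state → scalar
W = rng.normal(size=(d_model, 1)) * 0.02
b = np.zeros(1)

reward = hidden[:, -1, :] @ W + b    # (batch, 1): one scalar per (prompt, response)
print(reward.shape)                  # → (2, 1)
```

The key point is the output shape: one scalar reward per sequence, read off the final token.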
1.6 Stage 3: PPO#
Proximal Policy Optimization (Schulman et al. 2017), a reinforcement learning algorithm.
Agent: the policy, initialized from the SFT model (its weights are updated). Reward: the RM's output.
Objective (maximized):
J = E[r(s, a) - β × KL(π_new || π_ref)]
- π_new: current policy (the model being trained)
- π_ref: the SFT model (frozen reference for the KL term)
- KL penalty: prevents the policy from drifting too far from SFT
1.7 The critical role of the KL penalty#
Without the KL term, the model degenerates into 'reward hacking': gibberish that fools the RM.
With the KL term, the policy stays close to SFT and makes only small adjustments.
β = 0.01 to 0.1 is typical.
1.8 PPO loss expanded#
L_clip = E[min(ρ × A, clip(ρ, 1-ε, 1+ε) × A)], where ρ = π_new(a|s) / π_old(a|s)
A = advantage (reward - baseline). Here π_old is the policy before the current update (not the SFT reference), and the clip prevents overly large policy steps.
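The clipped surrogate above can be sketched directly in NumPy (toy single-token values, chosen so the clip is active):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate (to be maximized).
    ratio = π_new/π_old; clipping caps how far one update can move the policy."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# Positive advantage with ratio 1.5 > 1+ε: the gain is capped at 1.2 × A
logp_new = np.log(np.array([0.3]))
logp_old = np.log(np.array([0.2]))
print(ppo_clip_objective(logp_new, logp_old, advantage=np.array([2.0])))  # → 2.4
```

With ε = 0.2, the unclipped term would be 1.5 × 2.0 = 3.0, but the clip caps it at 1.2 × 2.0 = 2.4: the policy cannot profit from moving more than 20% away from π_old in one step.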
1.9 InstructGPT empirical#
Findings from Ouyang 2022:
- SFT: preferred over GPT-3 175B by human raters 71% of the time
- RLHF: preferred over GPT-3 175B 88% of the time
RLHF is a dramatic improvement over pure SFT.
1.10 ChatGPT (2022)#
ChatGPT = the InstructGPT pipeline + better data + more iterations.
Closed source: the exact data and iteration counts are unknown, but the same 3-stage architecture applies.
8-10. Reward Hacking + Turkish#
8.1 Reward hacking#
The RM is imperfect. The model finds shortcuts that fool the RM without actually being good:
- Sycophancy: always agree with the user (high RM reward, bad behavior)
- Verbosity: longer answers (RM bias toward length)
- Hedging: 'I am not sure, but...' (RM bias toward humility)
- Refusals: refusing complex queries (avoiding hard tasks)
8.2 Mitigation#
- Tighter KL penalty
- Re-training the RM on hacked examples
- Ensembling multiple RMs
- Constitutional AI (Anthropic) — rule-based feedback
8.3 Turkish RLHF challenges#
- Annotator pool: Turkish-fluent annotators are scarce at commercial labeling firms
- Cultural nuance: Turkish humor and formality (sen/siz) are hard to capture
- Bias: the Turkish corpus is İstanbul-centric; Anatolia is underrepresented
- Compliance: KVKK and Turkish-specific content moderation
8.4 A pragmatic approach for Turkish#
- Translate English RLHF data: cheap but lossy
- Curate Turkish-native preferences: high quality but expensive
- Hybrid: pre-train the RM on English, fine-tune on a Turkish subset
8.5 Cost#
A full RLHF pipeline for a 7B model: $100K+ in compute plus annotation. Beyond hobbyist scope.
DPO (Lesson 15.2) is a cheaper alternative with no RL.
✅ Lesson 15.1 Summary — RLHF
RLHF (Ouyang 2022): a 3-stage pipeline. (1) SFT, (2) Reward Model (Bradley-Terry preference modeling), (3) PPO RL training (KL-penalized). ChatGPT's secret sauce. InstructGPT: 71% (SFT) → 88% (RLHF) preferred over GPT-3. Reward hacking: sycophancy, verbosity, hedging. Mitigation: tight KL, RM ensembles, Constitutional AI. Turkish RLHF challenges: annotator pool, culture, KVKK. Cost: $100K+ — beyond hobbyist scope. Lesson 15.2 covers DPO, a cheaper alternative.
Next Lesson: DPO — Direct Preference Optimization#
Lesson 15.2: DPO (Rafailov 2023) — no reward model, no RL, a direct preference loss. The RLHF replacement used for Llama-3. The de facto standard in production practice.
Frequently Asked Questions
Yes — the TRL library (HuggingFace) covers the full RLHF pipeline. But compute is expensive (PPO ≈ 4× the SFT cost), and so is annotation. Most open models prefer DPO.