RLHF: Reinforcement Learning from Human Feedback — From Ouyang 2022 InstructGPT to ChatGPT
Full anatomy of RLHF: SFT model → reward model training (Bradley-Terry) → PPO RL training. Ouyang 2022 InstructGPT paper, 3-stage pipeline, KL divergence penalty, reward hacking concerns. ChatGPT's secret sauce. Turkish RLHF challenges (human annotator pool, cultural nuances).
Şükrü Yusuf KAYA
75 min read
Advanced 🏆 RLHF: what sets ChatGPT apart from GPT-3
December 2022. The ChatGPT launch. The world is stunned. GPT-3 has already been around for two years, yet ChatGPT feels completely different. Why? RLHF: Reinforcement Learning from Human Feedback. The Ouyang et al. 2022 paper, 'Training language models to follow instructions with human feedback'. A 3-stage pipeline: SFT → reward model → PPO. This mechanism is what makes the model 'helpful, harmless, honest'. 75 minutes from now you will have a deep grasp of RLHF's mathematical anatomy: the reward model (Bradley-Terry), the PPO loss, and the KL penalty tricks.
Lesson Map (10 Sections)#
- Why SFT is not enough and RLHF is needed
- 3-stage pipeline overview
- Stage 1: SFT (continuing from Module 14)
- Stage 2: Reward Model (Bradley-Terry math)
- Human preference data (A/B comparison collection)
- Stage 3: PPO RL training (policy optimization)
- KL divergence penalty (preventing reward hacking)
- Reward hacking (sycophancy, gaming)
- InstructGPT empirical results (Ouyang's findings)
- Turkish RLHF challenges (annotator pool, culture)
1-7. RLHF Pipeline#
1.1 The limits of SFT#
The SFT model follows instructions, but it does not know what 'good' looks like. It has no preference among multiple valid responses.
Example: 'Türkiye'nin başkenti?' ('What is the capital of Turkey?')
Responses:
- A: 'Ankara'
- B: 'Türkiye'nin başkenti Ankara'dır.'
- C: 'Ankara şehri, 1923'ten beri Türkiye'nin başkentidir. Daha önce İstanbul...'
All of them are correct. Which one is better? SFT treats them equally.
RLHF: human evaluators choose the preferred response, and the model learns from these preferences.
1.2 Stage 1: SFT#
Details are in Module 14. Pre-trained base + instruction dataset → SFT model.
Ouyang: 13K SFT examples for InstructGPT.
1.3 Stage 2: Reward Model#
Reward Model (RM): input + response → scalar reward score.
Training data: human-labeled comparisons.
Query: 'Türkiye'nin başkenti?'
Response A: 'Ankara' (preferred by the human)
Response B: 'İstanbul' (judged worse by the human)
Label: A > B
Many such comparisons are collected (Ouyang: 33K for InstructGPT), and the RM is trained to predict these preferences.
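In code, one labeled comparison can be stored as a simple record; a minimal sketch in Python (the field names are illustrative, not InstructGPT's actual schema):

# One human-labeled comparison (illustrative schema)
comparison = {
    "prompt": "Türkiye'nin başkenti?",
    "chosen": "Ankara",        # the response the annotator preferred
    "rejected": "İstanbul",    # the response the annotator judged worse
}
# A reward-model training set is simply a list of such records.
preference_dataset = [comparison]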
1.4 Bradley-Terry model#
Math foundation:
P(A > B) = σ(r(A) - r(B))
σ = sigmoid. Reward difference → preference probability.
Loss (negative log-likelihood):
L_RM = -E[log σ(r(A) - r(B))]
RM learns to score responses such that preferred ones get higher reward.
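A minimal PyTorch sketch of this pairwise loss, assuming the RM already produces one scalar reward per (prompt, response) pair:

import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # L_RM = -E[log σ(r(A) - r(B))], where A is the human-preferred response.
    # reward_chosen / reward_rejected: shape (batch,), scalar RM scores.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Tiny example: the loss shrinks as preferred responses get higher scores.
loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))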
1.5 RM architecture#
Early versions: a copy of the SFT model plus an extra value head:
hidden = transformer(input)           # (batch, seq_len, hidden_size)
reward = linear(hidden[:, -1, :])     # last token's hidden state → scalar reward
More recent setups: a separate, smaller model (around 2B parameters), which is more compute-efficient.
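A fuller PyTorch sketch of the value-head variant described above; the backbone's call signature and the use of the last position are assumptions for illustration (real implementations score the last non-padding token):

import torch.nn as nn

class RewardModel(nn.Module):
    # SFT backbone copy + scalar value head (sketch, not a production implementation)
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                 # e.g. a copy of the SFT transformer
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)  # (batch, seq_len, hidden_size)
        last_hidden = hidden[:, -1, :]                     # last token's hidden state
        return self.value_head(last_hidden).squeeze(-1)    # (batch,) scalar rewards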
1.6 Stage 3: PPO#
Proximal Policy Optimization (Schulman et al., 2017), a reinforcement learning algorithm.
Agent: the SFT model (whose weights are being updated). Reward: the RM's output.
Objective (to be maximized):
J = E[ r(x, y) - β × KL(π_new(· | x) || π_ref(· | x)) ]
- x: the prompt, y: the sampled response
- π_new: current policy (the model being trained)
- π_ref: the SFT model (frozen reference)
- KL penalty: prevents the policy from drifting too far from SFT
1.7 The critical role of the KL penalty#
Without the KL term, the model degenerates into 'reward hacking': producing gibberish that fools the RM.
With the KL term, the policy stays close to the SFT model and makes only small adjustments.
β = 0.01 to 0.1 typical.
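In PPO-style RLHF implementations the penalty is usually applied per token of the sampled response; a minimal sketch under that assumption (the single-sample log-ratio serves as a per-token estimate of the KL term; variable names are illustrative):

import torch

def kl_penalized_rewards(rm_score, logprobs_new, logprobs_ref, beta=0.05):
    # logprobs_new / logprobs_ref: (batch, seq_len) log-probs of the sampled response tokens
    # under the current policy and the frozen SFT reference model.
    per_token_kl = logprobs_new - logprobs_ref   # log π_new - log π_ref (KL estimate per token)
    rewards = -beta * per_token_kl               # KL penalty at every token
    rewards[:, -1] += rm_score                   # RM score (batch,) added at the final token
    return rewards                               # fed to advantage estimation / PPO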
1.8 PPO loss expanded#
L_PPO_clip = E[ min( ρ × A, clip(ρ, 1-ε, 1+ε) × A ) ], where ρ = π_new(a|s) / π_old(a|s)
A = advantage (reward minus a baseline). The clip prevents overly large policy updates (ε ≈ 0.2 is typical). Here π_old is the policy that generated the current rollouts, not the frozen SFT reference.
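A minimal PyTorch sketch of the clipped objective written as a loss to minimize (token-level, with illustrative variable names):

import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, eps=0.2):
    # logprobs_*: (batch, seq_len) log-probs of the sampled tokens;
    # advantages: (batch, seq_len), e.g. GAE over the KL-penalized rewards.
    ratio = torch.exp(logprobs_new - logprobs_old)          # π_new / π_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # negated: we maximize the objective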
1.9 InstructGPT empirical results#
Ouyang 2022 findings:
- SFT: preferred over GPT-3 175B by human evaluators 71% of the time
- RLHF: preferred over GPT-3 175B 88% of the time
RLHF is a dramatic improvement over pure SFT.
1.10 ChatGPT (2022)#
ChatGPT = InstructGPT pipeline + better data + more iterations.
Closed model: the exact data and number of iterations are not public, but the same 3-stage architecture applies.
8-10. Reward Hacking + Turkish RLHF#
8.1 Reward hacking#
The RM is imperfect. The model finds shortcuts that fool the RM without actually being good:
- Sycophancy: always agree with user (high RM reward, bad behavior)
- Verbosity: longer answers (RM bias toward length)
- Hedging: 'I am not sure but...' (RM bias toward humility)
- Refusals: refuse complex queries (avoid hard tasks)
8.2 Mitigation#
- Tighter KL penalty
- RM re-training with hacked examples
- Ensemble of multiple RMs (see the sketch after this list)
- Constitutional AI (Anthropic): rule-based feedback
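As an illustration of the ensemble idea, the sampled response can be scored by several independently trained RMs and aggregated pessimistically; a sketch (this aggregation scheme is one common choice, not a fixed recipe, and it assumes each RM follows the RewardModel interface sketched earlier):

import torch

def ensemble_reward(reward_models, input_ids, attention_mask, pessimism=0.5):
    # Scores from several RMs; subtracting part of the disagreement (std)
    # makes it harder for the policy to exploit any single RM's blind spots.
    scores = torch.stack([rm(input_ids, attention_mask) for rm in reward_models])  # (n_rm, batch)
    return scores.mean(dim=0) - pessimism * scores.std(dim=0)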
8.3 Turkish RLHF challenges#
- Annotator pool: Turkish-fluent annotators are scarce at commercial labelling firms
- Cultural nuance: Turkish humor and formality (sen/siz) are hard to capture
- Bias: the Turkish corpus is İstanbul-centric; Anatolia is underrepresented
- Compliance: KVKK and Turkish-specific content moderation requirements
8.4 A pragmatic approach for Turkish#
- Translate English RLHF data: cheap but lossy
- Curate Turkish-native preferences (expensive)
- Hybrid: pre-train the RM on English data, fine-tune on a Turkish subset
8.5 Cost#
A full RLHF pipeline for a 7B model: 100K+ USD in compute and annotation. Beyond hobbyist scope.
DPO (Lesson 15.2) is a cheaper alternative with no RL.
✅ Lesson 15.1 Summary: RLHF
RLHF (Ouyang 2022): a 3-stage pipeline. (1) SFT, (2) Reward Model (Bradley-Terry preference modeling), (3) PPO RL training (KL-penalized). ChatGPT's secret sauce. InstructGPT: 71% → 88% preferred over GPT-3. Reward hacking: sycophancy, verbosity, hedging. Mitigation: tight KL, RM ensembles, Constitutional AI. Turkish RLHF challenges: annotator pool, culture, KVKK. Cost: 100K+ USD, beyond hobbyist scope. Lesson 15.2 covers DPO, a cheaper alternative.
Next Lesson: DPO (Direct Preference Optimization)#
Lesson 15.2: DPO (Rafailov 2023): no reward model, no RL, a direct preference loss. The RLHF replacement used for Llama-3. The de facto standard in practical production settings.
Frequently Asked Questions
Can I run the full RLHF pipeline with open-source tools?
Yes: the TRL library (Hugging Face) supports the full RLHF pipeline. But compute is expensive (PPO roughly 4x the cost of SFT) and annotation is expensive. Most open models prefer DPO.