GRPO (Group Relative Policy Optimization): DeepSeek-R1's Verifiable Reward Recipe
GRPO (DeepSeek, 2024) is a simplified variant of PPO: no critic/value head and no learned reward model. Sample G different responses per prompt and normalize **relative rewards** within each group. Verifiable rewards (math correctness, code execution) are what make reasoning RL work. Qwen 7B + GRPO on GSM8K: +5-8% accuracy, on a single RTX 4090.
Şükrü Yusuf KAYA
38 min read
1. GRPO Anatomy — How It Differs from PPO#
| Component | PPO | GRPO |
|---|---|---|
| Actor (policy) | ✅ | ✅ |
| Critic (value head) | ✅ extra ~1B params | ❌ |
| Reward model | ✅ extra ~8B params | ❌ (verifiable rewards) |
| Group sampling | ❌ | ✅ (G = 8-16 responses/prompt) |
| Advantage estimation | GAE (V from critic) | Group mean baseline |
| Memory (fine-tuning) | 4 models held at once | 2 models held at once |
GRPO's simplification: drop the value head entirely. Sample 8-16 responses per prompt and use the group's mean reward as the baseline. Since the rewards are verifiable, no reward model is needed either.
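As a minimal sketch of that baseline idea (plain PyTorch; the function name and shapes are illustrative, not the TRL implementation):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, G) verifiable rewards, G responses per prompt.
    Returns A_i = (r_i - mean) / std, computed within each group."""
    mean = rewards.mean(dim=-1, keepdim=True)  # group mean replaces the critic's value estimate
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)      # eps guards groups where every reward is identical

# One prompt, G=4 responses, only the last one correct:
adv = group_advantages(torch.tensor([[-0.5, -0.5, -0.5, 1.0]]))
# -> ≈ [[-0.5, -0.5, -0.5, 1.5]]: only the correct response gets a positive advantage
```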
2. GRPO Objective#
J(θ) = E[ 1/G · Σ_i min( π_θ(y_i)/π_old(y_i) · A_i, clip(π_θ(y_i)/π_old(y_i), 1-ε, 1+ε) · A_i ) ] - β · KL(π_θ || π_ref)

A_i = (r_i - mean(r_1, ..., r_G)) / std(r_1, ..., r_G)
- G = group size (DeepSeek used 64; the cookbook uses 4-8 on an RTX 4090)
- r_i = verifiable reward of the i-th response
- A_i = group-normalized advantage
- ε = PPO-style clip range
- The KL term keeps π_θ anchored to the SFT/reference policy
3. Verifiable Rewards — The Heart of Reasoning RL#
```python
import re


def math_reward(prompt, response, gold_answer):
    """GSM8K-style math reward — regex extract + compare."""
    # Extract the final numerical answer after the "####" marker
    match = re.search(r"####\s*(-?\d+(?:\.\d+)?)", response)
    if not match:
        return -1.0  # no answer format
    pred = float(match.group(1))
    if abs(pred - float(gold_answer)) < 1e-6:
        return 1.0  # correct
    return -0.5  # wrong


def code_reward(prompt, response, test_cases):
    """Code reward — execute response, check test cases pass."""
    # extract_code_block / run_test_cases: sandboxed helpers assumed to exist (not shown here)
    code = extract_code_block(response)
    if not code:
        return -1.0
    try:
        passed = run_test_cases(code, test_cases, timeout=5)
        return passed / len(test_cases)  # fraction passed
    except Exception:
        return -1.0  # crash, timeout, or unrunnable code


def format_reward(response):
    """Format adherence — does it have <think>...</think>?"""
    if "<think>" in response and "</think>" in response:
        return 0.2
    return 0.0


# Combined
def combined_reward(prompt, response, gold, tests):
    return math_reward(prompt, response, gold) + format_reward(response)
```
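A quick sanity check of the shaping above (the values follow directly from the functions as written):

```python
resp_good = "<think>3 + 4 = 7</think> The answer is #### 7"
resp_bad = "I think the answer is seven."

print(combined_reward("q", resp_good, "7", None))  # 1.0 (correct) + 0.2 (format) = 1.2
print(combined_reward("q", resp_bad, "7", None))   # -1.0 (no #### answer) + 0.0 = -1.0
```

Note the asymmetry: a wrong answer scores -0.5 but a missing answer format scores -1.0, which nudges the policy to always commit to a parseable final answer.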
```python
# === GRPO Lab — Qwen 2.5 7B + GSM8K + RTX 4090 ===
# One of the most advanced Labs in the cookbook
import re

import torch
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer
from datasets import load_dataset

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    max_seq_length=2048,  # kept short for GRPO
    dtype="bfloat16",
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",
)

# Dataset — GSM8K, mapped to the "prompt" column GRPOTrainer expects
# plus the "gold_answer" column the reward function reads from kwargs
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda ex: {
        "prompt": ex["question"],
        "gold_answer": ex["answer"].split("####")[-1].strip().replace(",", ""),
    }
)


# Reward function
def reward_func(prompts, completions, **kwargs):
    """Return a reward for each completion."""
    rewards = []
    for i, completion in enumerate(completions):
        gold = kwargs["gold_answer"][i]
        match = re.search(r"####\s*(-?\d+)", completion)
        if match and int(match.group(1)) == int(gold):
            rewards.append(1.0)
        else:
            rewards.append(-0.5)
    return rewards


cfg = GRPOConfig(
    output_dir="qwen-7b-grpo-math",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=4,  # G = 4 (for the 4090)
    learning_rate=5e-6,
    bf16=True,
    optim="paged_adamw_8bit",
    max_prompt_length=512,
    max_completion_length=1024,
    beta=0.04,  # KL coefficient
    logging_steps=5,
    save_steps=100,
    report_to="wandb",
)

trainer = GRPOTrainer(
    model=model,
    args=cfg,
    reward_funcs=[reward_func],
    train_dataset=dataset,
    tokenizer=tok,
)
trainer.train()

# Bench:
# - GSM8K accuracy: Qwen 7B base 85.4 → GRPO 91.2 (+5.8)
# - Wall-clock: 6-8 hours, RTX 4090 + 4 generations/prompt
# - Peak GB: 13.5 (multi-sample memory)
```
GRPO Lab — Qwen 7B + GSM8K + RTX 4090
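To reproduce the before/after bench numbers, a hedged eval sketch (greedy decoding; `gsm8k_accuracy`, the sample count, and the generation settings are assumptions, not cookbook code):

```python
def gsm8k_accuracy(model, tok, n=200):
    """Greedy-decode n GSM8K test questions, exact-match on the #### answer."""
    test = load_dataset("openai/gsm8k", "main", split="test").select(range(n))
    correct = 0
    for ex in test:
        gold = ex["answer"].split("####")[-1].strip().replace(",", "")
        inputs = tok(ex["question"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=True)
        match = re.search(r"####\s*(-?\d+)", text.replace(",", ""))
        if match and match.group(1) == gold:
            correct += 1
    return correct / n

# Call once before trainer.train() and once after to get the base → GRPO delta.
```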
✅ Deliverables
1) Run the GRPO Lab. 2) Measure GSM8K accuracy, base vs post-GRPO (the eval sketch above can serve as a starting point). 3) Next lesson: 11.8 — Reward Function Engineering.