Reward Function Engineering: Verifiable, Math, Code, Format, Length, Diversity
The reward function is the definition of success for GRPO/PPO. This lesson covers math (regex/SymPy), code (exec + tests), format (chat template adherence), length (anti-rambling), diversity (n-gram penalty), and composability. It is the Cookbook's reward function design guide.
Şükrü Yusuf KAYA
28 min read
Advanced
1. Reward Categories
A. Verifiable correctness
- Math: regex extraction + numerical comparison, or SymPy AST equality
- Code: execute + assert test cases
- Structured output: JSON schema validation
- Format: chat template adherence (<think> tags, etc.)
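The format and structured-output checks in the list above can be sketched with the standard library alone. This is a minimal illustration, not the Cookbook's implementation: the `<think>...</think>` template, the function names `format_reward` and `json_reward`, and the `required` key-to-type mapping are assumptions made for the example, and a full JSON Schema validator would replace the simple type check in practice. The -1.0 / -0.5 / 1.0 scale mirrors the one used by the other rewards in this lesson.

```python
import json
import re

# Assumed template: reasoning inside <think>...</think>, then a non-empty answer.
THINK_TEMPLATE = re.compile(r"\A\s*<think>.+?</think>\s*\S.*\Z", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if THINK_TEMPLATE.match(response) else 0.0


def json_reward(response: str, required: dict) -> float:
    """Check that the response is a JSON object with the required keys and types.

    `required` maps key name -> expected Python type (hypothetical mini-schema).
    Returns -1.0 on parse failure, -0.5 on a schema mismatch, 1.0 on success.
    """
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return -1.0  # not valid JSON at all
    if not isinstance(obj, dict):
        return -0.5  # valid JSON, but not an object
    ok = all(isinstance(obj.get(key), typ) for key, typ in required.items())
    return 1.0 if ok else -0.5
```

For example, `json_reward('{"answer": 4}', {"answer": int})` passes, while a response that quotes the number as a string is penalized as a schema mismatch.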
B. Continuous quality (less verifiable)
- Length penalty (anti-rambling)
- Diversity bonus (anti-repetition)
- Coherence (perplexity by judge model)
- Style adherence (formality, etc.)
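The length penalty and diversity bonus can both be computed from whitespace tokens. The sketch below is one possible shape, under stated assumptions: the 500-token budget and 0.001 per-token rate are taken from the composable reward later in this lesson, while `distinct_n`, `repetition_penalty`, and the `weight` parameter are hypothetical names introduced here. A real setup would likely count tokenizer tokens rather than split on whitespace.

```python
def length_penalty(response: str, budget: int = 500, per_token: float = 0.001) -> float:
    """Linear penalty for every whitespace token beyond the budget (0 if under)."""
    return -per_token * max(0, len(response.split()) - budget)


def distinct_n(response: str, n: int = 3) -> float:
    """Fraction of n-grams that are unique; 1.0 means no n-gram repeats."""
    tokens = response.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any n-gram, so nothing repeats
    return len(set(ngrams)) / len(ngrams)


def repetition_penalty(response: str, n: int = 3, weight: float = 0.1) -> float:
    """Scale the share of repeated n-grams into a small negative reward."""
    return -weight * (1.0 - distinct_n(response, n))
```

An n-gram ratio catches looping phrases that a unique-word ratio can miss, because a response may reuse common words legitimately while never repeating a three-word sequence.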
C. Safety
- Toxicity penalty
- Refusal accuracy (good refusal vs bad refusal)
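Refusal accuracy needs a label saying whether the prompt should be refused, plus some way to detect a refusal. The sketch below uses a keyword heuristic purely for illustration: the marker list, the function names, and the asymmetric penalties (-0.5 for over-refusing a benign prompt, -1.0 for complying with a harmful one) are all assumptions, and a trained classifier or judge model would normally replace `is_refusal`.

```python
# Hypothetical marker phrases; a real system would use a classifier instead.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't", "i'm unable to")


def is_refusal(response: str) -> bool:
    """Crude heuristic: does the response contain a refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_reward(response: str, should_refuse: bool) -> float:
    """Reward correct refusal decisions; penalize the two error types differently."""
    refused = is_refusal(response)
    if refused == should_refuse:
        return 1.0  # good refusal, or good compliance
    # Assumed weighting: complying with a harmful prompt is the worse error.
    return -0.5 if refused else -1.0
```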
```python
# === Math reward with SymPy ===
import re

import sympy


def math_reward_sympy(prompt, response, gold_expr):
    """Expression equality via SymPy; more robust than string matching."""
    # Extract the final answer after the "####" marker
    match = re.search(r"####\s*([^\n]+)", response)
    if not match:
        return -1.0  # no final answer found
    try:
        # Note: sympify evaluates its input with eval; sanitize untrusted text
        pred = sympy.sympify(match.group(1).strip())
        gold = sympy.sympify(gold_expr)
        # Symbolic compare (also handles 2/4 == 0.5)
        if sympy.simplify(pred - gold).is_zero:
            return 1.0
        return -0.5  # parsed, but wrong
    except Exception:
        return -1.0  # parse error


# Examples:
# pred="0.5", gold="1/2"  → reward = 1.0 (equal)
# pred="x+1", gold="1+x"  → reward = 1.0 (algebraically equal)
```
SymPy-based math reward
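The category list also mentions a lighter option: regex extraction plus numerical comparison. A possible standard-library version is sketched below. The name `math_reward_numeric`, the helper `_to_float`, the stripping of commas and dollar signs, and the relative tolerance are assumptions for this example; it reuses the `####` answer marker and the -1.0 / -0.5 / 1.0 scale from the SymPy reward above. It handles plain numbers and simple fractions only, so symbolic answers still need SymPy.

```python
import math
import re


def _to_float(text: str) -> float:
    """Parse '1,234', '$5', '0.5' or 'a/b' into a float (raises on anything else)."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    if "/" in cleaned:
        numerator, denominator = cleaned.split("/", 1)
        return float(numerator) / float(denominator)
    return float(cleaned)


def math_reward_numeric(response: str, gold: str, rel_tol: float = 1e-6) -> float:
    """Compare the number after '####' with the gold answer numerically."""
    match = re.search(r"####\s*([^\n]+)", response)
    if not match:
        return -1.0  # no final answer found
    try:
        pred = _to_float(match.group(1))
        target = _to_float(gold)
    except (ValueError, ZeroDivisionError):
        return -1.0  # not a parseable number
    return 1.0 if math.isclose(pred, target, rel_tol=rel_tol) else -0.5
```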
```python
# === Code execution reward ===
import os
import re
import subprocess
import sys
import tempfile


def code_exec_reward(prompt, response, test_cases, timeout=5):
    """Extract the Python code from the response and run it against the test cases."""
    # Extract the first fenced Python block
    code_match = re.search(r"```python\n(.*?)```", response, re.DOTALL)
    if not code_match:
        return -1.0
    code = code_match.group(1)

    # Write the code plus the test statements to a temporary file
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False, mode="w") as f:
        f.write(code + "\n\n")
        for test in test_cases:
            f.write(test + "\n")
        tmp_path = f.name

    try:
        # Run with the current interpreter; any failing assert gives a nonzero exit code
        result = subprocess.run(
            [sys.executable, tmp_path],
            capture_output=True, text=True, timeout=timeout,
        )
        passed = result.returncode == 0
    except subprocess.TimeoutExpired:
        passed = False
    finally:
        os.unlink(tmp_path)

    return 1.0 if passed else -0.5


# Sandbox: in production this runs inside a Docker container
# Details are in Part XVII of the Cookbook
```
Code execution reward (sandboxed in production)
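The reward above is all-or-nothing: a single failing assert yields -0.5. One way to give partial credit is to run each test in its own subprocess and scale the pass fraction onto the same -0.5 to 1.0 range. This is a sketch under assumptions: `run_one` and `code_partial_reward` are hypothetical names, the linear mapping is a design choice rather than the Cookbook's method, and untrusted code should still run inside a sandbox.

```python
import subprocess
import sys


def run_one(code: str, test: str, timeout: float = 5) -> bool:
    """Run the code followed by one test statement; True if the process exits with 0."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code + "\n" + test],
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def code_partial_reward(code: str, test_cases: list, timeout: float = 5) -> float:
    """Map the fraction of passing tests linearly onto [-0.5, 1.0]."""
    if not test_cases:
        return 0.0  # nothing to verify; neutral reward (assumption)
    passed = sum(run_one(code, test, timeout) for test in test_cases)
    return -0.5 + 1.5 * (passed / len(test_cases))
```

Running tests separately costs one process per test, so it trades throughput for a denser signal.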
2. Composable Reward: Multiple Signals
```python
def composable_reward(prompt, response, gold, tests):
    words = response.split()
    # math_reward_sympy is the SymPy reward from section 1; range -1 to 1
    correctness = math_reward_sympy(prompt, response, gold)
    format_bonus = 0.2 if "<think>" in response and "</think>" in response else 0.0
    length_penalty = -0.001 * max(0, len(words) - 500)
    # Guard against empty responses before dividing
    unique_ratio = len(set(words)) / len(words) if words else 0.0
    diversity_bonus = 0.1 if unique_ratio > 0.6 else 0.0
    return correctness + format_bonus + length_penalty + diversity_bonus
```
Design rules:
- Correctness > everything else: give it a large weight (10-100x)
- Format/length/diversity: small bonus/penalty terms (0.1-0.5)
- Normalize: large outlier rewards blow up the gradient
- Smooth: use graded values instead of a discrete 0/1 (e.g. 0.0, 0.3, 0.7, 1.0)
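The "Normalize" rule can be illustrated by clipping rewards and then standardizing them within a sampled group, which is also how GRPO forms its group-relative advantages. The sketch below is illustrative only: `normalize_group`, the clip range, and `eps` are assumed names and values, and if the trainer already standardizes per group, only the clipping step adds anything.

```python
import statistics


def normalize_group(rewards, clip=(-2.0, 2.0), eps=1e-6):
    """Clip outliers, then standardize rewards within one group of samples."""
    low, high = clip
    clipped = [min(high, max(low, r)) for r in rewards]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped)  # population std over the group
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in clipped]
```

With the clip in place, a single reward of 100.0 in a group is capped at 2.0 before standardization, so it cannot dominate the group's mean and variance.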
✅ Deliverables
1) Write a composable reward for GSM8K (correctness + format + length).
2) Use it in GRPO.
3) Next lesson: 11.9, Process Reward Models.