Reward Function Engineering: Verifiable, Math, Code, Format, Length, Diversity

A reward function defines what counts as success for GRPO/PPO. This lesson covers math rewards (regex/SymPy), code rewards (execution + tests), format (chat template adherence), length (anti-rambling), diversity (n-gram penalty), and how to compose them. It is the Cookbook's reward function design guide.

Şükrü Yusuf KAYA
28 min read
Advanced
1. Reward Categories#

A. Verifiable correctness#

  • Math: regex extraction + numerical comparison, or SymPy AST equality
  • Code: execute + assert test cases
  • Structured output: JSON schema validation
  • Format: chat template adherence (required tags, etc.)
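
The structured-output and format checks in the list above can be sketched as follows. This is a minimal illustration, not the Cookbook's implementation: `REQUIRED_FIELDS`, both function names, and the reward values are assumptions made for the example, and a real JSON schema would more likely be checked with a dedicated validator library.

```python
import json
import re

# Hypothetical schema for illustration: required key -> accepted Python type(s).
REQUIRED_FIELDS = {"answer": (int, float), "explanation": str}

def json_schema_reward(response: str) -> float:
    """Score a response that is supposed to be a JSON object with REQUIRED_FIELDS."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return -1.0  # not parseable as JSON at all
    if not isinstance(obj, dict):
        return -0.5  # valid JSON, but not an object
    valid = sum(
        1
        for key, typ in REQUIRED_FIELDS.items()
        # bool is a subclass of int in Python, so exclude it explicitly
        if key in obj and isinstance(obj[key], typ) and not isinstance(obj[key], bool)
    )
    return valid / len(REQUIRED_FIELDS)  # partial credit per valid field

# Assumed format: one non-empty <think>...</think> block, then a visible answer.
THINK_PATTERN = re.compile(r"^\s*<think>.+?</think>\s*\S", re.DOTALL)

def think_format_reward(response: str) -> float:
    """Small bonus when the response follows the assumed think-then-answer layout."""
    well_formed = THINK_PATTERN.search(response) and response.count("<think>") == 1
    return 0.2 if well_formed else 0.0
```

Giving partial credit per valid field, rather than all-or-nothing, keeps the signal informative early in training when most outputs are only partly well-formed.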

B. Continuous quality (less verifiable)#

  • Length penalty (anti-rambling)
  • Diversity bonus (anti-repetition)
  • Coherence (perplexity by judge model)
  • Style adherence (formality, etc.)
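
The length and diversity signals above can be sketched as below. The 500-word budget, the 0.001 per-word penalty, the 0.6 threshold, and the 0.1 bonus are taken from the composable reward later in this lesson; the function names and the choice of word-level trigrams are assumptions for illustration.

```python
def length_penalty(response: str, max_words: int = 500, per_word: float = 0.001) -> float:
    """Linear penalty for each word beyond max_words; zero for short responses."""
    excess = max(0, len(response.split()) - max_words)
    return -per_word * excess

def distinct_ngram_ratio(response: str, n: int = 3) -> float:
    """Fraction of word n-grams that are unique (1.0 means no repetition)."""
    tokens = response.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain a repeated n-gram
    return len(set(ngrams)) / len(ngrams)

def diversity_bonus(response: str, n: int = 3, threshold: float = 0.6, bonus: float = 0.1) -> float:
    """Small bonus when the response is not dominated by repeated n-grams."""
    return bonus if distinct_ngram_ratio(response, n) > threshold else 0.0
```

An n-gram ratio is harder to game than a unique-word ratio, because a model that loops over the same phrase still uses many distinct words but very few distinct trigrams.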

C. Safety#

  • Toxicity penalty
  • Refusal accuracy (good refusal vs bad refusal)
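
As a rough sketch of refusal accuracy, the reward below compares whether the model refused with whether it should have. Everything here is an assumption for illustration: the `should_refuse` label must come from the dataset, the keyword list `REFUSAL_MARKERS` stands in for a proper refusal classifier, and the penalty values are arbitrary.

```python
# Illustrative phrases only; a production system would use a trained classifier.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't", "i'm unable to")

def looks_like_refusal(response: str) -> bool:
    """Crude keyword check for refusal language."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_accuracy_reward(response: str, should_refuse: bool) -> float:
    """Reward correct refuse/comply decisions; penalize the two error types differently."""
    refused = looks_like_refusal(response)
    if refused == should_refuse:
        return 1.0  # good refusal, or good compliance
    # Complying with a harmful request is penalized more than over-refusing.
    return -1.0 if should_refuse else -0.5
```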
python
# === Math reward with SymPy ===
import re

import sympy


def math_reward_sympy(prompt, response, gold_expr):
    """Expression equality via SymPy: more robust than string matching."""
    # Extract the final answer after the "####" marker
    match = re.search(r"####\s*([^\n]+)", response)
    if not match:
        return -1.0  # no final answer found

    try:
        # Note: sympify evaluates its input, so only feed it text you are willing to evaluate
        pred = sympy.sympify(match.group(1).strip())
        gold = sympy.sympify(gold_expr)
        # Symbolic compare: the difference simplifies to zero (also handles 2/4 == 0.5)
        if sympy.simplify(pred - gold) == 0:
            return 1.0
        return -0.5  # parsed, but wrong
    except Exception:
        return -1.0  # parse error


# Examples:
# pred="0.5", gold="1/2" → reward = 1.0 (equal)
# pred="x+1", gold="1+x" → reward = 1.0 (algebraically equal)
SymPy-based math reward
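
For purely numeric answers, the "regex extraction + numerical comparison" option from the category list can be sketched without SymPy. The function name, the stripping of commas and dollar signs, and the tolerance are assumptions; the reward values mirror the SymPy version above.

```python
import math
import re

def math_reward_numeric(response: str, gold: float, rel_tol: float = 1e-6) -> float:
    """Compare the number after '####' with the gold value, within a tolerance."""
    match = re.search(r"####\s*([^\n]+)", response)
    if not match:
        return -1.0  # no final answer found
    # Drop thousands separators and currency signs, e.g. "$1,234" -> "1234"
    raw = match.group(1).strip().replace(",", "").replace("$", "")
    try:
        pred = float(raw)
    except ValueError:
        return -1.0  # answer is not a plain number
    return 1.0 if math.isclose(pred, gold, rel_tol=rel_tol, abs_tol=1e-9) else -0.5
```

This variant is cheaper and never evaluates model output as an expression, at the cost of rejecting symbolic answers such as `1/2` or `x+1`.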
python
# === Code execution reward ===
import os
import re
import subprocess
import sys
import tempfile


def code_exec_reward(prompt, response, test_cases, timeout=5):
    """Extract the Python code from the response and run it against the test cases."""
    # Extract the first fenced python block (`{3} matches three backticks)
    code_match = re.search(r"`{3}python\n(.*?)`{3}", response, re.DOTALL)
    if not code_match:
        return -1.0
    code = code_match.group(1)

    # Write the code and its tests to a temporary script
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False, mode="w") as f:
        f.write(code + "\n\n")
        for test in test_cases:
            f.write(test + "\n")
        tmp_path = f.name

    # Run it; a failed assert, a crash, or a timeout all count as failure
    try:
        result = subprocess.run(
            [sys.executable, tmp_path],
            capture_output=True, text=True, timeout=timeout,
        )
        passed = result.returncode == 0
    except subprocess.TimeoutExpired:
        passed = False
    finally:
        os.unlink(tmp_path)

    return 1.0 if passed else -0.5


# Sandbox: in production this runs inside a Docker container
# Details are in Part XVII of the Cookbook
Code execution reward (sandboxed)
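
The reward above is all-or-nothing: one failing assert gives the same score as failing every test. A graded variant, in the spirit of the "smooth" design rule later in this lesson, can be sketched by running each test in its own subprocess and returning the pass fraction. The function name and the [0, 1] scale are assumptions, and like the version above it still needs a real sandbox before running untrusted code.

```python
import subprocess
import sys

def code_partial_reward(code: str, test_cases: list[str], timeout: float = 5.0) -> float:
    """Fraction of test cases that pass, each executed in a separate subprocess."""
    if not test_cases:
        return 0.0
    passed = 0
    for test in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code + "\n\n" + test],
                capture_output=True, text=True, timeout=timeout,
            )
            passed += result.returncode == 0  # True counts as 1
        except subprocess.TimeoutExpired:
            pass  # a timeout counts as a failed test
    return passed / len(test_cases)

if __name__ == "__main__":
    solution = "def add(a, b):\n    return a + b"
    tests = ["assert add(1, 2) == 3", "assert add(1, 1) == 3"]
    print(code_partial_reward(solution, tests))  # one of the two tests passes
```

Running tests separately costs one process start per test, so the timeout and the number of tests per prompt should be kept small.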

2. Composable Reward: Multiple Signals#

python
def composable_reward(prompt, response, gold, tests):
    correctness = math_reward_sympy(prompt, response, gold)  # -1 to 1
    format_bonus = 0.2 if "<think>" in response and "</think>" in response else 0
    words = response.split()
    length_penalty = -0.001 * max(0, len(words) - 500)
    # Guard against empty responses to avoid dividing by zero
    unique_ratio = len(set(words)) / len(words) if words else 0.0
    diversity_bonus = 0.1 if unique_ratio > 0.6 else 0
    return correctness + format_bonus + length_penalty + diversity_bonus
Design rules:
  1. Correctness outweighs everything else: give it a large weight (10-100x)
  2. Format/length/diversity: small bonus/penalty terms (0.1-0.5)
  3. Normalize: large outlier rewards blow up the gradient
  4. Smooth: prefer graded values over a discrete 0/1 (e.g. 0.0, 0.3, 0.7, 1.0)
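
The normalization rule can be sketched as a per-group z-score with clipping, similar in spirit to the group-relative advantage used in GRPO. The function name, the clip value, and the epsilon are assumptions for illustration; training libraries often apply a comparable step internally, so check before normalizing twice.

```python
import statistics

def normalize_group_rewards(rewards: list[float], clip: float = 5.0, eps: float = 1e-6) -> list[float]:
    """Z-score the rewards of one prompt's sample group, then clip outliers."""
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every sample has the same reward
    return [max(-clip, min(clip, (r - mean) / (std + eps))) for r in rewards]
```

When all samples in a group receive the same reward, every normalized value is zero and that prompt contributes no learning signal, which is one more reason to prefer graded rewards over a strict 0/1.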
✅ Deliverable
  1. Write a composable reward for GSM8K (correctness + format + length).
  2. Use it in GRPO.
  3. Next lesson: 11.9, Process Reward Models.
