Reward Function Engineering: Verifiable, Math, Code, Format, Length, Diversity

The reward function is the definition of success for GRPO/PPO. This lesson covers rewards for math (regex/SymPy), code (execute + test), format (chat template adherence), length (anti-rambling), and diversity (n-gram penalty), and how to compose them. It is the Cookbook's reward function design guide.

Şükrü Yusuf KAYA

28 min read

6/24/2026

Advanced


1. Reward Categories

A. Verifiable correctness

Math: regex extraction + numerical comparison, or SymPy expression equality
Code: execute + assert test cases
Structured output: JSON schema validation
Format: chat template adherence (required tags, etc.)
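
The structured-output item above can be sketched with the standard library alone. This is a minimal sketch, assuming the response should be a single JSON object; `REQUIRED_FIELDS`, the function name, and the reward values are illustrative assumptions, and a full JSON Schema validator would replace the hand-written type check in practice.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s). An assumption for illustration.
REQUIRED_FIELDS = {"answer": (int, float), "reasoning": str}

def json_format_reward(response: str) -> float:
    """Score how well a response matches the expected JSON object shape."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return -1.0  # not parseable as JSON at all
    if not isinstance(obj, dict):
        return -0.5  # valid JSON, but not an object
    ok = sum(
        1 for name, typ in REQUIRED_FIELDS.items()
        # bool is a subclass of int in Python, so exclude it explicitly
        if name in obj and isinstance(obj[name], typ) and not isinstance(obj[name], bool)
    )
    # Fraction of required fields present with the right type, in [0, 1]
    return ok / len(REQUIRED_FIELDS)
```

Returning a fraction rather than a hard 0/1 gives the policy some signal when only part of the structure is right.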

B. Continuous quality (less verifiable)

Length penalty (anti-rambling)
Diversity bonus (anti-repetition)
Coherence (perplexity by judge model)
Style adherence (formality, etc.)
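
The length and diversity items above can be sketched as two small shaping terms. This is a rough sketch, not a definitive implementation: whitespace splitting stands in for a real tokenizer, the function names are hypothetical, and the defaults (500 words, 0.001 per word, 0.6 threshold, 0.1 weight) simply mirror the constants used in the composable reward in section 2.

```python
def length_penalty(response: str, max_words: int = 500, per_word: float = 0.001) -> float:
    """Linear penalty for every word beyond max_words; zero within the budget."""
    overflow = max(0, len(response.split()) - max_words)
    return -per_word * overflow

def distinct_ngram_ratio(response: str, n: int = 3) -> float:
    """Fraction of n-grams that are unique; low values indicate repetition."""
    tokens = response.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # too short to contain any repeated n-gram
    return len(set(ngrams)) / len(ngrams)

def repetition_penalty(response: str, n: int = 3,
                       threshold: float = 0.6, weight: float = 0.1) -> float:
    """Penalize only when the distinct n-gram ratio falls below the threshold."""
    return -weight if distinct_ngram_ratio(response, n) < threshold else 0.0
```

For example, `"a b c a b c a b c"` has 7 trigrams of which 3 are unique, so its ratio is 3/7 and it falls under the 0.6 threshold.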

C. Safety

Toxicity penalty
Refusal accuracy (good refusal vs bad refusal)
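
Refusal accuracy can be pictured as a small payoff table. The sketch below assumes two inputs that the lesson does not define: a per-prompt label `should_refuse` (from the dataset) and a flag `did_refuse` (from some classifier or judge model). Both names and all four reward values are illustrative assumptions.

```python
def refusal_reward(should_refuse: bool, did_refuse: bool) -> float:
    """Score a response against a label saying whether refusing is the desired behavior."""
    if should_refuse and did_refuse:
        return 1.0    # good refusal: a request that should be declined was declined
    if should_refuse and not did_refuse:
        return -1.0   # complied with a request that should have been declined
    if did_refuse:
        return -0.5   # bad refusal: over-refused a benign request
    return 0.0        # normal answer: no safety term; correctness is scored elsewhere
```

Keeping the over-refusal penalty non-zero is the design choice worth noting: without it, refusing everything would be a safe way to collect reward.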

python

# === Math reward with SymPy ===
import re
import sympy
 
def math_reward_sympy(prompt, response, gold_expr):
    """Expression equality via SymPy: a more robust check."""
    # Extract the final answer that follows the "####" marker
    match = re.search(r"####\s*([^\n]+)", response)
    if not match:
        return -1.0   # no final answer found
 
    try:
        # Note: sympify evaluates its input, so only use it on trusted or sandboxed text
        pred = sympy.sympify(match.group(1).strip())
        gold = sympy.sympify(gold_expr)
 
        # Symbolic compare (also handles 2/4 vs 0.5 and x+1 vs 1+x)
        diff = sympy.simplify(pred - gold)
        if diff == 0 or diff.is_zero:
            return 1.0
        return -0.5
    except Exception:
        return -1.0   # parse or arithmetic error
 
# Examples:
# pred="0.5", gold="1/2" → reward = 1.0 (equal)
# pred="x+1", gold="1+x" → reward = 1.0 (algebraic equal)

SymPy-based math reward
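
The simpler variant listed under verifiable correctness, regex extraction plus numerical comparison, can be sketched as follows. It assumes the same `####` final-answer marker and the same reward values as the SymPy version; the function name, the tolerance defaults, and the stripping of `,`, `$`, and `%` are assumptions made for illustration.

```python
import math
import re

_ANSWER_RE = re.compile(r"####\s*([^\n]+)")

def math_reward_numeric(response: str, gold: float, rel_tol: float = 1e-6) -> float:
    """Extract the text after '####' and compare it to gold as a float."""
    match = _ANSWER_RE.search(response)
    if not match:
        return -1.0  # no final-answer marker
    # Strip common formatting: thousands separators, currency and percent signs
    text = match.group(1).strip().replace(",", "").strip("$%")
    try:
        pred = float(text)
    except ValueError:
        return -1.0  # marker present, but the answer is not a plain number
    return 1.0 if math.isclose(pred, gold, rel_tol=rel_tol, abs_tol=1e-9) else -0.5
```

This version is cheaper and never evaluates model output, but it rejects answers such as `1/2` or `x+1`, which is where the SymPy comparison earns its cost.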

python

# === Code execution reward ===
import os
import re
import subprocess
import sys
import tempfile
 
def code_exec_reward(prompt, response, test_cases, timeout=5):
    """Extract the Python code from the response and run it against the test cases."""
    # Extract the first ```python fenced block
    code_match = re.search(r"```python\n(.*?)```", response, re.DOTALL)
    if not code_match:
        return -1.0
    code = code_match.group(1)
 
    # Write the code plus the assert-style tests to a temporary script
    with tempfile.NamedTemporaryFile(suffix=".py", delete=False, mode="w") as f:
        f.write(code + "\n\n")
        for test in test_cases:
            f.write(test + "\n")
        tmp_path = f.name
 
    try:
        # Run with the current interpreter; exit code 0 means every test passed
        result = subprocess.run(
            [sys.executable, tmp_path],
            capture_output=True, text=True, timeout=timeout,
        )
        passed = result.returncode == 0
    except subprocess.TimeoutExpired:
        passed = False
    finally:
        os.unlink(tmp_path)
 
    return 1.0 if passed else -0.5
 
# Sandbox: in production, run this inside a Docker container
# Details are in Part XVII of the Cookbook

Code execution reward (sandbox it in production)
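
The reward above is all-or-nothing: one failing assert gives the same score as ten. A graded alternative is to run each test separately and return the fraction that pass. The sketch below is one possible way to do that, assuming the code has already been extracted from the response; `code_partial_reward` is a hypothetical name, and like the original it is not sandboxed.

```python
import subprocess
import sys

def code_partial_reward(code: str, test_cases: list[str], timeout: float = 5.0) -> float:
    """Run each test in its own interpreter and return the fraction that pass (0.0 to 1.0)."""
    if not test_cases:
        return 0.0
    passed = 0
    for test in test_cases:
        program = code + "\n\n" + test + "\n"
        try:
            # NOTE: this executes untrusted code directly; wrap it in a sandbox before real use
            result = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True, text=True, timeout=timeout,
            )
            passed += result.returncode == 0
        except subprocess.TimeoutExpired:
            pass  # a hung test counts as a failure
    return passed / len(test_cases)
```

The trade-off is cost: one interpreter start per test instead of one per response, in exchange for a smoother signal.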

2. Composable Reward: Multiple Signals

python

def composable_reward(prompt, response, gold, tests):
    # `tests` is unused in this math-only example
    correctness = math_reward_sympy(prompt, response, gold)   # -1 to 1
    format_bonus = 0.2 if "<think>" in response and "</think>" in response else 0
    words = response.split()
    length_penalty = -0.001 * max(0, len(words) - 500)
    # Unique-word ratio as a cheap diversity proxy; guard against an empty response
    diversity_bonus = 0.1 if words and len(set(words)) / len(words) > 0.6 else 0
    return correctness + format_bonus + length_penalty + diversity_bonus

Design rules:

Correctness > everything else: give it a large weight (10-100x)
Format/length/diversity: small bonus/penalty (0.1-0.5)
Normalize: large outlier rewards blow up the gradient
Smooth: use graded values instead of a discrete 0/1 (e.g. 0.0, 0.3, 0.7, 1.0)
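
The weighting and normalization rules above can be sketched as two helpers. This is a minimal sketch under stated assumptions: `combine` and `group_normalize` are hypothetical names, the weight of 10 and cap of 0.5 are picked from the ranges given in the rules, and the clip value of 5 is arbitrary. GRPO already standardizes rewards within each group of samples for a prompt when computing advantages, so the second helper mainly illustrates the idea and adds outlier clipping.

```python
import statistics

def combine(correctness: float, shaping: float,
            w_correct: float = 10.0, shaping_cap: float = 0.5) -> float:
    """Weight correctness heavily and clamp the summed shaping terms to +/- shaping_cap."""
    shaping = max(-shaping_cap, min(shaping_cap, shaping))
    return w_correct * correctness + shaping

def group_normalize(rewards: list[float], clip: float = 5.0, eps: float = 1e-6) -> list[float]:
    """Z-score rewards within one group of samples for the same prompt, then clip outliers."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample in the group got the same reward
    return [max(-clip, min(clip, (r - mean) / (std + eps))) for r in rewards]
```

Clamping the shaping terms means no amount of format, length, or diversity bonus can outweigh a wrong answer, which is the point of the first rule.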

✅ Deliverable

1) Write a composable reward for GSM8K (correctness + format + length). 2) Use it in GRPO. 3) Next lesson: 11.9, Process Reward Models.
