# GRPO (Group Relative Policy Optimization): DeepSeek-R1's Verifiable Reward Recipe

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-grpo-deepseek-r1-verifiable-reward
> Updated: 2026-05-14T14:42:58.346Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part XI — Alignment & Preference Optimization
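
**TLDR:** GRPO (DeepSeek, 2024) is a simplified variant of PPO that drops the critic/value head. For each prompt it samples a group of G responses and normalizes each response's reward **relative to the group** (subtract the group mean, divide by the group standard deviation) to obtain advantages. Verifiable rewards (math correctness, code execution) make this usable for reasoning RL. Qwen-7B + GRPO on GSM8K: accuracy +5-8% on an RTX 4090.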
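To make the group-relative step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the recipe's actual training code: the helper names `exact_match_reward` and `group_relative_advantages` are hypothetical, the reward simply compares the last number in a response to the reference answer, and the population standard deviation is used (some implementations use the sample standard deviation instead).

```python
import re
from statistics import mean, pstdev


def exact_match_reward(response: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the last number in the
    response equals the reference answer, else 0.0."""
    # Strip thousands separators, then collect every integer/decimal.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference) else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of G responses to the same prompt.

    Assumption: population std (pstdev); eps avoids division by zero
    when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of G = 4 sampled responses to a single prompt (made-up strings).
group = ["... so the answer is 18.", "I get 20", "Total: 18", "It is 17"]
rewards = [exact_match_reward(r, "18") for r in group]
advantages = group_relative_advantages(rewards)
```

Correct responses end up with positive advantages and incorrect ones with negative advantages, without any learned value function. If every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.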

