# GRPO RL Stage: Math + Code Reward — Convergence Numbers (Qwen-7B + GSM8K +5-8%)

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-grpo-rl-math-code-reward
> Updated: 2026-05-14T14:42:59.046Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part XII — Reasoning Model FT (R1-style)
**TLDR:** The final stage of the reasoning-model recipe: GRPO reinforcement learning on top of the SFT base, using math-correctness and code-execution rewards. Reward shaping (correctness 1.0, format bonus 0.2, length penalty 0.001 per token), group-relative advantage normalization, and a KL constraint against the reference policy. Qwen 2.5 7B-Instruct on GSM8K with an RTX 4090: 6-8 h of training, accuracy +5-8%.
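
The reward shaping and advantage normalization mentioned in the TLDR can be sketched as follows. This is a minimal illustration, not the article's actual implementation: the `shaped_reward` and `group_advantages` helpers, the `#### <answer>` regex (borrowed from the GSM8K answer format), and the sample completions are all assumptions for demonstration; the weights (1.0 / 0.2 / 0.001) come from the TLDR.

```python
import math
import re

def shaped_reward(completion: str, gold_answer: str) -> float:
    """Hypothetical shaped reward: correctness 1.0 + format 0.2 - 0.001/token."""
    # Correctness: 1.0 if the final "#### <answer>" matches the gold answer.
    m = re.search(r"####\s*(-?\d+)", completion)
    correct = 1.0 if (m and m.group(1) == gold_answer) else 0.0
    # Format bonus: 0.2 for emitting the expected answer marker at all.
    fmt = 0.2 if m else 0.0
    # Length penalty: 0.001 per whitespace-delimited token.
    length_pen = 0.001 * len(completion.split())
    return correct + fmt - length_pen

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# One group of completions sampled for the same GSM8K prompt (toy examples).
group = [
    "Step 1 ... #### 42",       # correct answer, correct format
    "Step 1 ... #### 41",       # wrong answer, correct format
    "The answer is forty-two",  # no answer marker
]
rewards = [shaped_reward(c, "42") for c in group]
advs = group_advantages(rewards)
```

Because GRPO normalizes within the group rather than using a learned value baseline, the correct completion receives a positive advantage and the others negative, with the advantages summing to roughly zero.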

