GRPO RL Stage: Math + Code Reward — Convergence Numbers (Qwen-7B + GSM8K +5-8%)

Final stage of the reasoning-model pipeline: GRPO reinforcement learning with math-correctness and code-execution rewards on top of the SFT base. Covers reward shaping (correctness 1.0, format 0.2, length penalty 0.001), advantage normalization, and the KL constraint. Qwen 2.5 7B-Instruct + GSM8K on an RTX 4090: 6-8 h of training, accuracy +5-8%.
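As a rough illustration of the reward shaping and advantage normalization named above, here is a minimal sketch. The three weights come from the summary; everything else is an assumption made for the example: the function names, the additive combination, the per-token reading of the length penalty, and the GSM8K-style `#### <answer>` format check. The KL constraint is left out, since it is applied in the policy loss rather than in the reward.

```python
import re
import statistics

# Weights from the summary; how they are combined is an assumption of this sketch.
W_CORRECT, W_FORMAT, LEN_PENALTY = 1.0, 0.2, 0.001

def shaped_reward(completion: str, gold_answer: str, n_tokens: int) -> float:
    """Hypothetical reward: correctness + format bonus - per-token length penalty."""
    # Assumed format: the final answer appears as "#### 42" (GSM8K convention).
    match = re.search(r"####\s*(-?[\d,\.]+)", completion)
    format_ok = match is not None
    correct = format_ok and match.group(1).replace(",", "").rstrip(".") == gold_answer
    return W_CORRECT * correct + W_FORMAT * format_ok - LEN_PENALTY * n_tokens

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: standardize rewards across one prompt's samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When every sample in a group earns the same reward, the standard deviation is zero and all advantages come out as zero, so that prompt contributes no gradient signal in this sketch.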

Şükrü Yusuf KAYA

30 min read

5/14/2026

Advanced

1. GRPO Convergence Numbers (Cookbook measurements)

Qwen 2.5 7B-Instruct + GSM8K + GRPO (reference: cookbook Part XI, Lesson 11.7):

Step	GSM8K accuracy (%)	Avg reward
0 (base)	85.4	0.85
100	86.8	0.87
200	88.2	0.89
400	89.5	0.91
800	90.6	0.93
1500	91.2	0.94
3000	91.5	0.94 (plateau)

Convergence pattern:

First 200 steps: fast improvement
200-800: moderate pace
800-1500: slowing down
1500+: plateau, marginal gains
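The four phases can be checked directly against the table by computing the accuracy gained per 100 steps between consecutive checkpoints. The sketch below uses only the rows listed above; the helper name `gain_per_100_steps` is made up for this example.

```python
# (step, GSM8K accuracy %) rows copied from the table above.
curve = [(0, 85.4), (100, 86.8), (200, 88.2), (400, 89.5),
         (800, 90.6), (1500, 91.2), (3000, 91.5)]

def gain_per_100_steps(points):
    """Accuracy points gained per 100 steps between consecutive checkpoints."""
    out = []
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        out.append((s0, s1, round((a1 - a0) / (s1 - s0) * 100, 3)))
    return out

for s0, s1, g in gain_per_100_steps(curve):
    print(f"{s0:>4}-{s1:<4} {g:+.3f} pts / 100 steps")
```

On these rows the rate falls from about 1.4 points per 100 steps in the first 200 steps to about 0.02 after step 1500, roughly a 70x drop in return per step.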

Cookbook rule: 1500-2000 steps are enough for GRPO. Spend any additional compute on DPO or on another domain.
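One way to turn this rule into an automatic check is a plateau test on the evaluation history. The sketch below is not from the cookbook: the function `should_stop` and its thresholds (`min_gain=0.5` points over `window_steps=1000`) are hypothetical values chosen only so the example lines up with the table.

```python
# (step, GSM8K accuracy %) rows copied from the table above.
curve = [(0, 85.4), (100, 86.8), (200, 88.2), (400, 89.5),
         (800, 90.6), (1500, 91.2), (3000, 91.5)]

def should_stop(history, min_gain=0.5, window_steps=1000):
    """Return True when accuracy rose by less than min_gain points
    over (at least) the last window_steps training steps.

    history: list of (step, accuracy %) tuples sorted by step.
    """
    if not history:
        return False
    last_step, last_acc = history[-1]
    # Checkpoints that lie at least window_steps behind the latest one.
    past = [(s, a) for s, a in history if last_step - s >= window_steps]
    if not past:
        return False  # not enough history to judge yet
    _, ref_acc = past[-1]  # most recent of those checkpoints
    return (last_acc - ref_acc) < min_gain

print(should_stop(curve[:6]))  # history up to step 1500
print(should_stop(curve))      # history up to step 3000
```

With these assumed thresholds the check still allows training at step 1500 (+1.7 points since step 400) and signals a stop at step 3000 (+0.3 points since step 1500). Denser evaluation checkpoints would localize the stopping point more precisely.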

✅ Deliverables

1) Run the GRPO Lab from Part XI 11.7. 2) Inspect the convergence curve. 3) Next lesson: 12.5, Long-CoT Stability.
