GRPO RL Stage: Math + Code Reward — Convergence Numbers (Qwen-7B + GSM8K +5-8%)
The last stage of the reasoning-model pipeline: reinforcement learning with GRPO. GRPO runs with math-correctness and code-execution rewards on top of the SFT base, using reward shaping (correctness 1.0, format 0.2, length penalty 0.001), group advantage normalization, and a KL constraint to the reference policy. Setup: Qwen 2.5 7B-Instruct + GSM8K on an RTX 4090 — 6-8 hours, accuracy +5-8%.
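The reward shaping and group advantage normalization described above can be sketched as follows. This is a minimal toy, assuming the stated weights (correctness 1.0, format 0.2, length penalty 0.001); `shaped_reward`, `group_advantages`, and the exact-match correctness check are illustrative names, not the cookbook's implementation.

```python
import statistics

def shaped_reward(response: str, reference: str, has_format: bool) -> float:
    """Toy shaped reward: correctness + format bonus - length penalty.

    Weights follow the text (1.0 / 0.2 / 0.001); the exact-match check
    stands in for a real math-answer or code-execution verifier.
    """
    correctness = 1.0 if response.strip() == reference.strip() else 0.0
    format_bonus = 0.2 if has_format else 0.0
    length_penalty = 0.001 * len(response.split())
    return correctness + format_bonus - length_penalty

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize rewards within one sampled group
    (subtract the group mean, divide by the group std)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]
```

Because advantages are computed relative to the group mean, GRPO needs no learned value model: a completion is only rewarded for being better than its siblings sampled from the same prompt.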
Şükrü Yusuf KAYA
30 min read
## 1. GRPO Convergence Numbers (Cookbook Measurements)
Qwen 2.5 7B-Instruct + GSM8K + GRPO (reference: cookbook Part XI, Lesson 11.7):
| Step | GSM8K accuracy (%) | Avg reward |
|---|---|---|
| 0 (base) | 85.4 | 0.85 |
| 100 | 86.8 | 0.87 |
| 200 | 88.2 | 0.89 |
| 400 | 89.5 | 0.91 |
| 800 | 90.6 | 0.93 |
| 1500 | 91.2 | 0.94 |
| 3000 | 91.5 | 0.94 (plateau) |
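The KL-constrained objective behind these numbers can be sketched per token. This is a generic GRPO-style loss, a sketch under assumptions: the clip range (0.2), KL coefficient `beta=0.04`, and the k3 KL estimator are common defaults, not values taken from the cookbook run.

```python
import math

def grpo_token_loss(logp_new: float, logp_old: float, logp_ref: float,
                    advantage: float, beta: float = 0.04,
                    clip_eps: float = 0.2) -> float:
    """Per-token GRPO loss: clipped policy-gradient term minus a KL
    penalty against the frozen reference policy (assumed defaults)."""
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    pg = min(unclipped, clipped)            # PPO-style pessimistic bound
    # k3 estimator of KL(new || ref); non-negative by construction
    kl = math.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(pg - beta * kl)
```

The KL term is what keeps the policy near the SFT base: without it, reward hacking (e.g. degenerate short answers gaming the length penalty) tends to appear well before the 1500-step mark.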
Convergence pattern:
- First 200 steps: rapid improvement
- 200-800: moderate pace
- 800-1500: slowdown
- 1500+: plateau, marginal gains
The cookbook's rule of thumb: 1500-2000 GRPO steps are enough. Spend any extra compute on DPO or another domain instead.
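That stopping rule can be automated with a simple plateau check on the eval curve. The helper below is hypothetical (the `window` and `eps` thresholds are assumptions, not cookbook values); it flags a run as converged once accuracy gains between evals drop below a small margin.

```python
def plateaued(acc_history: list[float], window: int = 1, eps: float = 0.5) -> bool:
    """Return True once accuracy gained over the last `window` evals
    falls below `eps` percentage points (assumed thresholds)."""
    if len(acc_history) <= window:
        return False
    return acc_history[-1] - acc_history[-1 - window] < eps

# Accuracy trace from the table above (steps 0, 100, 200, 400, 800, 1500, 3000)
trace = [85.4, 86.8, 88.2, 89.5, 90.6, 91.2, 91.5]
print(plateaued(trace[:5]))  # False — still gaining >1 point per eval
print(plateaued(trace))      # True — only +0.3 points from step 1500 to 3000
```

On this trace the check fires between steps 1500 and 3000, matching the rule of thumb: the extra 1500 steps bought only 0.3 points.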
✅ Deliverables
1) Run the GRPO Lab from Part XI, Lesson 11.7. 2) Inspect the convergence curve. 3) Next lesson: 12.5 — Long-CoT Stability.