# GRPO and Reasoning RL: Inside DeepSeek-R1 — From Group-Based Advantage Estimation to Process Reward

> Source: https://sukruyusufkaya.com/en/learn/llm-muhendisligi/grpo-reasoning-rl-deepseek-r1
> Updated: 2026-05-13T13:00:30.020Z
> Category: LLM Engineering
> Module: Module 15: Preference Alignment — RLHF, PPO, DPO, GRPO

**TLDR:** GRPO (Group Relative Policy Optimization) is DeepSeek's simplification of PPO: it estimates advantages without a value function by comparing each sampled response against the others in its group, which lowers the computational cost. This lesson covers the anatomy of the DeepSeek-R1 paper (January 2025), the ordering of RL stages in reasoning training, the "aha moment" phenomenon, the role of process reward models, an o1 vs. R1 architecture comparison, and practical notes for building a Turkish reasoning model.

