# Process Reward Models (PRM): Step-Level Supervision — PRM800K Dataset

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-prm-process-reward-models-step-level
> Updated: 2026-05-14T14:42:58.528Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part XI — Alignment & Preference Optimization
**TLDR:** PRM = reward per reasoning step. Instead of outcome-only (final answer), teach quality of each intermediate step. OpenAI PRM800K dataset, Math-Shepherd auto PRM generation, Step-DPO. Foundation for test-time tree search (Best-of-N, MCTS). PRM train + use on RTX 4090.

