# Mathematics of the Reward Model: From Bradley-Terry 1952 to Modern LLM Reward Architecture

> Source: https://sukruyusufkaya.com/en/learn/llm-muhendisligi/reward-model-matematik-bradley-terry-modern-mimari
> Updated: 2026-05-13T13:00:29.754Z
> Category: LLM Engineering
> Module: Module 15: Preference Alignment — RLHF, PPO, DPO, GRPO

**TLDR:** A mathematical anatomy of the reward model, the heart of RLHF. It covers the derivation of the Bradley-Terry (1952) logistic preference model, the probabilistic interpretation of the sigmoid, the derivative of the ranking loss, RM architecture choices (a model separate from the SFT model vs. a shared trunk with a value head), calibration and overconfidence problems, the Plackett-Luce extension to multiple comparisons, and practical pitfalls of RM training for Turkish.
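As a compact illustration of the quantities listed above, here is a minimal Python sketch (standard library only) of the Bradley-Terry preference probability, the pairwise ranking loss with its gradient, and the Plackett-Luce log-likelihood over a full ranking. It assumes scalar rewards `r` per response and the usual `exp(r)` strength parameterization; the function names are hypothetical and chosen for this example, not taken from the article or any library.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    # Numerically stable logistic function (avoids overflow in exp).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x: float) -> float:
    # log(1 + exp(x)) computed without overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry with strengths exp(r):
    # exp(r_w) / (exp(r_w) + exp(r_l)) = sigmoid(r_w - r_l)
    return sigmoid(r_chosen - r_rejected)


def bt_loss_and_grads(r_chosen: float, r_rejected: float) -> tuple[float, float, float]:
    # Pairwise ranking loss: L = -log sigmoid(r_w - r_l) = softplus(-(r_w - r_l))
    d = r_chosen - r_rejected
    loss = softplus(-d)
    # dL/dd = sigmoid(d) - 1, so the gradient vanishes as the margin d grows
    g = sigmoid(d) - 1.0
    return loss, g, -g  # (loss, dL/dr_chosen, dL/dr_rejected)


def plackett_luce_log_prob(rewards_in_ranked_order: Sequence[float]) -> float:
    # log P(ranking) = sum_k [ r_k - log sum_{j >= k} exp(r_j) ], best item first
    total = 0.0
    for k in range(len(rewards_in_ranked_order)):
        tail = rewards_in_ranked_order[k:]
        m = max(tail)  # log-sum-exp shift for stability
        log_z = m + math.log(sum(math.exp(r - m) for r in tail))
        total += rewards_in_ranked_order[k] - log_z
    return total


if __name__ == "__main__":
    loss, g_w, g_l = bt_loss_and_grads(1.0, 0.0)
    print(bt_preference_prob(1.0, 0.0), loss, g_w, g_l)
    print(plackett_luce_log_prob([2.0, 1.0, 0.0]))
```

The gradient with respect to the chosen reward is `sigmoid(d) - 1`, so pairs the model already separates by a wide margin contribute almost nothing to training. For exactly two items, the Plackett-Luce log-likelihood reduces to `log sigmoid(r_w - r_l)`, i.e. the Bradley-Terry case.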

