# Online Eval: Judge LLM + Win-Rate Dashboard + Regression Alarms

> Source: https://sukruyusufkaya.com/en/learn/fine-tuning-cookbook/ftc-online-eval-judge-llm-winrate
> Updated: 2026-05-14T14:43:01.916Z
> Category: Fine-Tuning Cookbook (Model-by-Model)
> Module: Part XVI — Production Operations

**TLDR:** Measure model quality in production in real time: a judge LLM (GPT-4o-mini or Llama 3.3 70B) scores every Nth response, a dashboard tracks the win rate of v2 against v1, and regression alarms flag quality drops. Open-source eval kits: PromptFoo, DeepEval, RAGAs. The Cookbook's eval suite takes a daily snapshot, computes a weekly aggregate, and raises an alarm if the regression exceeds 3 points.
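
The loop summarized above (sample every Nth response, aggregate judge verdicts into a win rate, alarm on a drop) can be sketched in a few lines of Python. This is a minimal illustration, not the Cookbook's implementation: the function names, the `Verdict` labels, and the choice to count a tie as half a win are all assumptions, and "3 points" is read here as percentage points of win rate. The judge call itself is left out; the sketch starts from verdicts the judge has already returned.

```python
from typing import Iterable, Literal

# Hypothetical verdict labels returned by the judge LLM for one sampled response pair.
Verdict = Literal["v2", "v1", "tie"]


def should_sample(request_index: int, every_n: int) -> bool:
    """Return True for every Nth request (1-based): N, 2N, 3N, ..."""
    return every_n > 0 and request_index % every_n == 0


def win_rate(verdicts: Iterable[Verdict]) -> float:
    """Win rate of v2 against v1 in percent.

    Assumption: a tie counts as half a win for v2.
    """
    items = list(verdicts)
    if not items:
        raise ValueError("no judged samples")
    score = sum(1.0 if v == "v2" else 0.5 if v == "tie" else 0.0 for v in items)
    return 100.0 * score / len(items)


def regression_alarm(baseline: float, current: float,
                     threshold_points: float = 3.0) -> bool:
    """Fire when the current win rate is more than `threshold_points` below baseline.

    Assumption: both values are win rates in percent, so the threshold is
    measured in percentage points.
    """
    return (baseline - current) > threshold_points


if __name__ == "__main__":
    daily = ["v2", "v2", "v1", "tie"]          # judge verdicts from one daily snapshot
    today = win_rate(daily)
    print(f"win rate: {today:.1f}%")
    print("alarm:", regression_alarm(baseline=67.0, current=today))
```

Keeping sampling, aggregation, and alarming as separate pure functions makes each one easy to test and lets the same `win_rate` serve both the daily snapshot and the weekly aggregate.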

