Online Eval: Judge LLM + Win-Rate Dashboard + Regression Alarms
Real-time model quality measurement in production: a Judge LLM (GPT-4o-mini / Llama 3.3 70B) scores every Nth response, a win-rate dashboard compares v2 against v1, and regression alarms fire automatically. Open-source eval kits: PromptFoo, DeepEval, RAGAs. The cookbook's eval suite: daily snapshot + weekly aggregate + alarm if the score regresses by more than 3 points.
Şükrü Yusuf KAYA
26 min read
Advanced · Python
```python
# === Online eval — Judge LLM scoring ===
import random

import openai  # requires OPENAI_API_KEY in the environment

def judge_response(query, response, model="gpt-4o-mini"):
    judge_prompt = f"""Rate the quality of the answer below on a scale of 1-10.
Criteria: accuracy, relevance, grammar, conciseness.

Question: {query}
Answer: {response}

Return only the number (1-10):"""
    score = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
    ).choices[0].message.content
    return int(score.strip())

# Production sampling — judge every 100th response on average
def maybe_judge(query, response):
    if random.random() < 0.01:  # 1% sample
        score = judge_response(query, response)
        # log_to_metrics is your metrics sink, defined elsewhere
        log_to_metrics({"judge_score": score, "model_version": "v2"})
```
Online judge eval sampling
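The win-rate dashboard from the summary compares v2 and v1 head-to-head rather than scoring each response in isolation. A minimal sketch, reusing the same OpenAI client as above; the pairwise prompt wording, the tie handling, and `log_to_metrics` are illustrative assumptions, not a fixed API:

```python
# === Win-rate: pairwise judge, v2 vs v1 (sketch) ===
import openai

def judge_pairwise(query, answer_a, answer_b, model="gpt-4o-mini"):
    """Ask the judge which answer is better. Returns 'A', 'B', or 'TIE'."""
    prompt = f"""Compare the two answers to the question below.
Criteria: accuracy, relevance, grammar, conciseness.

Question: {query}
Answer A: {answer_a}
Answer B: {answer_b}

Reply with exactly one of: A, B, TIE"""
    verdict = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # fail safe

def win_rate(pairs):
    """pairs: list of (query, v1_response, v2_response). Returns v2 win-rate in %."""
    wins = ties = 0
    for query, v1, v2 in pairs:
        verdict = judge_pairwise(query, answer_a=v1, answer_b=v2)
        if verdict == "B":       # B is the v2 response
            wins += 1
        elif verdict == "TIE":
            ties += 1
    # Counting ties as half a win is a common dashboard convention.
    return 100 * (wins + 0.5 * ties) / len(pairs)
```

In practice, run each pair twice with the answer order swapped to control for the judge's position bias, and count inconsistent verdicts as ties.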
✅ Deliverables
1) Install PromptFoo or DeepEval. 2) Run judge eval on a 1% production sample. 3) Next lesson: 16.4 — Drift Detection.
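Finally, the alarm half of the cookbook's eval suite — daily snapshot + weekly aggregate + alarm if the score regresses by more than 3 points — reduces to a small comparison job. A minimal sketch; `send_alert` is a hypothetical stand-in for your paging system, and "points" here refers to whatever metric the dashboard tracks (e.g. win-rate percentage points):

```python
# === Regression alarm: daily snapshot vs. weekly aggregate (sketch) ===
from statistics import mean

REGRESSION_THRESHOLD = 3.0  # points, per the cookbook's eval suite

def check_regression(daily_scores, weekly_scores):
    """Alarm if today's snapshot trails the weekly aggregate by > 3 points."""
    daily = mean(daily_scores)
    weekly = mean(weekly_scores)
    delta = weekly - daily
    if delta > REGRESSION_THRESHOLD:
        send_alert(
            f"Score regression: daily {daily:.1f} vs weekly {weekly:.1f} "
            f"(-{delta:.1f} points)"
        )
    return delta

def send_alert(message):
    # Placeholder: wire this to Slack / PagerDuty / etc. in production.
    print(f"[ALARM] {message}")
```

Run it once a day from a scheduler (cron, Airflow) over the scores logged by `maybe_judge`, comparing the last 24 hours against the trailing 7-day window.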