
Online Eval: Judge LLM + Win-Rate Dashboard + Regression Alarms

Real-time model quality measurement in production: a Judge LLM (GPT-4o-mini / Llama 3.3 70B) scores every Nth response, a dashboard tracks win-rate of v2 vs v1, and regression alarms fire on drops. Open eval kits: PromptFoo, DeepEval, RAGAs. The Cookbook's eval suite: daily snapshot + weekly aggregate + alarm if regression > 3 points.

Şükrü Yusuf KAYA
26 min read
Advanced
python
# === Online eval — Judge LLM scoring ===
import random

import openai  # requires OPENAI_API_KEY in the environment

def judge_response(query, response, model="gpt-4o-mini"):
    judge_prompt = f"""Rate the quality of the answer below on a scale of 1-10.
Criteria: correctness, relevance, grammar, conciseness.

Question: {query}
Answer: {response}

Return only the number (1-10):"""
    score = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
    ).choices[0].message.content
    return int(score.strip())

# Production sampling — judge roughly every 100th response
def maybe_judge(query, response):
    if random.random() < 0.01:  # 1% sample
        score = judge_response(query, response)
        log_to_metrics({"judge_score": score, "model_version": "v2"})
Online judge eval sampling
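The win-rate dashboard mentioned above compares v2 against v1 on the same queries: a judge picks the better answer for each pair, and the dashboard plots the share of pairs v2 wins. A minimal sketch of the aggregation step (the `win_rate` helper and the tie-counts-as-half convention are assumptions, not part of the lesson's code):

```python
from collections import Counter

def win_rate(verdicts):
    """Win-rate of v2 vs v1 from pairwise judge verdicts.

    verdicts: list of "v1", "v2", or "tie", one entry per judged pair.
    A tie counts as half a win for each side (a common pairwise convention).
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    if total == 0:
        return 0.0
    return (counts["v2"] + 0.5 * counts["tie"]) / total

# Example: 6 v2 wins, 3 v1 wins, 1 tie
verdicts = ["v2"] * 6 + ["v1"] * 3 + ["tie"]
print(f"v2 win-rate: {win_rate(verdicts):.0%}")  # 65%
```

The pairwise verdicts themselves would come from a judge prompt like the one above, asking "Which answer is better, A or B?" with answer order randomized to avoid position bias.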
✅ Deliverables
  1. Install PromptFoo or DeepEval. 2. Run judge eval on a 1% production sample. 3. Next lesson: 16.4 — Drift Detection.
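The alarm half of the eval suite (daily snapshot → weekly aggregate → alarm if regression > 3 points) can be sketched as a pure comparison of weekly mean judge scores. This is an illustrative sketch, not the Cookbook's actual implementation; `check_regression` and the interpretation of "3 points" as points on the 1-10 judge scale are assumptions:

```python
def check_regression(prev_week_scores, this_week_scores, threshold=3.0):
    """Alarm if the weekly mean judge score drops by more than `threshold`.

    Each list holds daily snapshots (mean judge score per day);
    the weekly aggregate is simply their mean.
    """
    prev_mean = sum(prev_week_scores) / len(prev_week_scores)
    this_mean = sum(this_week_scores) / len(this_week_scores)
    drop = prev_mean - this_mean
    return drop > threshold, round(drop, 2)

# Daily snapshots for two consecutive weeks (illustrative numbers)
prev_week = [8.1, 8.3, 7.9, 8.0, 8.2, 8.1, 8.0]
this_week = [4.5, 4.8, 4.6, 4.7, 4.4, 4.9, 4.6]
alarmed, drop = check_regression(prev_week, this_week)
if alarmed:
    print(f"ALARM: judge score dropped {drop} points week-over-week")
```

In production the alarm would page or post to a channel instead of printing; the point is that the check runs on cheap pre-aggregated snapshots, not raw per-request scores.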

