Online Eval: Judge LLM + Win-Rate Dashboard + Regression Alarms
Real-time model quality measurement in production: a Judge LLM (GPT-4o-mini / Llama 3.3 70B) scores every Nth response, a win-rate dashboard compares v2 against v1, and regression alarms fire automatically. Open-source eval kits: PromptFoo, DeepEval, RAGAs. The cookbook's eval suite: daily snapshot + weekly aggregate + alarm if the score regresses by more than 3 points.
Şükrü Yusuf KAYA
26 min read
Advanced · Python
```python
# === Online eval — Judge LLM scoring ===
import random

import openai  # requires OPENAI_API_KEY in the environment

def judge_response(query, response, model="gpt-4o-mini"):
    judge_prompt = f"""Rate the quality of the answer below on a scale of 1-10.
Criteria: accuracy, relevance, grammar, conciseness.

Question: {query}
Answer: {response}

Return only the number (1-10):"""
    score = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
    ).choices[0].message.content
    return int(score.strip())

# Production sampling — judge every 100th response on average
def maybe_judge(query, response):
    if random.random() < 0.01:  # 1% sample
        score = judge_response(query, response)
        # log_to_metrics is your metrics sink, defined elsewhere
        log_to_metrics({"judge_score": score, "model_version": "v2"})
```
Online judge eval sampling
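The win-rate dashboard from the summary compares v2 and v1 head-to-head rather than scoring each response in isolation. A minimal sketch, reusing the same OpenAI client as above; the pairwise prompt wording, the tie handling, and `log_to_metrics` are illustrative assumptions, not a fixed API:

```python
# === Win-rate: pairwise judge, v2 vs v1 (sketch) ===
import openai

def judge_pairwise(query, answer_a, answer_b, model="gpt-4o-mini"):
    """Ask the judge which answer is better. Returns 'A', 'B', or 'TIE'."""
    prompt = f"""Compare the two answers to the question below.
Criteria: accuracy, relevance, grammar, conciseness.

Question: {query}
Answer A: {answer_a}
Answer B: {answer_b}

Reply with exactly one of: A, B, TIE"""
    verdict = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # fail safe

def win_rate(pairs):
    """pairs: list of (query, v1_response, v2_response). Returns v2 win-rate in %."""
    wins = ties = 0
    for query, v1, v2 in pairs:
        verdict = judge_pairwise(query, answer_a=v1, answer_b=v2)
        if verdict == "B":       # B is the v2 response
            wins += 1
        elif verdict == "TIE":
            ties += 1
    # Counting ties as half a win is a common dashboard convention.
    return 100 * (wins + 0.5 * ties) / len(pairs)
```

In practice, run each pair twice with the answer order swapped to control for the judge's position bias, and count inconsistent verdicts as ties.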
✅ Deliverables
1) Install PromptFoo or DeepEval. 2) Run judge eval on a 1% production sample. 3) Next lesson: 16.4 — Drift Detection.
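Finally, the alarm half of the cookbook's eval suite — daily snapshot + weekly aggregate + alarm if the score regresses by more than 3 points — reduces to a small comparison job. A minimal sketch; `send_alert` is a hypothetical stand-in for your paging system, and "points" here refers to whatever metric the dashboard tracks (e.g. win-rate percentage points):

```python
# === Regression alarm: daily snapshot vs. weekly aggregate (sketch) ===
from statistics import mean

REGRESSION_THRESHOLD = 3.0  # points, per the cookbook's eval suite

def check_regression(daily_scores, weekly_scores):
    """Alarm if today's snapshot trails the weekly aggregate by > 3 points."""
    daily = mean(daily_scores)
    weekly = mean(weekly_scores)
    delta = weekly - daily
    if delta > REGRESSION_THRESHOLD:
        send_alert(
            f"Score regression: daily {daily:.1f} vs weekly {weekly:.1f} "
            f"(-{delta:.1f} points)"
        )
    return delta

def send_alert(message):
    # Placeholder: wire this to Slack / PagerDuty / etc. in production.
    print(f"[ALARM] {message}")
```

Run it once a day from a scheduler (cron, Airflow) over the scores logged by `maybe_judge`, comparing the last 24 hours against the trailing 7-day window.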