
Production Evaluation Framework: From Test Set Design to LLM-as-Judge — Build Your Turkish Eval System

Building a production-grade LLM evaluation framework: test set design (sampling strategy, edge cases, adversarial examples), an automated eval pipeline (pytest-like setup), LLM-as-a-judge strategies (GPT-4o vs. Claude vs. ensemble, bias detection), error analysis (clustering, root cause), and A/B testing protocols (statistical significance, sample size). An objective comparison of the 7 production artifacts from Modules 15-20, with clean evaluation code in Python + Pydantic.

Şükrü Yusuf KAYA

85 min read

5/13/2026

Advanced


🎯 Production Eval: How Do You Know Your Model Is Actually Good?

Public benchmarks (MMLU, Arena) offer guidance, but in production the decision rests on your own test set. Why?

Public benchmarks are general; your domain is specific
Public benchmarks are static; user questions are dynamic
Public benchmarks are exposed to contamination; your set is fresh
Public benchmarks are in English; yours is in Turkish

Building a production LLM eval framework comes down to three elements:

(1) Test set design: 100-500 representative Turkish questions, edge cases, and adversarial examples, chosen with a deliberate sampling strategy and kept continuously up to date.

(2) Automated eval pipeline: a pytest-like setup that runs automatically on every commit, scores answers with an LLM-as-judge, and catches regressions.

(3) A/B testing: route 5% of production traffic to the new model and compare the results statistically, with a sample size calculation and significance testing.
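The sample size calculation in step (3) can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch rather than the course's own implementation: the function names `sample_size_per_arm` and `two_proportion_p_value`, the idea of measuring an "approval rate", and the default `alpha` and `power` values are assumptions for illustration. The formula also assumes two equally sized arms, so with a 5% traffic split it is best read as a rough planning number.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base: float, p_new: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_new) ** 2)


def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test (A = control, B = new model)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == '__main__':
    # Hypothetical planning question: detect a lift from 70% to 75% approval.
    print(sample_size_per_arm(0.70, 0.75))
    # Hypothetical observed counts after the experiment.
    print(two_proportion_p_value(700, 1000, 760, 1000))
```

Under these defaults, detecting a lift from 70% to 75% needs roughly 1,250 users per arm; halving the detectable lift roughly quadruples that number, which is why the minimum effect size is worth fixing before the experiment starts.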

This lesson teaches you to build your own Turkish eval framework. We build a tool that objectively compares the 7 production artifacts from Modules 15-20. In 85 minutes: the skills of a production-grade evaluation engineer.
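The sampling strategy behind test set design, step (1) above, can be sketched as stratified sampling: fix a quota per category and draw that many questions from a larger candidate pool. This is only one possible strategy. The `stratified_sample` helper, the dict-shaped pool items, and the quota numbers are assumptions for illustration; the lesson does not prescribe them.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(pool: List[dict], quotas: Dict[str, int],
                      seed: int = 42) -> List[dict]:
    """Draw a fixed number of questions per category from a candidate pool.

    pool   : items shaped like {'id': ..., 'category': ...} (hypothetical schema)
    quotas : how many items to draw for each category
    seed   : fixed seed so the test set is reproducible between runs
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for item in pool:
        by_category[item['category']].append(item)

    sample = []
    for category, k in quotas.items():
        candidates = by_category.get(category, [])
        if len(candidates) < k:
            raise ValueError(
                f'{category}: need {k} items, pool has {len(candidates)}')
        sample.extend(rng.sample(candidates, k))
    return sample


if __name__ == '__main__':
    # Hypothetical pool: 50 candidate questions in each of three categories.
    pool = [{'id': f'{cat}_{i:02d}', 'category': cat}
            for cat in ('greeting', 'reasoning', 'edge_case')
            for i in range(50)]
    # Over-weight edge cases relative to their share of real traffic.
    chosen = stratified_sample(
        pool, {'greeting': 5, 'reasoning': 10, 'edge_case': 15})
    print(len(chosen), [item['id'] for item in chosen[:3]])
```

Fixing the seed keeps the drawn set stable, so score changes between commits come from the model rather than from a reshuffled test set.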

What's in This Lesson? (11 Sections)

Test set design: sampling, edge cases, adversarial
Turkish test set setup: a 200-question example
Automated eval pipeline: pytest-style setup
LLM-as-a-judge: bias, ensemble, calibration
GPT-4o vs Claude vs Llama ensemble judge
Error analysis: clustering, root cause
A/B testing protocols: statistical significance
Sample size calculation: power analysis
Production monitoring + alerting
Comparing the 7 artifacts: Modules 15-20 review
Exercises

python

# Production Turkish LLM evaluation framework: Pydantic models, pytest-style structure
import json
from typing import List, Literal

from openai import OpenAI
from pydantic import BaseModel, Field

# Reads OPENAI_API_KEY from the environment
client = OpenAI()

# 1. Test example model
class TestExample(BaseModel):
    id: str
    category: Literal['greeting', 'reasoning', 'math', 'coding', 'cultural', 'edge_case']
    question_tr: str
    expected_traits: List[str] = Field(..., description='Expected answer traits')
    difficulty: Literal['easy', 'medium', 'hard']
 
class EvalResult(BaseModel):
    test_id: str
    model_response: str
    score: float = Field(..., ge=0, le=10)
    reasoning: str
    failed_traits: List[str] = Field(default_factory=list)

# 2. Sample of the Turkish test set (5 of 200 questions shown)
test_set = [
    TestExample(
        id='greet_01',
        category='greeting',
        question_tr='Merhaba, nasılsın?',
        expected_traits=['Türkçe doğal yanıt', 'samimi ton', 'kullanıcıya soru geri'],
        difficulty='easy',
    ),
    TestExample(
        id='reason_01',
        category='reasoning',
        question_tr='Bir kutuda 5 kırmızı, 3 mavi, 2 yeşil top var. Rastgele 1 top çekiyorum, kırmızı çıkmazsa geri koyup tekrar çekiyorum. İkinci çekişte kırmızı çıkma olasılığı?',
        expected_traits=['adım adım çözüm', 'doğru olasılık hesabı', 'koşullu olasılık'],
        difficulty='medium',
    ),
    TestExample(
        id='cultural_01',
        category='cultural',
        question_tr='Türk kahvesi nasıl pişirilir? Önemli ipuçları neler?',
        expected_traits=['cezve kullanımı', 'soğuk su', 'kısık ateş', 'köpük', 'şeker seçeneği'],
        difficulty='easy',
    ),
    TestExample(
        id='edge_01',
        category='edge_case',
        question_tr='3 yıl önce annem 27 yaşındaydı. Şu an ben 5 yaşındayım. Şimdi babamın yaşı ne?',
        expected_traits=['bilgi eksik', 'sormak', 'varsayım yapmamak'],
        difficulty='hard',
    ),
    TestExample(
        id='math_01',
        category='math',
        question_tr='log_2(16) + log_3(27) = ?',
        expected_traits=['log_2(16)=4', 'log_3(27)=3', 'toplam 7'],
        difficulty='medium',
    ),
]
 
# 3. LLM-as-a-Judge (GPT-4o)
def llm_judge(test: TestExample, model_response: str) -> EvalResult:
    judge_prompt = f'''Türkçe LLM çıktısını değerlendir.
 
Soru: {test.question_tr}
Kategori: {test.category}
Beklenen özellikler:
{chr(10).join(f'- {t}' for t in test.expected_traits)}
 
Model cevabı: {model_response}
 
Değerlendirme JSON formatında:
{{
  "score": 0-10,
  "reasoning": "detaylı analiz",
  "failed_traits": ["karşılanmayan özellikler"]
}}
 
Kurallar:
- 10: tüm beklenen özellikler karşılandı + Türkçe doğal
- 7-9: çoğu özellik karşılandı
- 4-6: yarı yarıya
- 1-3: çoğu eksik
- 0: tamamen başarısız (yanlış dil, hata, halüsinasyon)
'''
    # temperature=0 keeps the judge's scores as repeatable as possible across runs
    response = client.chat.completions.create(
        model='gpt-4o',
        temperature=0,
        messages=[
            {'role': 'system', 'content': 'Sen titiz bir Türkçe LLM hakemisin. JSON döndürürsün.'},
            {'role': 'user', 'content': judge_prompt},
        ],
        response_format={'type': 'json_object'},
    )
    
    judge_data = json.loads(response.choices[0].message.content)
    return EvalResult(
        test_id=test.id,
        model_response=model_response,
        score=judge_data['score'],
        reasoning=judge_data['reasoning'],
        failed_traits=judge_data.get('failed_traits', []),
    )
 
# 4. Eval pipeline
def evaluate_model(model_name: str, test_set: List[TestExample]) -> List[EvalResult]:
    results = []
    for test in test_set:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {'role': 'system', 'content': 'Sen yardımsever bir Türkçe asistansın.'},
                {'role': 'user', 'content': test.question_tr},
            ],
        )
        model_response = response.choices[0].message.content
        result = llm_judge(test, model_response)
        results.append(result)
        print(f'{test.id}: {result.score:.1f}/10')
    return results
 
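# 5. Comparison
# Note: this client only reaches OpenAI models. A Claude model such as
# 'claude-3-5-sonnet-20241022' needs the Anthropic SDK or an OpenAI-compatible gateway.
models_to_compare = ['gpt-4o', 'gpt-4o-mini']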
 
for model in models_to_compare:
    print(f'\n=== {model} ===')
    results = evaluate_model(model, test_set)
    avg_score = sum(r.score for r in results) / len(results)
    print(f'Average: {avg_score:.2f}/10')

    # Per-category breakdown
    by_category = {}
    for test, result in zip(test_set, results):
        by_category.setdefault(test.category, []).append(result.score)
    
    for cat, scores in by_category.items():
        print(f'  {cat}: {sum(scores)/len(scores):.2f}')

Turkish LLM Evaluation Framework: Pydantic Production
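The listing above prints scores but does not yet fail a build when quality drops. One way to get the "catch regressions on every commit" behavior is a test function that pytest discovers by its `test_` prefix and that compares per-category means against a stored baseline. This is a sketch under stated assumptions: `BASELINE`, `category_means`, `find_regressions`, the 0.5-point tolerance, and the hard-coded scores are all hypothetical; in a real run the scores would come from `evaluate_model()`.

```python
from typing import Dict, List

# Hypothetical baseline: per-category mean scores from the last accepted run.
BASELINE: Dict[str, float] = {'greeting': 9.0, 'reasoning': 7.5, 'math': 8.0}


def category_means(scores_by_category: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the judge scores within each category, skipping empty ones."""
    return {cat: sum(scores) / len(scores)
            for cat, scores in scores_by_category.items() if scores}


def find_regressions(baseline: Dict[str, float], current: Dict[str, float],
                     tolerance: float = 0.5) -> List[str]:
    """Return categories that dropped more than `tolerance` below baseline.

    A category missing from `current` counts as a regression.
    """
    return sorted(cat for cat, base in baseline.items()
                  if current.get(cat, 0.0) < base - tolerance)


def test_no_category_regression():
    # Hard-coded stand-in for scores produced by the eval pipeline.
    current = category_means({
        'greeting': [9, 9],
        'reasoning': [8, 7],
        'math': [8, 8],
    })
    assert find_regressions(BASELINE, current) == []


if __name__ == '__main__':
    test_no_category_regression()
    print(find_regressions(BASELINE, {'greeting': 9.0, 'reasoning': 6.0}))
```

Gating per category rather than on the overall average keeps a gain in one category from hiding a drop in another.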

✅ Lesson 21.2 Summary: Production Eval Framework

Production eval means your own test set. Three elements: test set design (200 representative Turkish questions, edge cases, adversarial examples), an automated pipeline (pytest-style setup, Pydantic models, LLM-as-judge), and A/B testing (statistical significance, sample size). LLM-as-judge: GPT-4o alone is not enough; an ensemble (GPT-4o + Claude) reduces bias. The 7 artifacts from Modules 15-20 can be compared objectively with this framework. The next lesson is the capstone: publish your own Turkish LLM benchmark, TR-LLMArena.
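Comparing artifacts objectively usually means more than comparing two averages, since a 0.3-point gap on a small test set can be noise. One common approach, sketched here as an assumption rather than as the lesson's method, is a paired bootstrap: both models answer the same questions, so the per-question score differences are resampled to get a confidence interval. The function name `paired_bootstrap_diff` and the example scores are hypothetical.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_diff(scores_a: List[float], scores_b: List[float],
                          n_resamples: int = 2000,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Mean of (B - A) with an approximate 95% percentile bootstrap interval.

    scores_a and scores_b must be judge scores for the same test items,
    in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError('both models must be scored on the same test items')
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-item differences with replacement and record each mean.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return mean(diffs), low, high


if __name__ == '__main__':
    # Hypothetical judge scores for two artifacts on the same ten questions.
    artifact_a = [5, 6, 7, 4, 8, 6, 5, 7, 6, 5]
    artifact_b = [6, 6, 8, 6, 8, 7, 7, 7, 8, 6]
    diff, low, high = paired_bootstrap_diff(artifact_a, artifact_b)
    print(f'B - A = {diff:.2f}, 95% CI [{low:.2f}, {high:.2f}]')
```

If the interval excludes zero, the gap is unlikely to be resampling noise; if it straddles zero, the test set is probably too small to separate the two artifacts.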

Next Lesson: Capstone TR-LLMArena

Lesson 21.3 is the Module 21 capstone: building a Turkish LLM Arena. It is a community-driven, Turkish-specific leaderboard in the style of LMSys, with double-blind A/B voting, Elo ranking, and a monthly leaderboard, deployed on HuggingFace Spaces. A concrete contribution to the Turkish AI ecosystem.

Frequently Asked Questions

**Is LLM-as-judge really reliable? Isn't there bias?**

**Bias exists but is manageable.**

**Documented biases**:
- **Self-preference**: GPT-4 scores its own responses higher (roughly 5-10% bias)
- **Verbosity**: longer responses get higher scores (human raters do this too)
- **Position bias**: in an 'A vs B' comparison the judge favors A (roughly 3% bias)
- **Format bias**: structured output (bullet points) gets higher scores

**Mitigation strategies**:
1. **Ensemble**: average the scores of GPT-4o and Claude 3.5 Sonnet to reduce bias
2. **Position swap**: test the same pair as both A-B and B-A
3. **Multi-criteria**: score several criteria (accuracy, fluency, appropriateness) instead of a single number
4. **Human validation sample**: compare human and LLM judge scores on 50-100 examples to calibrate

With a mature ensemble, LLM-as-judge reaches **85%+ correlation with human judgments**. That is reliable enough to use, but not a reason for blind trust.
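The position swap and ensemble strategies can be combined in a few lines. The sketch below uses pairwise verdicts with a majority vote instead of the score averaging mentioned above, and it replaces real API calls with stub judges so that it runs offline. The `PairwiseJudge` signature, the `'first'`/`'second'` return convention, and both stub judges are assumptions for illustration.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge interface: (question, first_answer, second_answer) -> 'first' | 'second'
PairwiseJudge = Callable[[str, str, str], str]


def swap_consistent_verdict(judge: PairwiseJudge, question: str,
                            answer_a: str, answer_b: str) -> str:
    """Ask the judge in both orders; only a verdict that survives the swap counts."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    if forward == 'first' and backward == 'second':
        return 'A'
    if forward == 'second' and backward == 'first':
        return 'B'
    return 'tie'  # the judge followed position, not content


def ensemble_verdict(judges: List[PairwiseJudge], question: str,
                     answer_a: str, answer_b: str) -> str:
    """Majority vote over swap-consistent verdicts; a split vote is a tie."""
    if not judges:
        raise ValueError('at least one judge is required')
    votes = Counter(swap_consistent_verdict(j, question, answer_a, answer_b)
                    for j in judges)
    (top, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return 'tie'
    return top


# Stub judges standing in for real model calls.
def position_biased_judge(question: str, first: str, second: str) -> str:
    return 'first'  # always prefers whatever is shown first


def keyword_judge(question: str, first: str, second: str) -> str:
    return 'first' if 'cezve' in first else 'second'


if __name__ == '__main__':
    q = 'Türk kahvesi nasıl pişirilir?'
    a, b = 'Su kaynatın.', 'Cezveye soğuk su koyun; cezve kısık ateşte kalsın.'
    print(swap_consistent_verdict(position_biased_judge, q, a, b))
    print(ensemble_verdict([position_biased_judge, keyword_judge, keyword_judge], q, a, b))
```

The swap check only neutralizes position bias. Verbosity and self-preference bias pass through it untouched, which is why the human validation sample remains the final check.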
