
topic · core

Trajectory Eval with LLM-as-Judge

A strong model scores the agent's trajectory along 5-7 dimensions, replacing manual review.

3 hours · 1 prerequisite

Manual trajectory review does not scale (10 steps × 100 tasks = 1,000 reviews). An LLM-as-judge instead does the following:

Scores the final outcome as success/fail
Rates each tool call as "necessary / sufficient / efficient"
Gives the trajectory as a whole a "natural / convoluted" score
Categorizes the failure reason (wrong tool, hallucination, infinite loop, etc.)
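
The four judging tasks above can be sketched as a prompt builder plus a strict parser for the judge's reply. This is a minimal sketch, not a definitive implementation: the JSON schema, the names `Verdict`, `build_prompt`, `parse_verdict`, and `judge_trajectory`, and the `call_judge` callable (a stand-in for whatever model client you use) are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Label sets taken from the list above; the exact spellings are assumptions.
STEP_LABELS = ("necessary", "sufficient", "efficient")
OVERALL_LABELS = ("natural", "convoluted")
FAILURE_REASONS = ("none", "wrong_tool", "hallucination", "infinite_loop", "other")


@dataclass
class Verdict:
    success: bool                        # final outcome
    step_scores: list[dict[str, bool]]   # one dict per tool call
    overall: str                         # "natural" or "convoluted"
    failure_reason: str                  # one of FAILURE_REASONS


def build_prompt(task: str, tool_calls: list[str]) -> str:
    """Render the task and its numbered tool calls into a judge prompt."""
    numbered = "\n".join(f"{i}. {call}" for i, call in enumerate(tool_calls, start=1))
    return (
        f"Task: {task}\n"
        f"Tool calls:\n{numbered}\n\n"
        "Reply with JSON only, using these keys:\n"
        '  "success": true or false for the final outcome\n'
        f'  "step_scores": one object per tool call with boolean keys {list(STEP_LABELS)}\n'
        f'  "overall": one of {list(OVERALL_LABELS)}\n'
        f'  "failure_reason": one of {list(FAILURE_REASONS)}\n'
    )


def parse_verdict(raw: str, n_steps: int) -> Verdict:
    """Parse the judge's JSON reply and reject anything off-schema."""
    data = json.loads(raw)
    steps = data["step_scores"]
    if len(steps) != n_steps:
        raise ValueError(f"expected {n_steps} step scores, got {len(steps)}")
    for step in steps:
        if set(step) != set(STEP_LABELS):
            raise ValueError(f"unexpected step keys: {sorted(step)}")
    if data["overall"] not in OVERALL_LABELS:
        raise ValueError(f"unknown overall label: {data['overall']!r}")
    if data["failure_reason"] not in FAILURE_REASONS:
        raise ValueError(f"unknown failure reason: {data['failure_reason']!r}")
    return Verdict(bool(data["success"]), steps, data["overall"], data["failure_reason"])


def judge_trajectory(
    task: str, tool_calls: list[str], call_judge: Callable[[str], str]
) -> Verdict:
    """call_judge is a hypothetical hook: prompt text in, raw model reply out."""
    return parse_verdict(call_judge(build_prompt(task, tool_calls)), len(tool_calls))
```

Validating the reply against fixed label sets keeps a malformed judge answer from silently entering the aggregate scores; a rejected reply can be retried or logged.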

Pitfall: judge bias. A judge may prefer trajectories that resemble its own model's output. Mitigation: use multiple judges and average their scores.
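
A minimal sketch of the multiple-judges-plus-averaging idea follows. It assumes each judge has already produced a numeric score for the same trajectory on a shared scale (0 to 1 here); the function name `aggregate_judges` and the judge names in the usage note are hypothetical. The `spread` value is an illustrative extra, not something the approach above requires: a large gap between judges can hint that one of them is biased.

```python
from statistics import mean


def aggregate_judges(scores_by_judge: dict[str, float]) -> dict[str, float]:
    """Average one trajectory's score across several judge models.

    scores_by_judge maps a judge name to its score on a shared scale
    (assumed 0..1). Returns the mean and the max-min spread.
    """
    if not scores_by_judge:
        raise ValueError("need at least one judge score")
    values = list(scores_by_judge.values())
    return {
        "mean": mean(values),
        "spread": max(values) - min(values),  # disagreement between judges
    }
```

For example, `aggregate_judges({"judge_a": 0.8, "judge_b": 0.6, "judge_c": 0.7})` averages three judges' scores; drawing the judges from different model families is what gives the average a chance of cancelling a same-family preference.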

Know these first

Agent Eval Fundamentals

Single-shot LLM eval ≠ agent eval. Measure trajectory, outcome, and efficiency together.

General Agent Benchmarks

Production Observability Stack
