
topic: core

LLM-as-Judge for Trajectory Eval

A strong model scores the agent trajectory across 5-7 dimensions — replaces manual review.

3 hours, 1 prerequisite

Manual trajectory review does not scale: 10 steps × 100 tasks = 1,000 step-level reviews. An LLM judge can instead:

Score the final outcome as success or fail
Rate each tool call as necessary / sufficient / efficient
Score the trajectory as a whole as natural or convoluted
Categorize failure reasons (wrong tool, hallucination, infinite loop, etc.)
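
The four judging tasks above can be sketched as a rubric that the judge fills in as JSON. This is a minimal illustration, not a reference implementation: the `Verdict` fields, the `FAILURE_CATEGORIES` labels, the prompt wording, and the `call_judge` callable (a stand-in for whatever model client you use) are all assumptions made for this example.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical failure labels, following the examples in the list above.
FAILURE_CATEGORIES = ["none", "wrong_tool", "hallucination", "infinite_loop", "other"]


@dataclass
class Verdict:
    outcome_success: bool     # final outcome: success or fail
    step_ratings: list        # one dict per tool call
    trajectory_style: str     # "natural" or "convoluted"
    failure_category: str     # one of FAILURE_CATEGORIES


def build_prompt(task: str, steps: list) -> str:
    """Render the task and its numbered steps into a grading prompt."""
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    return (
        "You are grading an agent trajectory.\n"
        f"Task: {task}\n"
        f"Steps:\n{numbered}\n"
        "Reply with JSON only, using these keys: "
        "outcome_success (boolean); "
        "step_ratings (one object per step with boolean keys "
        "necessary, sufficient, efficient); "
        "trajectory_style ('natural' or 'convoluted'); "
        f"failure_category (one of {FAILURE_CATEGORIES})."
    )


def parse_verdict(raw: str, n_steps: int) -> Verdict:
    """Parse and validate the judge's JSON reply; raise ValueError if malformed."""
    data = json.loads(raw)
    ratings = data["step_ratings"]
    if len(ratings) != n_steps:
        raise ValueError(f"expected {n_steps} step ratings, got {len(ratings)}")
    if data["trajectory_style"] not in ("natural", "convoluted"):
        raise ValueError("unknown trajectory_style")
    if data["failure_category"] not in FAILURE_CATEGORIES:
        raise ValueError("unknown failure_category")
    return Verdict(
        outcome_success=bool(data["outcome_success"]),
        step_ratings=ratings,
        trajectory_style=data["trajectory_style"],
        failure_category=data["failure_category"],
    )


def judge_trajectory(task: str, steps: list, call_judge: Callable[[str], str]) -> Verdict:
    """call_judge is a placeholder: it takes a prompt and returns the model's raw text."""
    return parse_verdict(call_judge(build_prompt(task, steps)), len(steps))
```

Validating the reply before trusting it matters in practice, since a judge that returns the wrong number of step ratings or an unknown label would otherwise silently corrupt the aggregate statistics.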

Pitfall: judge bias. A judge may favor trajectories that resemble its own model's output. Mitigation: use multiple judges and average their scores.
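
A minimal sketch of the multi-judge mitigation, assuming each judge has already reduced a trajectory to a single score in [0, 1]. The function name, the judge names, and the disagreement threshold are hypothetical. Flagging high-disagreement cases for human review is an extra step added here for illustration, not something the text above prescribes.

```python
from statistics import mean, pstdev


def aggregate_scores(scores_by_judge: dict, disagreement_threshold: float = 0.25) -> dict:
    """Average per-judge scores and report how much the judges disagree.

    scores_by_judge maps a judge name to its score for one trajectory.
    The threshold is an arbitrary illustrative value, not a recommended setting.
    """
    if not scores_by_judge:
        raise ValueError("need at least one judge score")
    values = list(scores_by_judge.values())
    spread = pstdev(values)  # population standard deviation across judges
    return {
        "mean": mean(values),
        "spread": spread,
        "needs_human_review": spread > disagreement_threshold,
    }


# Usage: three hypothetical judges scoring the same trajectory.
result = aggregate_scores({"judge_a": 0.9, "judge_b": 0.8, "judge_c": 0.2})
print(result["needs_human_review"])
```

Averaging dampens the preference of any single judge, and keeping the spread alongside the mean makes it visible when the average hides a strong disagreement.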

Prerequisites

Agent Eval Fundamentals

Single-shot LLM eval ≠ agent eval. Measure trajectory + outcome + efficiency together.

General Agent Benchmarks

Production Observability Stack
