LLM-as-Judge for Trajectory Eval

A strong model scores the agent trajectory across 5-7 dimensions, replacing manual review.

3 hours · 1 prerequisite

Manual trajectory review does not scale (10 steps × 100 tasks = 1000 reviews). An LLM judge can instead:

  1. Score final outcome success/fail
  2. Rate each tool call "necessary / sufficient / efficient"
  3. Score the trajectory as a whole "natural / convoluted"
  4. Categorize failure reasons (wrong tool, hallucination, infinite loop, etc.)
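
The four judging steps above can be sketched as one structured verdict per trajectory. This is a minimal sketch, not a definitive implementation: the names `Verdict`, `build_prompt`, `parse_verdict`, and `judge_trajectory` are hypothetical, the model is passed in as a plain `call_model` function rather than tied to any real API, and reading "necessary / sufficient / efficient" as three yes/no criteria per tool call is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical label sets based on the rubric above; adjust to your own taxonomy.
CRITERIA = ("necessary", "sufficient", "efficient")
SHAPES = ("natural", "convoluted")
FAILURE_REASONS = ("wrong_tool", "hallucination", "infinite_loop", "other")


@dataclass
class Verdict:
    success: bool                  # step 1: final outcome
    tool_calls: list               # step 2: one {criterion: bool} dict per call
    shape: str                     # step 3: "natural" or "convoluted"
    failure_reason: Optional[str]  # step 4: None when nothing went wrong


def build_prompt(task: str, steps: list) -> str:
    """Render the task and its numbered tool calls into a rubric prompt."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    return (
        f"Task: {task}\nTool calls:\n{numbered}\n\n"
        "Reply with JSON only, using the keys: "
        '"success" (bool), '
        f'"tool_calls" (one object per call with boolean keys {list(CRITERIA)}), '
        f'"shape" (one of {list(SHAPES)}), '
        f'"failure_reason" (null or one of {list(FAILURE_REASONS)}).'
    )


def parse_verdict(raw: str, n_steps: int) -> Verdict:
    """Validate the judge's JSON reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    calls = data.get("tool_calls")
    if not isinstance(calls, list) or len(calls) != n_steps:
        raise ValueError("expected exactly one rating per tool call")
    for call in calls:
        if not isinstance(call, dict) or any(
            not isinstance(call.get(c), bool) for c in CRITERIA
        ):
            raise ValueError("each rating needs boolean " + ", ".join(CRITERIA))
    if not isinstance(data.get("success"), bool):
        raise ValueError("'success' must be a boolean")
    if data.get("shape") not in SHAPES:
        raise ValueError("unknown trajectory shape")
    reason = data.get("failure_reason")
    if reason is not None and reason not in FAILURE_REASONS:
        raise ValueError("unknown failure reason")
    return Verdict(data["success"], calls, data["shape"], reason)


def judge_trajectory(task: str, steps: list, call_model: Callable[[str], str]) -> Verdict:
    """Ask the judge model (any prompt -> text function) and parse its verdict."""
    return parse_verdict(call_model(build_prompt(task, steps)), len(steps))
```

Validating the reply strictly is a deliberate choice here: a judge that returns free text or skips a tool call should fail loudly instead of silently producing a partial score.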

Pitfall: judge bias. A judge may favor trajectories that resemble its own model's output. Mitigation: use multiple judges and average their scores.
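
Averaging across judges might look like the sketch below. It assumes each judge has already produced a numeric score per dimension (the 1-5 scale, the judge names, and the dimension names are illustrative, not from the source), and `combine_judges` is a hypothetical helper. Reporting the spread next to the mean is an addition beyond plain averaging, included because a large disagreement is often a sign that one judge is biased.

```python
from statistics import mean


def combine_judges(scores_by_judge: dict) -> dict:
    """Average each dimension across judges and report the max-min spread."""
    if not scores_by_judge:
        raise ValueError("need at least one judge")
    per_judge = list(scores_by_judge.values())
    # Only keep dimensions that every judge scored.
    shared = set(per_judge[0]).intersection(*per_judge[1:])
    combined = {}
    for dim in sorted(shared):
        values = [scores[dim] for scores in per_judge]
        combined[dim] = {
            "mean": mean(values),
            "spread": max(values) - min(values),
        }
    return combined


# Illustrative scores on an assumed 1-5 scale from three hypothetical judges.
scores = {
    "judge_a": {"efficiency": 4, "naturalness": 5},
    "judge_b": {"efficiency": 2, "naturalness": 5},
    "judge_c": {"efficiency": 3, "naturalness": 2},
}
result = combine_judges(scores)
```

Trajectories where the spread on some dimension is large are reasonable candidates for a manual spot check.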

Prerequisites