topic · core
LLM-as-Judge for Trajectory Eval
A strong model scores each agent trajectory across 5-7 dimensions, replacing manual review.
3 hours · 1 prerequisite
Manual trajectory review does not scale: 10 steps × 100 tasks = 1,000 step-level reviews. An LLM judge can instead:
- Score the final outcome as success or fail
- Rate each tool call "necessary / sufficient / efficient"
- Score the trajectory as a whole "natural / convoluted"
- Categorize failure reasons (wrong tool, hallucination, infinite loop, etc.)
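The four judging tasks above can be sketched as one rubric prompt plus a strict parser for the judge's reply. This is a minimal sketch, not a reference implementation. The names `Verdict`, `build_judge_prompt`, `parse_verdict`, and `judge_trajectory` are hypothetical, the JSON reply schema is an assumption, and `call_judge` stands in for whatever client sends a prompt to the judge model and returns its text. It also assumes "necessary / sufficient / efficient" are three separate yes/no criteria per tool call.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Assumed label sets, taken from the bullets above; adjust to your own rubric.
STEP_CRITERIA = ("necessary", "sufficient", "efficient")
SHAPES = ("natural", "convoluted")
FAILURE_REASONS = ("none", "wrong_tool", "hallucination", "infinite_loop", "other")


@dataclass
class Verdict:
    success: bool
    step_ratings: list  # one {criterion: bool} dict per tool call
    shape: str
    failure_reason: str


def build_judge_prompt(task: str, tool_calls: list) -> str:
    """Render the task and its tool calls into a rubric prompt."""
    numbered = "\n".join(f"{i}. {call}" for i, call in enumerate(tool_calls, start=1))
    return (
        "Grade this agent trajectory.\n"
        f"Task: {task}\n"
        f"Tool calls:\n{numbered}\n"
        "Reply with one JSON object with keys:\n"
        "  success: true or false\n"
        f"  step_ratings: one object per tool call with boolean keys {list(STEP_CRITERIA)}\n"
        f"  shape: one of {list(SHAPES)}\n"
        f"  failure_reason: one of {list(FAILURE_REASONS)}\n"
    )


def parse_verdict(raw: str, n_steps: int) -> Verdict:
    """Parse the judge's JSON reply, rejecting anything outside the rubric."""
    data = json.loads(raw)
    ratings = data["step_ratings"]
    if len(ratings) != n_steps:
        raise ValueError(f"expected {n_steps} step ratings, got {len(ratings)}")
    for rating in ratings:
        if set(rating) != set(STEP_CRITERIA):
            raise ValueError(f"unexpected step rating keys: {sorted(rating)}")
    if data["shape"] not in SHAPES:
        raise ValueError(f"unknown shape: {data['shape']!r}")
    if data["failure_reason"] not in FAILURE_REASONS:
        raise ValueError(f"unknown failure reason: {data['failure_reason']!r}")
    return Verdict(bool(data["success"]), ratings, data["shape"], data["failure_reason"])


def judge_trajectory(task: str, tool_calls: list, call_judge: Callable[[str], str]) -> Verdict:
    """Send the rubric prompt to the judge and validate its reply."""
    return parse_verdict(call_judge(build_judge_prompt(task, tool_calls)), len(tool_calls))
```

Validating the reply against fixed label sets keeps malformed or off-rubric judgments out of the aggregate statistics. A rejected reply can be retried or flagged for the small amount of manual review that remains.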
Pitfall: judge bias. A judge may favor trajectories that resemble its own model's outputs. Mitigation: use multiple judges and average their scores.
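Averaging across judges could look like the sketch below. It assumes each judge's verdict has already been converted to numeric scores per dimension (for example 1.0 for success, 0.0 for fail). The function name `average_judges`, the judge names, and the dimension names are all hypothetical.

```python
from statistics import mean


def average_judges(scores_by_judge: dict) -> dict:
    """Average each dimension's score across judges.

    scores_by_judge maps a judge name to a {dimension: score} dict.
    All judges must score the same dimensions; otherwise the averages
    would silently mix different sample sizes.
    """
    if not scores_by_judge:
        raise ValueError("need at least one judge")
    judges = list(scores_by_judge.values())
    dimensions = set(judges[0])
    for scores in judges[1:]:
        if set(scores) != dimensions:
            raise ValueError("judges scored different dimensions")
    return {dim: mean(scores[dim] for scores in judges) for dim in sorted(dimensions)}


# Example with two hypothetical judges.
averaged = average_judges({
    "judge_a": {"outcome": 1.0, "naturalness": 0.5},
    "judge_b": {"outcome": 0.0, "naturalness": 1.0},
})
print(averaged)  # {'naturalness': 0.75, 'outcome': 0.5}
```

Averaging only dilutes self-preference if the judges do not all share the same bias, so drawing them from different model families is the natural choice here.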