Topic · Advanced
Agent Eval
Trajectory eval — evaluate the agent's INTERMEDIATE steps, not just the final answer.
3 hours · 2 resources · 2 prereqs
Agent eval differs from single-shot LLM eval because agents make multiple tool calls and produce long trajectories, so a single score on the final output misses most of what happened. Eval dimensions:
- Task success — was the final goal met?
- Efficiency — how many steps, tokens, dollars?
- Tool selection — picked the right tool?
- Recovery — recovered after errors?
- Trajectory diversity — are trajectories consistent across repeated runs of the same task?
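
The dimensions above can be turned into simple numbers once trajectories are logged. The following is a minimal sketch, not a standard harness: the `Step` and `Trajectory` records, the `expected_tool` gold label, and the metric definitions are all assumptions made for illustration. In particular, "recovery" is approximated here as "a failed step is followed by a later step that does not fail", and run-to-run consistency as the share of runs that use the most common tool sequence.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    tool: str                            # tool the agent actually called
    expected_tool: Optional[str] = None  # hypothetical gold label, if annotated
    error: bool = False                  # True if this call failed
    tokens: int = 0                      # tokens spent on this step


@dataclass
class Trajectory:
    steps: List[Step]
    success: bool                        # was the final goal met?
    cost_usd: float = 0.0


def tool_selection_accuracy(traj: Trajectory) -> Optional[float]:
    """Share of labeled steps where the agent picked the expected tool."""
    labeled = [s for s in traj.steps if s.expected_tool is not None]
    if not labeled:
        return None
    return sum(s.tool == s.expected_tool for s in labeled) / len(labeled)


def recovery_rate(traj: Trajectory) -> Optional[float]:
    """Share of failed steps followed by at least one later non-failing step."""
    failed = [i for i, s in enumerate(traj.steps) if s.error]
    if not failed:
        return None
    recovered = sum(
        any(not later.error for later in traj.steps[i + 1:]) for i in failed
    )
    return recovered / len(failed)


def run_consistency(runs: List[Trajectory]) -> float:
    """Share of runs that follow the most common tool sequence."""
    sequences = Counter(tuple(s.tool for s in t.steps) for t in runs)
    return sequences.most_common(1)[0][1] / len(runs)


def evaluate(runs: List[Trajectory]) -> dict:
    """Aggregate task success, efficiency, and consistency over repeated runs."""
    n = len(runs)
    return {
        "success_rate": sum(t.success for t in runs) / n,
        "avg_steps": sum(len(t.steps) for t in runs) / n,
        "avg_tokens": sum(s.tokens for t in runs for s in t.steps) / n,
        "avg_cost_usd": sum(t.cost_usd for t in runs) / n,
        "run_consistency": run_consistency(runs),
    }


# Two made-up runs of the same task (tool names are placeholders).
run_a = Trajectory(
    steps=[
        Step("search", "search", tokens=100),
        Step("read_file", "read_file", error=True, tokens=50),
        Step("read_file", "read_file", tokens=50),  # retry succeeds
    ],
    success=True,
    cost_usd=0.02,
)
run_b = Trajectory(
    steps=[
        Step("search", "search", tokens=100),
        Step("calculator", "read_file", tokens=100),  # wrong tool
    ],
    success=False,
    cost_usd=0.02,
)

report = evaluate([run_a, run_b])
```

Per-trajectory metrics return `None` when there is nothing to measure (no labeled steps, no errors) so that an empty case is not silently counted as a perfect or failing score. Real harnesses typically need richer definitions, for example judging recovery by whether the retried call addressed the original error.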
Benchmarks: SWE-bench (coding), GAIA (general agents), AgentBench (multiple interactive environments), τ-bench (tool use in conversations with simulated users).