Topic · Advanced

Agent Eval

Trajectory eval — evaluate the agent's INTERMEDIATE steps, not just the final answer.

3 hours · 2 resources · 2 prereqs

Agent eval differs from single-shot LLM eval: agents make multiple tool calls and produce long trajectories, so there is more to measure than the final output. Eval dimensions:

  1. Task success — was the final goal met?
  2. Efficiency — how many steps, tokens, dollars?
  3. Tool selection — picked the right tool?
  4. Recovery — recovered after errors?
  5. Trajectory diversity — consistent across runs?
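
The five dimensions above can be turned into simple per-run metrics. What follows is a minimal sketch, not a standard implementation: the `Step` and `Trajectory` records, the `expected_tool` reference label, the "an error is recovered if any later step succeeds" rule, and the pairwise exact-match consistency score are all illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One tool call in a trajectory (hypothetical record format)."""
    tool: str                     # tool the agent actually called
    expected_tool: Optional[str]  # reference label; None if this step is unlabelled
    tokens: int                   # tokens consumed by this step
    error: bool = False           # True if the tool call failed


@dataclass
class Trajectory:
    steps: list[Step]
    success: bool                 # was the final goal met?
    cost_usd: float = 0.0


def score(traj: Trajectory) -> dict:
    """Score one run on task success, efficiency, tool selection, and recovery."""
    labelled = [s for s in traj.steps if s.expected_tool is not None]
    correct = sum(s.tool == s.expected_tool for s in labelled)
    # Assumption: an error counts as recovered if any later step ran without error.
    error_idx = [i for i, s in enumerate(traj.steps) if s.error]
    recovered = sum(
        any(not later.error for later in traj.steps[i + 1:]) for i in error_idx
    )
    return {
        "task_success": traj.success,
        "num_steps": len(traj.steps),
        "total_tokens": sum(s.tokens for s in traj.steps),
        "cost_usd": traj.cost_usd,
        "tool_accuracy": correct / len(labelled) if labelled else None,
        "recovery_rate": recovered / len(error_idx) if error_idx else None,
    }


def consistency(runs: list[Trajectory]) -> float:
    """Fraction of run pairs whose tool-call sequences match exactly."""
    seqs = [tuple(s.tool for s in t.steps) for t in runs]
    pairs = [(a, b) for i, a in enumerate(seqs) for b in seqs[i + 1:]]
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(a == b for a, b in pairs) / len(pairs)


# Two made-up runs of the same task.
run_a = Trajectory(
    steps=[
        Step("search", "search", 120),
        Step("calculator", "calculator", 80, error=True),
        Step("calculator", "calculator", 90),
    ],
    success=True,
    cost_usd=0.01,
)
run_b = Trajectory(
    steps=[
        Step("search", "search", 110),
        Step("browser", "calculator", 200),  # wrong tool picked
        Step("calculator", "calculator", 95),
    ],
    success=True,
    cost_usd=0.02,
)

print(score(run_a))
print(score(run_b))
print(consistency([run_a, run_b, run_a]))
```

Both runs succeed on the final goal, yet they differ on tool accuracy, token usage, and recovery, which is the kind of gap a final-answer-only eval would miss. Exact sequence match is a strict consistency measure; a softer one (for example, edit distance over tool sequences) may be more useful in practice.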

Benchmarks: SWE-bench (coding), GAIA (general agents), AgentBench (multiple environments), τ-bench (tool-agent-user interaction).


Agent Eval · Prompt Engineer Roadmap | Şükrü Yusuf Kaya