
Agent Eval Fundamentals

Single-shot LLM eval ≠ agent eval. Measure trajectory + outcome + efficiency together.

3 hours · 1 prerequisite

Agent eval dimensions:

  1. Task success — was the final goal met?
  2. Trajectory quality — were the steps sensible?
  3. Tool selection — did it pick the right tool at each step?
  4. Efficiency — how many steps and tokens did it use?
  5. Recovery — did it recover after errors?
  6. Cost — was the dollar cost per task acceptable?
  7. Latency — did the user get a result in acceptable time?
  8. Safety — did it perform destructive actions without authorization?
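
As a rough illustration of how several of these dimensions could be computed from one recorded run, here is a minimal Python sketch (3.9+). The `Step` and `Run` records, their field names, and the scoring rules are all assumptions made for this example, not a standard schema. Trajectory quality is left out because it usually needs a rubric or a judge rather than a simple count.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str                  # tool the agent actually called
    expected_tool: str         # tool a reference trajectory uses here (hypothetical label)
    error: bool = False        # did this call fail?
    destructive: bool = False  # e.g. delete, overwrite, send
    authorized: bool = True    # was the destructive action approved?


@dataclass
class Run:
    steps: list[Step]
    goal_met: bool
    tokens: int
    cost_usd: float
    latency_s: float


def score_run(run: Run) -> dict[str, float]:
    """Score one agent run on seven of the eight dimensions (illustrative rules)."""
    steps = run.steps
    n = len(steps)
    tool_hits = sum(s.tool == s.expected_tool for s in steps)
    error_idx = [i for i, s in enumerate(steps) if s.error]
    # Assumption: an error counts as recovered if the very next step succeeds.
    recovered = sum(1 for i in error_idx if i + 1 < n and not steps[i + 1].error)
    unsafe = any(s.destructive and not s.authorized for s in steps)
    return {
        "task_success": 1.0 if run.goal_met else 0.0,
        "tool_selection": tool_hits / n if n else 0.0,
        "steps": float(n),              # efficiency, raw count
        "tokens": float(run.tokens),    # efficiency, raw count
        "recovery": recovered / len(error_idx) if error_idx else 1.0,
        "cost_usd": run.cost_usd,
        "latency_s": run.latency_s,
        "safety": 0.0 if unsafe else 1.0,
    }
```

Efficiency, cost, and latency are kept as raw numbers here; whether a value is "acceptable" depends on a per-task budget that the evaluator has to choose.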

Each dimension needs its own metric and its own eval method. Scoring, say, 100 tasks on all 8 dimensions gives a 100 × 8 matrix, and that matrix is the raw data for an agent quality dashboard.
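
One way to picture the tasks × dimensions matrix is as a list of per-task score rows that get averaged column by column. The sketch below assumes every score has already been normalized to the range 0 to 1 with higher meaning better; the dimension keys and the `summarize` and `weakest` helpers are hypothetical names chosen for this example.

```python
from statistics import mean

# Assumed keys, one per dimension in the list above.
DIMENSIONS = [
    "task_success", "trajectory_quality", "tool_selection", "efficiency",
    "recovery", "cost", "latency", "safety",
]


def summarize(matrix: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension over all task rows (column-wise mean)."""
    for i, row in enumerate(matrix):
        missing = set(DIMENSIONS) - row.keys()
        if missing:
            raise ValueError(f"task {i} is missing scores for: {sorted(missing)}")
    return {d: mean(row[d] for row in matrix) for d in DIMENSIONS}


def weakest(summary: dict[str, float], k: int = 3) -> list[str]:
    """Return the k lowest-scoring dimensions, i.e. where to look first."""
    return sorted(summary, key=summary.get)[:k]
```

A plain mean is only a starting point. For a dimension like safety, a dashboard would more likely report the count of violating tasks, since a single unauthorized destructive action matters more than the average suggests.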

Prerequisites