Agent Eval Fundamentals
Single-shot LLM eval ≠ agent eval. Measure trajectory + outcome + efficiency together.
3 hours · 1 prereq
Agent eval dimensions:
- Task success — was the final goal met?
- Trajectory quality — were the steps sensible?
- Tool selection — did it pick the right tool at each step?
- Efficiency — how many steps, tokens, dollars?
- Recovery — did it recover after errors?
- Cost — acceptable cost per task?
- Latency — did the user get a result in acceptable time?
- Safety — did it perform destructive actions without authorization?
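As a rough illustration of how each dimension above could be turned into a number, the sketch below scores one recorded run. The `Step` and `Trajectory` records, the `judge_score` field, and the budget thresholds are all assumptions made for this example, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str            # tool the agent actually called
    expected_tool: str   # tool a reference solution would call (hypothetical label)
    ok: bool             # did the call succeed?
    tokens: int          # tokens spent on this step
    destructive: bool = False
    authorized: bool = True


@dataclass
class Trajectory:
    steps: list[Step]
    goal_met: bool       # outcome check, e.g. from a task-specific assertion
    judge_score: float   # 0..1 rating of step quality from a human or rubric judge
    latency_s: float
    cost_usd: float


def score_trajectory(traj: Trajectory, max_steps: int = 10,
                     max_cost_usd: float = 0.50,
                     max_latency_s: float = 30.0) -> dict[str, float]:
    """Return one score in [0, 1] per dimension (thresholds are assumptions)."""
    steps = traj.steps
    n = len(steps)
    tool_hits = sum(s.tool == s.expected_tool for s in steps)
    # Recovery: share of failed steps that are followed by a later success.
    failures = [i for i, s in enumerate(steps) if not s.ok]
    recovered = sum(any(t.ok for t in steps[i + 1:]) for i in failures)
    unsafe = any(s.destructive and not s.authorized for s in steps)
    return {
        "task_success": float(traj.goal_met),
        "trajectory_quality": traj.judge_score,
        "tool_selection": tool_hits / n if n else 0.0,
        "efficiency": min(1.0, max_steps / n) if n else 0.0,
        "recovery": recovered / len(failures) if failures else 1.0,
        "cost": float(traj.cost_usd <= max_cost_usd),
        "latency": float(traj.latency_s <= max_latency_s),
        "safety": float(not unsafe),
    }


# Made-up run: one wrong tool call that fails, then an unauthorized delete.
demo = Trajectory(
    steps=[
        Step("search", "search", ok=True, tokens=300),
        Step("write_file", "read_file", ok=False, tokens=200),
        Step("read_file", "read_file", ok=True, tokens=250),
        Step("delete_file", "delete_file", ok=True, tokens=100,
             destructive=True, authorized=False),
    ],
    goal_met=True, judge_score=0.75, latency_s=12.0, cost_usd=0.20,
)
scores = score_trajectory(demo)
print(scores)
```

Note that the run reaches its goal yet still scores 0 on safety, which is the kind of failure an outcome-only eval would miss.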
Each dimension needs its own metric and its own eval method. Scoring 100 tasks on all 8 metrics gives a 100 × 8 matrix, which is the raw data behind an agent quality dashboard.
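One way to picture the tasks × metrics matrix is a list of rows, one per task, reduced column by column into dashboard numbers. This is a minimal sketch: the metric names mirror the list above, while `summarize`, the 0.5 pass threshold, and the placeholder scores are assumptions for illustration, not real results.

```python
from statistics import mean

METRICS = ["task_success", "trajectory_quality", "tool_selection", "efficiency",
           "recovery", "cost", "latency", "safety"]


def summarize(matrix: list[list[float]], metrics: list[str],
              pass_threshold: float = 0.5) -> dict[str, dict[str, float]]:
    """Collapse a tasks x metrics score matrix into per-metric summary stats."""
    if not matrix:
        raise ValueError("matrix has no task rows")
    if any(len(row) != len(metrics) for row in matrix):
        raise ValueError("every task row needs one score per metric")
    dashboard = {}
    # zip(*matrix) transposes rows (tasks) into columns (metrics).
    for name, column in zip(metrics, zip(*matrix)):
        dashboard[name] = {
            "mean": mean(column),
            "min": min(column),
            "pass_rate": sum(v >= pass_threshold for v in column) / len(column),
        }
    return dashboard


# Placeholder scores in [0, 1] from an arbitrary formula, NOT real eval data.
matrix = [[((task * 7 + m * 3) % 11) / 10 for m in range(len(METRICS))]
          for task in range(100)]

dashboard = summarize(matrix, METRICS)
for name, stats in dashboard.items():
    print(f"{name:20s} mean={stats['mean']:.2f} "
          f"min={stats['min']:.2f} pass_rate={stats['pass_rate']:.2f}")
```

Keeping the per-task rows rather than only the averages lets the dashboard drill down from a weak metric to the specific tasks that caused it.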