Agent Eval Fundamentals

Single-shot LLM eval ≠ agent eval. Measure trajectory + outcome + efficiency together.

3 hours · 1 prerequisite

Agent eval dimensions:

Task success — was the final goal met?
Trajectory quality — were the steps sensible?
Tool selection — picked the right tool?
Efficiency — how many steps, tokens, dollars?
Recovery — recovered after errors?
Cost — acceptable cost per task?
Latency — did the user get a result in acceptable time?
Safety — did it avoid unauthorized destructive actions?

Each dimension needs its own metric and its own eval method. Scoring 100 tasks on all 8 metrics gives a 100 × 8 matrix, which is the raw data for an agent quality dashboard.
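The task × metric matrix can be sketched in a few lines of Python. This is a minimal illustration, not a standard scoring scheme: the `RunResult` fields, the budget constants, and the pass/fail thresholds are all assumptions chosen for the example, and a real setup would likely replace `trajectory_score` with a rubric or judge output.

```python
from dataclasses import dataclass
from statistics import mean


# Hypothetical record of one agent run; every field name is an assumption.
@dataclass
class RunResult:
    task_id: str
    success: bool                 # was the final goal met?
    trajectory_score: float       # 0..1, e.g. from a rubric or a judge
    correct_tool_calls: int
    total_tool_calls: int
    steps: int
    recovered_errors: int
    total_errors: int
    cost_usd: float
    latency_s: float
    unauthorized_destructive_actions: int


# Assumed budgets; tune these per product.
MAX_STEPS = 20
MAX_COST_USD = 0.50
MAX_LATENCY_S = 60.0


def ratio(part: int, whole: int) -> float:
    """Return part/whole, treating 'nothing to measure' as a perfect score."""
    return part / whole if whole else 1.0


def score_run(r: RunResult) -> dict[str, float]:
    """Turn one run into one matrix row: eight scores in [0, 1]."""
    return {
        "task_success": 1.0 if r.success else 0.0,
        "trajectory": r.trajectory_score,
        "tool_selection": ratio(r.correct_tool_calls, r.total_tool_calls),
        "efficiency": 1.0 if r.steps <= MAX_STEPS else 0.0,
        "recovery": ratio(r.recovered_errors, r.total_errors),
        "cost": 1.0 if r.cost_usd <= MAX_COST_USD else 0.0,
        "latency": 1.0 if r.latency_s <= MAX_LATENCY_S else 0.0,
        "safety": 1.0 if r.unauthorized_destructive_actions == 0 else 0.0,
    }


def dashboard(runs: list[RunResult]) -> dict[str, float]:
    """Average each metric column over all tasks (runs must be non-empty)."""
    rows = [score_run(r) for r in runs]
    return {name: mean(row[name] for row in rows) for name in rows[0]}


runs = [
    RunResult("t1", True, 0.8, 3, 4, 10, 1, 1, 0.10, 12.0, 0),
    RunResult("t2", False, 0.4, 1, 4, 30, 0, 2, 0.90, 80.0, 1),
]
for metric, value in dashboard(runs).items():
    print(f"{metric:15s} {value:.2f}")
```

Keeping the eight columns separate, rather than collapsing them into one score, makes it visible when an agent succeeds on the task but does so slowly, expensively, or unsafely.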

Prerequisites

Trace Logging & Debugging

Log every agent step — model in/out, tool calls, latency, cost. Otherwise debugging is impossible.
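As a rough sketch of what per-step logging might look like, the snippet below wraps each model or tool call, times it, and stores the input, output, latency, and cost. The `Trace` and `StepRecord` names, the `record` helper, and the one-JSON-object-per-line output are assumptions for illustration, not the API of any particular tracing library.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable


# Hypothetical step record; field names are assumptions for this sketch.
@dataclass
class StepRecord:
    step: int
    kind: str          # "model" or "tool"
    name: str          # model name or tool name
    input: str
    output: str
    latency_s: float
    cost_usd: float


@dataclass
class Trace:
    task_id: str
    steps: list[StepRecord] = field(default_factory=list)

    def record(self, kind: str, name: str, fn: Callable[[Any], Any],
               arg: Any, cost_usd: float = 0.0) -> Any:
        """Call fn(arg), time it, append a StepRecord, and return the result."""
        start = time.perf_counter()
        out = fn(arg)
        latency = time.perf_counter() - start
        self.steps.append(
            StepRecord(len(self.steps) + 1, kind, name,
                       str(arg), str(out), latency, cost_usd)
        )
        return out

    def to_jsonl(self) -> str:
        """Serialize the trace as one JSON object per step, one per line."""
        return "\n".join(
            json.dumps({"task_id": self.task_id, **asdict(s)})
            for s in self.steps
        )


# Usage with a stand-in "tool"; a real agent would pass its own callables.
trace = Trace("task-001")
trace.record("tool", "uppercase", str.upper, "hello", cost_usd=0.0)
print(trace.to_jsonl())
```

With every step stored in this shape, metrics such as step count, total cost, and total latency can be computed from the trace after the run.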

Framework Comparison & Selection

SWE-Bench (Coding Agent Benchmark)
