Topic · Advanced

Agent Eval

Trajectory eval — evaluate the agent's INTERMEDIATE steps, not just the final answer.

3 hours · 2 resources · 2 prereqs

Different from single-shot LLM eval: agents make multiple tool calls and produce long trajectories. Eval dimensions:

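Task success: was the final goal met?
Efficiency: how many steps, tokens, and dollars did the run take?
Tool selection: did the agent pick the right tool at each step?
Recovery: did the agent recover after errors?
Trajectory diversity: how much do runs of the same task differ, and are the outcomes consistent?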
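
The dimensions above can be turned into simple per-run numbers. The following is a minimal sketch, not a standard library or benchmark harness: the `Step` and `Trajectory` types, their field names, and the scoring rules (in-order tool match, "a failure counts as recovered if any later step succeeds", "consistency is the share of runs using the most common tool sequence") are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    tool: str              # name of the tool the agent called (hypothetical field)
    ok: bool               # False if the call errored
    tokens: int = 0        # tokens spent on this step
    cost_usd: float = 0.0  # dollars spent on this step


@dataclass
class Trajectory:
    steps: List[Step]
    success: bool          # task success, judged separately (e.g. by tests)


def efficiency(t: Trajectory) -> dict:
    """Step, token, and dollar totals for one run."""
    return {
        "steps": len(t.steps),
        "tokens": sum(s.tokens for s in t.steps),
        "cost_usd": round(sum(s.cost_usd for s in t.steps), 6),
    }


def tool_selection_score(t: Trajectory, expected: List[str]) -> float:
    """Fraction of the expected tool sequence matched, in order, by the run."""
    if not expected:
        return 1.0
    matched = 0
    for s in t.steps:
        if matched < len(expected) and s.tool == expected[matched]:
            matched += 1
    return matched / len(expected)


def recovery_rate(t: Trajectory) -> Optional[float]:
    """Fraction of failed steps followed by a later successful step.

    Returns None when the run had no failures to recover from.
    """
    failures = [i for i, s in enumerate(t.steps) if not s.ok]
    if not failures:
        return None
    recovered = sum(any(s.ok for s in t.steps[i + 1:]) for i in failures)
    return recovered / len(failures)


def consistency(runs: List[Trajectory]) -> float:
    """Share of runs that use the single most common tool sequence."""
    sequences = Counter(tuple(s.tool for s in r.steps) for r in runs)
    return sequences.most_common(1)[0][1] / len(runs)


# Toy data: run_a hits one tool error and retries; run_b goes straight through.
run_a = Trajectory(
    [Step("search", True, 120, 0.001),
     Step("read_file", False, 80, 0.001),
     Step("read_file", True, 90, 0.001)],
    success=True,
)
run_b = Trajectory(
    [Step("search", True, 100, 0.001), Step("read_file", True, 90, 0.001)],
    success=True,
)
```

Both toy runs succeed on the task, but they differ in step count and recovery, which is the information a final-answer-only eval would miss.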

Benchmarks: SWE-bench (coding), GAIA (general assistants), AgentBench (agent tasks across multiple environments), τ-bench (tool use in simulated customer conversations).

Prerequisites

Eval Dataset Design

50-200 real user inputs + expected outputs. 'Looks good to me' isn't eval.
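
A dataset of inputs and expected outputs can be as small as a list of records plus a scoring function. The sketch below is illustrative only: `EvalCase`, `pass_rate`, and the toy system are hypothetical names, and case-insensitive exact match is an assumed scoring rule that real evals often replace with something task-specific.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    user_input: str  # a real user input, collected from logs or interviews
    expected: str    # the output a reviewer agreed is correct


def pass_rate(cases: List[EvalCase], system: Callable[[str], str]) -> float:
    """Fraction of cases where the system output matches the expected output."""
    passed = sum(
        system(c.user_input).strip().lower() == c.expected.strip().lower()
        for c in cases
    )
    return passed / len(cases)


# Two toy cases; a real dataset would hold 50-200 of these.
cases = [EvalCase("2+2", "4"), EvalCase("capital of France", "Paris")]


def toy_system(question: str) -> str:
    # Stand-in for the model or agent under test.
    return {"2+2": "4"}.get(question, "unknown")


score = pass_rate(cases, toy_system)
```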

Agentic Loop Architecture

while(!done) { think → act → observe → update_state } — backbone of modern agents.
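
For trajectory eval, the useful property of this loop is that every iteration can be logged. The following is a minimal sketch under stated assumptions: `run_agent`, the `think` callback, and the dictionary-based step records are hypothetical, and the `max_steps` cap is an added safeguard that the one-line loop above does not mention.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Action = Tuple[str, Any]  # (tool name, argument); an assumed representation


def run_agent(
    think: Callable[[List[dict]], Optional[Action]],
    tools: Dict[str, Callable[[Any], Any]],
    max_steps: int = 10,
) -> List[dict]:
    """Run think -> act -> observe -> update_state and return the logged steps."""
    history: List[dict] = []             # state: everything observed so far
    for _ in range(max_steps):           # hard cap so a stuck agent still stops
        action = think(history)          # think: pick the next action, or None when done
        if action is None:
            break
        name, arg = action
        try:
            observation, ok = tools[name](arg), True   # act
        except Exception as exc:                       # observe failures as well
            observation, ok = repr(exc), False
        history.append(                                # update_state
            {"tool": name, "arg": arg, "observation": observation, "ok": ok}
        )
    return history


def toy_think(history: List[dict]) -> Optional[Action]:
    # Stand-in for a model call: add two numbers once, then stop.
    return ("add", (2, 3)) if not history else None


trajectory = run_agent(toy_think, {"add": lambda pair: pair[0] + pair[1]})
```

The returned list is the trajectory: each record keeps the chosen tool, its argument, the observation, and whether the call succeeded, which is the raw material for the eval dimensions on this page.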

Resources (2)

GitHub (2)

SWE-bench (coding agent benchmark)

GAIA (general assistant)
