Topic · Advanced

General Agent Benchmarks

GAIA, AgentBench, τ-bench, WebArena, OSWorld — standard benchmarks for different use cases.

3 hours · 3 resources · 1 prereq

GAIA (Meta): general-assistant tasks that combine web search, multimodal inputs, and reasoning.
τ-bench (Sierra): customer-service simulation in which the agent converses with a synthetic (LLM-simulated) user.
AgentBench: 8 environments (OS, database, knowledge graph, games, web, etc.) for broad coverage.
WebArena: realistic web tasks such as shopping and map navigation.
OSWorld: desktop OS automation.
SWE-bench: coding (see the previous topic).
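One way to make the benchmark list above actionable is to encode it as a small lookup table and rank entries by how much they overlap with the capabilities a use case needs. The sketch below is illustrative only: the tag vocabulary, the `Benchmark` class, and the `rank_benchmarks` helper are assumptions made for this example and are not part of any benchmark's tooling. Jaccard overlap is just one simple similarity choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    tags: frozenset[str]  # illustrative capability labels (assumed, not official)


# Hypothetical catalog derived from the short descriptions in this topic.
CATALOG = [
    Benchmark("GAIA", frozenset({"web-search", "multimodal", "reasoning"})),
    Benchmark("τ-bench", frozenset({"customer-service", "simulated-user"})),
    Benchmark("AgentBench", frozenset({"os", "database", "web"})),
    Benchmark("WebArena", frozenset({"web", "shopping", "maps"})),
    Benchmark("OSWorld", frozenset({"os", "desktop"})),
    Benchmark("SWE-bench", frozenset({"coding"})),
]


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_benchmarks(needs: set[str]) -> list[tuple[str, float]]:
    """Return (name, overlap score) pairs, best match first."""
    wanted = frozenset(needs)
    scored = [(b.name, jaccard(wanted, b.tags)) for b in CATALOG]
    # sorted() is stable, so ties keep their catalog order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_benchmarks({"customer-service", "simulated-user"}):
        print(f"{name}: {score:.2f}")
```

A tag-overlap score is only a first filter. Reading a handful of actual tasks from the top-ranked benchmark is still the best check that it resembles the target workload.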

Pick the benchmark closest to your use case and measure there. A custom benchmark built from your own tasks gives the most accurate signal, but it is costly to build and maintain.
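To give a sense of what a custom benchmark involves at its smallest, here is a hedged sketch of a harness: a list of tasks, each with its own pass/fail check, and a function that reports the fraction an agent passes. The names `Task`, `run_benchmark`, and `toy_agent` are hypothetical. A real harness would also need environment setup, timeouts, logging, and far more tasks, which is where most of the cost comes from.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # True if the agent's answer is acceptable


def run_benchmark(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Run the agent on every task and return the fraction passed."""
    if not tasks:
        raise ValueError("benchmark needs at least one task")
    passed = 0
    for task in tasks:
        try:
            answer = agent(task.prompt)
        except Exception:
            continue  # an agent crash counts as a failed task
        if task.check(answer):
            passed += 1
    return passed / len(tasks)


# Toy stand-in for a real agent (assumption: an agent is prompt -> answer text).
def toy_agent(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"


TASKS = [
    Task("What is 2 + 2?", lambda a: a.strip() == "4"),
    Task("What is the capital of France?", lambda a: "paris" in a.lower()),
]

if __name__ == "__main__":
    print(run_benchmark(toy_agent, TASKS))  # prints 0.5
```

Keeping each check as a plain function makes it easy to start with exact-match grading and swap in stricter or more lenient graders per task later.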