topic · advanced

SWE-Bench (Coding Agent Benchmark)

Real GitHub issues → agent must fix → tests must pass. Gold standard for coding agents.

3 hours · 2 resources

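SWE-Bench, from Princeton researchers (released in late 2023, published at ICLR 2024), collects real GitHub issue and fix pairs from open-source Python projects. The agent receives the issue text and the repository, and must produce a patch that fixes the code. The project's tests are then run automatically, giving a pass/fail result for each task.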
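The pass/fail check described above can be sketched in a few lines. This is a minimal illustration, not the official harness: the `fail_to_pass` and `pass_to_pass` names mirror the FAIL_TO_PASS and PASS_TO_PASS test lists in the published dataset, while `TaskResult`, `is_resolved`, `resolve_rate`, and the instance and test names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of running the test suite after applying an agent's patch.

    Hypothetical container for illustration; not part of any SWE-Bench API.
    """
    instance_id: str
    fail_to_pass: list[str]   # tests that failed before the fix and must now pass
    pass_to_pass: list[str]   # tests that passed before and must keep passing
    passed_tests: set[str]    # tests observed to pass after the agent's patch


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every required test passes."""
    required = set(result.fail_to_pass) | set(result.pass_to_pass)
    return required <= result.passed_tests


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Two made-up tasks: the first patch fixes the bug, the second breaks
# a previously passing test and therefore does not count.
example_results = [
    TaskResult(
        instance_id="example__repo-1",
        fail_to_pass=["test_bug_is_fixed"],
        pass_to_pass=["test_existing_behaviour"],
        passed_tests={"test_bug_is_fixed", "test_existing_behaviour"},
    ),
    TaskResult(
        instance_id="example__repo-2",
        fail_to_pass=["test_other_bug"],
        pass_to_pass=["test_regression_guard"],
        passed_tests={"test_other_bug"},
    ),
]

if __name__ == "__main__":
    print(f"resolved: {resolve_rate(example_results):.0%}")
```

The all-or-nothing rule is the point of the sketch: a patch that fixes the reported bug but breaks an existing test is scored as a failure, which is why leaderboard numbers are reported as a single "resolved" percentage.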

Variants:

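SWE-Bench Verified: 500 tasks validated by human annotators
SWE-Bench Lite: 300-task subset for faster, cheaper evaluation
SWE-Bench Multimodal: issues with visual content (UI, screenshots) plus code, drawn from JavaScript projects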

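SOTA (2025): Claude 4 Sonnet with an agentic scaffold resolves ~70% of SWE-Bench Verified tasks. The best reported score in 2023 was about 4%, measured on the original benchmark (Verified did not exist yet), so the two figures are not strictly comparable. Extrapolating the trend suggests 85%+ in 2026, but that is a projection, not a measured result.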
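To see how much the "at this rate" projection depends on the assumed growth curve, here is a rough sketch that extrapolates from the two data points quoted above (4% in 2023, 70% in 2025) in two ways. Both models are assumptions chosen for illustration, the function names are hypothetical, and two points cannot validate either curve.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def extrapolate_linear(y0: int, p0: float, y1: int, p1: float, target: int) -> float:
    """Straight-line extrapolation of the raw score (can exceed 1.0)."""
    slope = (p1 - p0) / (y1 - y0)
    return p1 + slope * (target - y1)


def extrapolate_logit(y0: int, p0: float, y1: int, p1: float, target: int) -> float:
    """Straight-line extrapolation in log-odds space (always stays below 1.0)."""
    slope = (logit(p1) - logit(p0)) / (y1 - y0)
    return sigmoid(logit(p1) + slope * (target - y1))


# Figures quoted in the text; treated here as approximate inputs.
linear_2026 = extrapolate_linear(2023, 0.04, 2025, 0.70, 2026)
logit_2026 = extrapolate_logit(2023, 0.04, 2025, 0.70, 2026)

if __name__ == "__main__":
    print(f"linear: {linear_2026:.2f}, log-odds: {logit_2026:.2f}")
```

The linear fit overshoots 100%, which is impossible for a pass rate, while the log-odds fit saturates below 100%. Any 2026 figure is therefore best read as a rough guess that depends on the curve chosen.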

Resources (2)

GitHub (1)

SWE-Bench · en · free

Article (1)

SWE-Bench leaderboard · en · free
