Topic · Advanced

SWE-Bench (Coding Agent Benchmark)

Real GitHub issues → agent must fix → tests must pass. A widely used reference benchmark for coding agents.

3 hours · 2 resources

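Princeton's SWE-Bench (introduced in late 2023, published at ICLR 2024) is built from real GitHub issue and fix pairs drawn from open-source Python projects. The agent gets the issue text and a snapshot of the repository, and is expected to produce a patch that fixes the code. The project's tests are then run automatically to give a pass/fail verdict for each task.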
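The pass/fail rule can be sketched in a few lines. This is a minimal illustration, not the official harness: the two test groups mirror the FAIL_TO_PASS and PASS_TO_PASS fields in the SWE-Bench dataset, while `TaskResult`, `is_resolved`, `resolve_rate`, and the `demo-*` instance IDs are hypothetical names made up for this example. Treating an empty fail-to-pass group as unresolved is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after the agent's patch is applied.

    Each dict maps a test name to True (passed) or False (failed).
    Hypothetical container for illustration only.
    """
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests the fix is supposed to turn green
    pass_to_pass: dict[str, bool]  # tests that already passed and must stay green


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every test in both groups passes."""
    # Assumption: with no fail-to-pass tests there is no evidence of a fix.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved, i.e. the headline benchmark score."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


results = [
    TaskResult("demo-1", {"test_bug": True}, {"test_old": True}),
    TaskResult("demo-2", {"test_bug": True}, {"test_old": False}),  # fix broke an old test
    TaskResult("demo-3", {"test_bug": False}, {"test_old": True}),  # bug not fixed
    TaskResult("demo-4", {"test_bug": True}, {}),
]
print(f"resolved: {resolve_rate(results):.0%}")
```

Requiring both groups to pass is what makes the score strict: a patch that fixes the reported bug but breaks existing behavior (as in `demo-2`) does not count.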

Variants:

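  • SWE-Bench Verified: 500 tasks validated by human annotators
  • SWE-Bench Lite: smaller 300-task subset for faster, cheaper evaluation
  • SWE-Bench Multimodal: issues that include images such as screenshots, drawn from JavaScript front-end projects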

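SOTA (2025): Claude Sonnet 4 with an agentic scaffold resolves roughly 70% of SWE-Bench Verified tasks. The best 2023 result was about 4%, although that figure was measured on the original benchmark, so the two numbers are not strictly comparable. Reaching 85%+ in 2026 is an extrapolation from this trend, not a measured result.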
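What "at this rate" implies depends on how the trend is modeled. The sketch below uses only the two figures quoted above (about 4% in 2023, about 70% in 2025) and treats them as if they were measured on the same scale, which is itself an assumption. Both projection functions are illustrative helpers, not a fitted forecast.

```python
import math


def logit(p: float) -> float:
    """Map a rate in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Map log-odds back to a rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear_projection(p0: float, p1: float, years_between: float, years_ahead: float) -> float:
    """Extend the straight line through the two points, capped at 100%."""
    slope = (p1 - p0) / years_between
    return min(1.0, p1 + slope * years_ahead)


def logistic_projection(p0: float, p1: float, years_between: float, years_ahead: float) -> float:
    """Extend a straight line in log-odds space, so the rate saturates below 100%."""
    slope = (logit(p1) - logit(p0)) / years_between
    return sigmoid(logit(p1) + slope * years_ahead)


# Figures quoted in the text: ~4% (2023) and ~70% (2025), projected one year ahead.
print(f"linear:   {linear_projection(0.04, 0.70, 2, 1):.0%}")
print(f"logistic: {logistic_projection(0.04, 0.70, 2, 1):.0%}")
```

With two data points, the linear reading hits the 100% cap and the logistic reading lands in the mid-90s, so the 85%+ figure is best read as a rough, conservative guess rather than the output of either model.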
