Topic · Core

Agent Regression Testing

Run 100-200 eval tasks on every prompt, tool, or model change to catch regressions before they merge.

2 hours · 1 resource · 1 prerequisite

CI/CD pipeline for agents:

  1. PR opened (prompt / tool / model change)
  2. Eval dataset (100-200 tasks) auto-runs
  3. Diff against passing main-branch baseline
  4. If success rate drops >3% → PR blocked (manual review)
  5. Trajectory diff: which tasks newly failed?
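
The gate in steps 3-5 can be sketched in a few lines of Python. This is a minimal illustration, not the API of any tool named here: the result format (a mapping from task id to pass/fail), the names `regression_gate` and `GateReport`, and the reading of ">3%" as percentage points rather than a relative drop are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical result format: task id -> True if the agent passed that task.
Results = dict[str, bool]


@dataclass
class GateReport:
    baseline_rate: float      # success rate on the main-branch baseline
    candidate_rate: float     # success rate on the PR's eval run
    newly_failed: list[str]   # tasks that passed on main but not on the PR
    blocked: bool             # True if the PR should go to manual review


def success_rate(results: Results) -> float:
    """Fraction of tasks that passed (0.0 for an empty run)."""
    return sum(results.values()) / len(results) if results else 0.0


def regression_gate(baseline: Results, candidate: Results,
                    max_drop_points: float = 3.0) -> GateReport:
    """Compare a PR's eval run against the main-branch baseline.

    Assumption: the threshold is measured in percentage points.
    """
    base = success_rate(baseline)
    cand = success_rate(candidate)
    drop_points = (base - cand) * 100
    # Task-level diff: passed on main, fails (or is missing) on the PR.
    newly_failed = sorted(
        task for task, passed in baseline.items()
        if passed and not candidate.get(task, False)
    )
    return GateReport(base, cand, newly_failed, drop_points > max_drop_points)


if __name__ == "__main__":
    # Synthetic example data: 100 tasks, 90 pass on main, 85 pass on the PR.
    baseline = {f"task-{i:03d}": i < 90 for i in range(100)}
    candidate = {f"task-{i:03d}": i < 85 for i in range(100)}
    report = regression_gate(baseline, candidate)
    print("blocked:", report.blocked)
    print("newly failed:", report.newly_failed)
    # In CI, a non-zero exit code is what fails the check and blocks the PR.
    raise SystemExit(1 if report.blocked else 0)
```

A CI step could run a script like this after the eval run finishes. The list of newly failed tasks is the starting point for the trajectory diff, since it tells the reviewer which task traces to compare between the two runs.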

Stack: Promptfoo (yes, also evals agents), LangSmith evaluations, custom GitHub Actions workflow.
