Agent Regression Testing
Run 100-200 eval tasks on every prompt, tool, or model change to catch regressions before they merge.
2 hours · 1 resource · 1 prereq
CI/CD pipeline for agents:
- PR opened (prompt / tool / model change)
- Eval dataset (100-200 tasks) auto-runs
- Diff against passing main-branch baseline
- If the success rate drops more than 3% below the baseline → PR blocked pending manual review
- Trajectory diff: which tasks newly failed?
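The gating and diff steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `regression_gate` and `GateResult` are hypothetical, results are assumed to be a simple mapping of task ID to pass/fail, and the 3% threshold is read as an absolute drop in success rate (percentage points), which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    baseline_rate: float      # success rate on the main-branch baseline
    candidate_rate: float     # success rate on the PR under test
    newly_failed: list[str]   # tasks that passed on baseline but fail now
    blocked: bool             # True if the PR should go to manual review


def regression_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_drop: float = 0.03,  # assumption: absolute drop in success rate
) -> GateResult:
    """Compare per-task pass/fail results and decide whether to block the PR."""
    # Only compare tasks present in both runs so the rates are comparable.
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no tasks in common between baseline and candidate")

    baseline_rate = sum(baseline[t] for t in shared) / len(shared)
    candidate_rate = sum(candidate[t] for t in shared) / len(shared)

    # Diff step: which tasks passed before and fail now?
    newly_failed = [t for t in shared if baseline[t] and not candidate[t]]

    blocked = (baseline_rate - candidate_rate) > max_drop
    return GateResult(baseline_rate, candidate_rate, newly_failed, blocked)


if __name__ == "__main__":
    # Toy data: 100 tasks pass on main, 4 of them fail on the PR.
    main_run = {f"task-{i}": True for i in range(100)}
    pr_run = dict(main_run)
    for i in range(4):
        pr_run[f"task-{i}"] = False

    result = regression_gate(main_run, pr_run)
    print("blocked:", result.blocked)
    print("newly failed:", result.newly_failed)
```

In a CI job, a script like this would typically exit non-zero when `blocked` is true so the check fails, and print `newly_failed` so reviewers know which trajectories to inspect first.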
Stack: Promptfoo (which also evaluates agents, not just prompts), LangSmith evaluations, and a custom GitHub Actions workflow.