Topic · Core

Agent Regression Testing

Run 100-200 eval tasks on every prompt, tool, or model change to catch regressions before they merge.

2 hours · 1 resource · 1 prerequisite

CI/CD pipeline for agents:

PR opened (prompt / tool / model change)
Eval dataset (100-200 tasks) auto-runs
Diff against passing main-branch baseline
If success rate drops >3% → PR blocked (manual review)
Trajectory diff: which tasks newly failed?
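
The gating steps above can be sketched in a few lines of Python. This is a minimal illustration, not the API of any particular tool: `regression_gate`, `GateResult`, and the `{task_id: passed}` result format are hypothetical names chosen for the example, and the ">3%" threshold is assumed here to mean an absolute drop of 3 percentage points in success rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    baseline_rate: float
    candidate_rate: float
    newly_failed: list[str]  # tasks that passed on main but fail on the PR
    blocked: bool


def success_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def regression_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_drop: float = 0.03,  # assumption: 3 percentage points, absolute
) -> GateResult:
    # Compare only tasks present in both runs so that added or removed
    # tasks do not distort the diff.
    shared = sorted(baseline.keys() & candidate.keys())
    base = {task: baseline[task] for task in shared}
    cand = {task: candidate[task] for task in shared}

    base_rate = success_rate(base)
    cand_rate = success_rate(cand)

    # Trajectory diff at task level: which tasks newly failed?
    newly_failed = [task for task in shared if base[task] and not cand[task]]

    blocked = (base_rate - cand_rate) > max_drop
    return GateResult(base_rate, cand_rate, newly_failed, blocked)


if __name__ == "__main__":
    main_run = {f"t{i}": i < 90 for i in range(100)}  # 90 of 100 pass
    pr_run = {f"t{i}": i < 86 for i in range(100)}    # 86 of 100 pass
    result = regression_gate(main_run, pr_run)
    print(result.blocked, result.newly_failed)
```

In a CI job, a script like this could exit non-zero when `blocked` is true and print `newly_failed` so the reviewer knows which trajectories to inspect first.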

Stack: Promptfoo (yes, also evals agents), LangSmith evaluations, custom GitHub Actions workflow.

Prerequisites

Agent Eval Fundamentals

Single-shot LLM eval ≠ agent eval. Measure trajectory + outcome + efficiency together.
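
One way to picture scoring the three dimensions together is sketched below. All names (`AgentRun`, `RunScore`, `score_run`) and the specific metrics are assumptions made for illustration: trajectory is scored as a greedy in-order match against an expected tool sequence, outcome as a case-insensitive exact match, and efficiency as a step budget divided by steps used, capped at 1.0. Real eval suites often use richer graders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    tool_calls: list[str]  # ordered tool names the agent invoked
    final_answer: str
    steps: int             # total steps the agent took


@dataclass(frozen=True)
class RunScore:
    trajectory: float  # share of expected tool calls matched in order
    outcome: bool      # did the final answer match?
    efficiency: float  # step_budget / steps, capped at 1.0


def trajectory_score(actual: list[str], expected: list[str]) -> float:
    """Greedy in-order match: fraction of expected calls found in sequence."""
    if not expected:
        return 1.0
    remaining = iter(actual)
    matched = 0
    for want in expected:
        # Consume actual calls until the next expected call is found.
        for got in remaining:
            if got == want:
                matched += 1
                break
    return matched / len(expected)


def score_run(
    run: AgentRun,
    expected_tools: list[str],
    expected_answer: str,
    step_budget: int,
) -> RunScore:
    outcome = run.final_answer.strip().lower() == expected_answer.strip().lower()
    efficiency = min(1.0, step_budget / run.steps) if run.steps > 0 else 0.0
    return RunScore(
        trajectory=trajectory_score(run.tool_calls, expected_tools),
        outcome=outcome,
        efficiency=efficiency,
    )
```

Keeping the three numbers separate, rather than folding them into one score, makes it visible when an agent reaches the right answer through a wasteful or wrong path.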

Resources (1)

GitHub (1)

Production Observability Stack
