topic · core
Eval Dataset Design
50-200 real user inputs + expected outputs. 'Looks good to me' isn't eval.
4 hours · 1 resource
Steps:
- Collect 100-200 real questions from production traffic
- For each question, manually write a 'golden answer' or define pass/fail criteria
- Run the set on every prompt change
- Inspect the diff against the previous run: which cases are newly broken?
Without this discipline, prompt improvement becomes guesswork.
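The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `EvalCase`, `must_contain`, `run_eval`, `diff_runs`, and the two stub "models" are hypothetical names invented for the example, and the substring check stands in for whatever pass/fail criteria you define. In practice, `generate` would call your real model with the prompt under test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str                 # a real user input
    must_contain: tuple[str, ...]  # assumed pass/fail criterion: required phrases


def passes(case: EvalCase, answer: str) -> bool:
    """A case passes if every required phrase appears in the answer."""
    text = answer.lower()
    return all(phrase.lower() in text for phrase in case.must_contain)


def run_eval(cases: list[EvalCase], generate: Callable[[str], str]) -> dict[str, bool]:
    """Run every case through `generate` and record pass/fail per case id."""
    return {case.case_id: passes(case, generate(case.question)) for case in cases}


def diff_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two runs and list the cases whose status changed."""
    newly_broken = sorted(
        cid for cid, ok in candidate.items() if baseline.get(cid) and not ok
    )
    newly_fixed = sorted(
        cid for cid, ok in candidate.items()
        if ok and cid in baseline and not baseline[cid]
    )
    return {"newly_broken": newly_broken, "newly_fixed": newly_fixed}


# Two hand-written cases; a real set would hold 50-200 from production traffic.
CASES = [
    EvalCase("refund-01", "How do I get a refund?", ("30 days",)),
    EvalCase("reset-01", "How do I reset my password?", ("reset link",)),
]


def old_prompt_model(question: str) -> str:
    # Stub standing in for the model with the previous prompt.
    answers = {
        "How do I get a refund?": "Refunds are available within 30 days.",
        "How do I reset my password?": "Contact support.",
    }
    return answers[question]


def new_prompt_model(question: str) -> str:
    # Stub standing in for the model with the changed prompt.
    answers = {
        "How do I get a refund?": "Please contact billing.",
        "How do I reset my password?": "We will email you a reset link.",
    }
    return answers[question]


baseline = run_eval(CASES, old_prompt_model)
candidate = run_eval(CASES, new_prompt_model)
report = diff_runs(baseline, candidate)
print(report)
```

Here the prompt change fixes the password case but breaks the refund case, which is the kind of regression that a quick manual look at a handful of outputs tends to miss.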