Eval Dataset Design

A set of 50-200 real user inputs paired with expected outputs. 'Looks good to me' is not an eval.

4 hours · 1 resource

Steps:

  1. Collect 100-200 real questions from production traffic
  2. For each question, manually write a 'golden answer' or define pass/fail criteria
  3. Run the set on every prompt change
  4. Inspect the diff — what's newly broken?
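
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `EvalCase` structure, the substring-based pass/fail check, and the two stub "models" are all assumptions made for the example. In practice `generate` would call your real prompt and model, and the criteria could be anything you can check automatically.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One real user question plus a pass/fail criterion (hypothetical format)."""
    case_id: str
    question: str
    must_contain: tuple[str, ...]  # assumed criterion: answer must mention all of these


def passes(case: EvalCase, answer: str) -> bool:
    """Case-insensitive check that every required phrase appears in the answer."""
    text = answer.lower()
    return all(phrase.lower() in text for phrase in case.must_contain)


def run_eval(cases: list[EvalCase], generate: Callable[[str], str]) -> dict[str, bool]:
    """Run every case through `generate` and record pass/fail per case id."""
    return {case.case_id: passes(case, generate(case.question)) for case in cases}


def newly_broken(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed under the old prompt but fail under the new one."""
    return sorted(
        case_id
        for case_id, ok in baseline.items()
        if ok and not candidate.get(case_id, False)
    )


# Two illustrative cases; a real set would hold questions taken from production traffic.
CASES = [
    EvalCase("refund-window", "How long do I have to request a refund?", ("30 days",)),
    EvalCase("reset-password", "How do I reset my password?", ("settings", "reset link")),
]


def old_prompt_model(question: str) -> str:
    """Stub standing in for the model under the current prompt."""
    answers = {
        "How long do I have to request a refund?": "You can request a refund within 30 days of purchase.",
        "How do I reset my password?": "Open Settings and click 'Send reset link'.",
    }
    return answers[question]


def new_prompt_model(question: str) -> str:
    """Stub standing in for the model after a prompt change."""
    answers = {
        "How long do I have to request a refund?": "Refunds are available for 30 days.",
        "How do I reset my password?": "Please contact support to change your password.",
    }
    return answers[question]


baseline = run_eval(CASES, old_prompt_model)
candidate = run_eval(CASES, new_prompt_model)
print(newly_broken(baseline, candidate))  # → ['reset-password']
```

Comparing per-case results, rather than a single aggregate pass rate, is what makes regressions visible: a prompt change can fix one case and break another while the overall score stays flat.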

Without this discipline, prompt improvement becomes guesswork.

Resources (1)