
LLM-as-Judge

Use a stronger model to score outputs: roughly 10× faster than manual review and about 80% accurate.

3 hours · 1 resource · 1 prerequisite

Have the judge score each output from 1 to 5 against criteria such as "accuracy", "tone", and "instruction following". Pairwise comparison ("Is A or B better?") is generally more reliable than absolute scoring.
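As a rough illustration of 1-5 rubric scoring, the sketch below builds one prompt per criterion and parses a digit out of the judge's reply. The `judge` callable (prompt in, raw text out) is a hypothetical stand-in for whatever model client is in use, and the prompt wording is an assumption, not a recommended template.

```python
import re
from typing import Callable, Dict, Optional

# Criteria taken from the text above; extend as needed.
CRITERIA = ["accuracy", "tone", "instruction following"]


def build_rubric_prompt(task: str, output: str, criterion: str) -> str:
    """Assemble a single-criterion scoring prompt (wording is illustrative)."""
    return (
        f"Task:\n{task}\n\n"
        f"Response:\n{output}\n\n"
        f"Rate the response for {criterion} on a scale of 1 (poor) to 5 (excellent).\n"
        "Reply with a single digit only."
    )


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the reply, or None if absent."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def score_output(
    task: str, output: str, judge: Callable[[str], str]
) -> Dict[str, Optional[int]]:
    """Score one output on every criterion using the hypothetical judge callable."""
    return {
        criterion: parse_score(judge(build_rubric_prompt(task, output, criterion)))
        for criterion in CRITERIA
    }
```

Scoring one criterion per call keeps each reply trivial to parse; a reply with no usable digit comes back as `None` so it can be retried or dropped rather than silently counted.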

Pitfalls:

  • Position bias (favors the first item shown)
  • Length bias (favors the longer answer)
  • Self-preference (favors outputs from its own model)

→ Randomize positions, write criteria clearly, use multiple judges.
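A minimal sketch of two of these mitigations, assuming each judge is a hypothetical callable that takes a prompt and returns raw text: the A/B order is randomized per judge call, the label is mapped back to the original outputs, and several judges vote. The prompt wording and function names are illustrative only.

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence

Judge = Callable[[str], str]  # hypothetical: prompt in, raw reply out


def pairwise_prompt(task: str, first: str, second: str) -> str:
    """Assemble an A/B comparison prompt (wording is illustrative)."""
    return (
        f"Task:\n{task}\n\n"
        f"Response A:\n{first}\n\n"
        f"Response B:\n{second}\n\n"
        "Which response is better? Reply with exactly 'A' or 'B'."
    )


def judge_pair(
    task: str, out_1: str, out_2: str, judge: Judge, rng: random.Random
) -> Optional[int]:
    """Return 1 or 2 for the preferred original output, None if unparseable."""
    swapped = rng.random() < 0.5  # randomize which output is shown first
    first, second = (out_2, out_1) if swapped else (out_1, out_2)
    reply = judge(pairwise_prompt(task, first, second)).strip().upper()
    if reply not in ("A", "B"):
        return None
    picked_first = reply == "A"
    # Undo the swap so the result refers to the caller's ordering.
    if picked_first:
        return 2 if swapped else 1
    return 1 if swapped else 2


def majority_vote(
    task: str, out_1: str, out_2: str, judges: Sequence[Judge], seed: int = 0
) -> Optional[int]:
    """Collect one vote per judge; return the winner, or None on a tie."""
    rng = random.Random(seed)
    votes = Counter()
    for judge in judges:
        winner = judge_pair(task, out_1, out_2, judge, rng)
        if winner is not None:
            votes[winner] += 1
    if votes[1] == votes[2]:
        return None
    return 1 if votes[1] > votes[2] else 2
```

Randomizing the order does not remove position bias from any single call; it spreads the bias evenly across both outputs so it tends to cancel over many comparisons. Using judges from different model families is one way to dilute self-preference.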
