LLM-as-Judge
Use a stronger model to score outputs: roughly 10× faster than manual review, with about 80% agreement with human reviewers.
3 hours · 1 resource · 1 prereq
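Have the judge score each output from 1 to 5 against explicit criteria such as "accuracy", "tone", and "instruction following". Pairwise comparison ("Is A or B better?") is usually more reliable than absolute scoring, because the judge only has to rank two answers rather than calibrate a numeric scale.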
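A minimal sketch of the 1-5 rubric approach, assuming the judge model is asked to reply with one `name: score` line per criterion. The function names `build_rubric_prompt` and `parse_scores` and the reply format are illustrative assumptions, not part of any particular library; the call to the judge model itself is left out.

```python
import re

# Criteria taken from the text above; swap in your own.
CRITERIA = ("accuracy", "tone", "instruction following")


def build_rubric_prompt(task, output, criteria=CRITERIA):
    """Build a grading prompt that asks for one 1-5 score per criterion."""
    lines = [
        "You are grading a model output.",
        f"Task: {task}",
        f"Output: {output}",
        "Rate each criterion from 1 (poor) to 5 (excellent).",
        "Reply with one line per criterion in the form 'name: score'.",
    ]
    lines += [f"- {c}" for c in criteria]
    return "\n".join(lines)


def parse_scores(reply, criteria=CRITERIA):
    """Extract 'name: score' pairs from the judge's reply.

    Criteria that are missing or scored outside 1-5 are skipped,
    so the caller can detect an incomplete grading.
    """
    scores = {}
    for c in criteria:
        pattern = rf"{re.escape(c)}\s*:\s*([1-5])\b"
        match = re.search(pattern, reply, flags=re.IGNORECASE)
        if match:
            scores[c] = int(match.group(1))
    return scores
```

Asking for a fixed reply format keeps parsing simple; a reply that omits a criterion shows up as a missing key rather than a silently wrong score.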
Pitfalls:
- Position bias (favors first item)
- Length bias (favors longer answer)
- Self-preference (favors outputs from its own model)
→ Fixes: randomize answer order, write explicit scoring criteria, and aggregate several judges.
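Two of these mitigations, randomized positions and multiple judges, can be sketched as follows. Each judge is assumed to be a callable that takes the two answers in display order and returns `"first"` or `"second"`; that interface, and the names `compare_once` and `majority_vote`, are hypothetical stand-ins for whatever wraps your judge model.

```python
import random
from collections import Counter


def compare_once(judge, answer_a, answer_b, rng):
    """Ask one judge, showing the two answers in random order.

    Returns "A" or "B", mapped back to the original labels, so a judge
    that always favors the first slot does not always favor answer A.
    """
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(first, second)  # hypothetical judge interface
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if verdict == "first":
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def majority_vote(judges, answer_a, answer_b, seed=0):
    """Collect one vote per judge and return "A", "B", or "tie"."""
    if not judges:
        raise ValueError("need at least one judge")
    rng = random.Random(seed)
    votes = Counter(compare_once(j, answer_a, answer_b, rng) for j in judges)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]
```

Using judges from different model families also helps with self-preference, since no single model's taste dominates the vote. Length bias is not addressed here; it is usually handled in the criteria themselves, for example by telling the judge not to reward verbosity.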