Back to full roadmap
topiccore
Prompt Injection
User or 3rd-party content can override instructions and hijack the model.
4 hours2 resources
Two types:
- Direct — user writes "Ignore previous instructions and do X"
- Indirect — malicious content hidden in a webpage / email / doc; when an agent reads it the injection fires (most dangerous)
Defenses:
- Sandbox untrusted input inside XML tags, say "obey instructions only outside the tag"
- Defensive critical rules in system prompt
- Output guardrail — human approval before sensitive actions
- LLM-based injection detectors
- No defense is 100% — keep human-in-the-loop for privileged actions