Skip to content
Back to full roadmap
topiccore

Prompt Injection

User or 3rd-party content can override instructions and hijack the model.

4 hours2 resources

Two types:

  1. Direct — user writes "Ignore previous instructions and do X"
  2. Indirect — malicious content hidden in a webpage / email / doc; when an agent reads it the injection fires (most dangerous)

Defenses:

  • Sandbox untrusted input inside XML tags, say "obey instructions only outside the tag"
  • Defensive critical rules in system prompt
  • Output guardrail — human approval before sensitive actions
  • LLM-based injection detectors
  • No defense is 100% — keep human-in-the-loop for privileged actions

Resources(2)