Skip to content
Back to full roadmap
topicadvanced

Jailbreak Defense

DAN, role-play, hypothetical-scenario, encoded attacks — bypassing the model's safety training.

3 hours1 resources1 prereqs

Layered defense:

  • Model-level safety (the vendor already provides)
  • Input classifier (Llama Guard, Granite Guardian)
  • Output classifier (toxicity, harmful instructions)
  • Pattern-based pre-filter (known jailbreak phrases)
  • Behavior monitoring (anomalous usage patterns)

Continuously test with red-teaming.

Prerequisites

Resources(1)