Topic · Advanced
Jailbreak Defense
DAN, role-play, hypothetical-scenario, and encoded attacks: prompts designed to bypass the model's safety training.
3 hours · 1 resource · 1 prereq
Layered defense:
- Model-level safety (already provided by the vendor)
- Input classifier (Llama Guard, Granite Guardian)
- Output classifier (toxicity, harmful instructions)
- Pattern-based pre-filter (known jailbreak phrases)
- Behavior monitoring (anomalous usage patterns)
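The layers above can be chained so that the cheapest check runs first and any layer can stop the request. The following is a minimal sketch, not a production design: the phrase list, the `Verdict` type, and the `guarded_generate` function are hypothetical names introduced for illustration, and the input and output classifiers are passed in as plain callables so that a real model such as Llama Guard could be wrapped behind them. Behavior monitoring is not shown; the `blocked_by` field is the kind of signal such monitoring could aggregate per user.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical examples of known jailbreak phrases; a real list would be
# larger and updated as new attacks are observed.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"\byou are (now )?DAN\b", re.IGNORECASE),
    re.compile(r"do anything now", re.IGNORECASE),
]

# A classifier returns True when it considers the text unsafe.
Classifier = Callable[[str], bool]


@dataclass
class Verdict:
    allowed: bool
    blocked_by: Optional[str] = None  # name of the layer that blocked, if any
    response: Optional[str] = None    # model output, only set when allowed


def matches_known_jailbreak(prompt: str) -> bool:
    """Pattern-based pre-filter: cheap regex check against known phrases."""
    return any(p.search(prompt) for p in JAILBREAK_PATTERNS)


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    input_is_unsafe: Classifier,
    output_is_unsafe: Classifier,
) -> Verdict:
    """Run the prompt through each layer in order; stop at the first block."""
    if matches_known_jailbreak(prompt):
        return Verdict(False, "pattern_prefilter")
    if input_is_unsafe(prompt):
        return Verdict(False, "input_classifier")
    # Model-level safety lives inside `generate` (the vendor's model).
    response = generate(prompt)
    if output_is_unsafe(response):
        return Verdict(False, "output_classifier")
    return Verdict(True, None, response)
```

Passing the classifiers as callables keeps the pipeline testable with stubs and lets each layer be swapped independently. The pre-filter only catches phrasings it already knows, which is why it sits in front of the classifiers rather than replacing them.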
Continuously test with red-teaming.
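One way to make red-teaming continuous is to keep a corpus of attack prompts and replay it against the defense on every change, tracking how many get through. The sketch below assumes a hypothetical `is_blocked` callable that wraps whatever defense stack is deployed; `RedTeamReport` and `run_red_team_suite` are illustrative names, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class RedTeamReport:
    total: int = 0
    bypassed: List[str] = field(default_factory=list)  # prompts that got through

    @property
    def bypass_rate(self) -> float:
        """Fraction of attack prompts that were not blocked."""
        return len(self.bypassed) / self.total if self.total else 0.0


def run_red_team_suite(
    attack_prompts: Iterable[str],
    is_blocked: Callable[[str], bool],
) -> RedTeamReport:
    """Replay every attack prompt and record the ones the defense missed."""
    report = RedTeamReport()
    for prompt in attack_prompts:
        report.total += 1
        if not is_blocked(prompt):
            report.bypassed.append(prompt)
    return report
```

Run in CI or on a schedule, a rising `bypass_rate` or a newly bypassed prompt flags a regression, and each new jailbreak found in the wild can be added to the corpus.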