Back to full roadmap

topicadvanced

Jailbreak Defense

DAN, role-play, hypothetical-scenario, encoded attacks — bypassing the model's safety training.

3 hours1 resources1 prereqs

Layered defense:

Model-level safety (the vendor already provides)
Input classifier (Llama Guard, Granite Guardian)
Output classifier (toxicity, harmful instructions)
Pattern-based pre-filter (known jailbreak phrases)
Behavior monitoring (anomalous usage patterns)

Continuously test with red-teaming.

Prerequisites

Prompt Injection

User or 3rd-party content can override instructions and hijack the model.

Resources(1)

GGitHub(1)

Prompt Injection

PII Detection & Redaction

Open the full interactive roadmap