Jailbreak (LLM)
Techniques that bypass an LLM's safety guardrails to make it produce content it was trained to refuse, such as harmful instructions or offensive material.
Common Techniques
Role-playing prompts ('pretend you're an evil AI'), encoding tricks (base64, pig latin), many-shot prompting (overwhelming safety training with examples), multi-turn attacks (gradually escalating), and adversarial suffixes (optimized token sequences).
Defenses
Robust safety training, input/output classifiers, system prompt hardening, rate limiting, multi-model review chains, and red-teaming (proactively finding and patching vulnerabilities).