AI Glossary

Jailbreak (LLM)

Techniques that bypass an LLM's safety guardrails to make it produce content it was trained to refuse, such as harmful instructions or offensive material.

Common Techniques

Role-playing prompts ('pretend you're an evil AI'), encoding tricks (base64, Pig Latin), many-shot prompting (flooding the context with examples of harmful requests being answered, to override safety training), multi-turn attacks (escalating gradually over a conversation), and adversarial suffixes (machine-optimized token sequences appended to the prompt).

Defenses

Robust safety training, input/output classifiers, system prompt hardening, rate limiting, multi-model review chains, and red-teaming (proactively finding and patching vulnerabilities).
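To make the input/output classifier defense concrete, below is a minimal Python sketch of a wrapper that screens both the user's prompt and the model's reply before anything is returned. All names here (classify, guarded_generate, Verdict) are hypothetical, and the keyword check is only a stand-in so the sketch runs on its own; a real deployment would use a trained safety classifier or moderation service in its place.

```python
from dataclasses import dataclass

REFUSAL_MESSAGE = "Sorry, I can't help with that."


@dataclass
class Verdict:
    flagged: bool
    reason: str = ""


def classify(text: str) -> Verdict:
    # Placeholder for a real safety classifier (e.g. a small fine-tuned
    # model or a moderation API). A trivial keyword match keeps this
    # sketch self-contained.
    blocked_phrases = ["how to build a weapon"]
    for phrase in blocked_phrases:
        if phrase in text.lower():
            return Verdict(flagged=True, reason=f"matched '{phrase}'")
    return Verdict(flagged=False)


def guarded_generate(prompt: str, generate) -> str:
    """Call the main model only if both the prompt and its reply pass the classifier."""
    if classify(prompt).flagged:      # input classifier
        return REFUSAL_MESSAGE
    reply = generate(prompt)          # main LLM call, supplied by the caller
    if classify(reply).flagged:       # output classifier
        return REFUSAL_MESSAGE
    return reply


if __name__ == "__main__":
    # Stand-in for a real model call.
    echo_model = lambda p: f"Model reply to: {p}"
    print(guarded_generate("What's the capital of France?", echo_model))
    print(guarded_generate("Explain how to build a weapon", echo_model))
```

Layering the same check on both input and output matters because a jailbreak that slips past the input filter (for example, an encoded prompt) can still be caught when the decoded harmful content appears in the model's reply.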

Last updated: March 5, 2026