AI Glossary

Safety Filter

A mechanism that screens AI model inputs and outputs to prevent harmful content, typically using classifiers, rule-based systems, or secondary models.

Implementation

- Input filters: block or modify harmful prompts before they reach the model.
- Output filters: screen generated content before showing it to users.
- Classifier-based: trained models that detect harmful categories.
- LLM-based: use a secondary LLM to evaluate safety.
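The input/output filtering stages above can be sketched as follows. This is a minimal illustration, not a production design: the keyword list and the `classify` scoring rule are hypothetical stand-ins for a trained classifier, and all names are invented for this example.

```python
# Hypothetical stand-in for a trained harm classifier: a keyword list.
BLOCKED_TERMS = {"build a bomb", "credit card dump"}  # illustrative only

def classify(text: str) -> float:
    """Toy 'classifier': returns a harm score in [0, 1] based on keyword hits."""
    lowered = text.lower()
    hits = sum(term in lowered for term in BLOCKED_TERMS)
    return min(1.0, hits / 2)

def input_filter(prompt: str, threshold: float = 0.5) -> bool:
    """Input stage: return True if the prompt may be sent to the model."""
    return classify(prompt) < threshold

def output_filter(response: str, threshold: float = 0.5) -> str:
    """Output stage: screen generated content before the user sees it."""
    if classify(response) >= threshold:
        return "[response withheld by safety filter]"
    return response
```

In a real deployment, `classify` would be a trained classifier or a secondary LLM call, and the two stages would typically use different thresholds tuned to their respective false-positive costs.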

Challenges

- Balancing safety with utility (over-filtering makes the model useless).
- Adversarial bypass techniques.
- Context-dependent harm (e.g., medical discussions vs. harmful instructions).
- Cultural and linguistic variation in what constitutes harm.
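The safety/utility tradeoff in the first point can be made concrete with a toy threshold sweep. The scores below are synthetic, invented for illustration; they represent a hypothetical classifier's harm estimates for benign and harmful messages.

```python
# Synthetic harm scores (illustrative only, not real classifier output).
benign_scores = [0.05, 0.10, 0.30, 0.45, 0.20]   # messages that are actually safe
harmful_scores = [0.55, 0.70, 0.90, 0.40, 0.80]  # messages that are actually unsafe

def rates(threshold: float) -> tuple[float, float]:
    """Return (fraction of benign blocked, fraction of harmful blocked)."""
    over_filtered = sum(s >= threshold for s in benign_scores) / len(benign_scores)
    caught = sum(s >= threshold for s in harmful_scores) / len(harmful_scores)
    return over_filtered, caught

# A strict threshold catches all harm but blocks 40% of benign content;
# a lenient one blocks nothing benign but misses 40% of harmful content.
strict = rates(0.25)   # -> (0.4, 1.0)
lenient = rates(0.60)  # -> (0.0, 0.6)
```

Choosing the operating point on this curve is a policy decision, not a purely technical one, which is why thresholds often differ by deployment context.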


Last updated: March 5, 2026