Safety Filter
A mechanism that screens AI model inputs and outputs to prevent harmful content, typically using classifiers, rule-based systems, or secondary models.
Implementation
Input filters: block or modify harmful prompts before they reach the model.
Output filters: screen generated content before it is shown to users.
Classifier-based: trained models that detect harmful categories.
LLM-based: use a secondary LLM to evaluate safety.
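The approaches above can be sketched in a minimal pipeline. This is an illustrative toy, not a production filter: the blocklist pattern and the word-overlap "classifier" are stand-ins for a real rule set and a real trained model.

```python
import re

# Hypothetical blocklist for the rule-based input filter.
BLOCKED_PATTERNS = [
    re.compile(r"\bhow to build a bomb\b", re.IGNORECASE),
]

def input_filter(prompt: str) -> bool:
    """Return True if the prompt may proceed to the model."""
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)

def toy_classifier(text: str) -> float:
    """Stand-in for a trained harm classifier: harm score in [0, 1]."""
    harmful_terms = {"weapon", "poison"}
    words = set(text.lower().split())
    return len(words & harmful_terms) / max(len(words), 1)

def output_filter(generation: str, threshold: float = 0.1) -> str:
    """Screen generated text; withhold it if the harm score exceeds the threshold."""
    if toy_classifier(generation) > threshold:
        return "[content withheld by safety filter]"
    return generation
```

In practice the classifier would be a trained model (or a secondary LLM returning a safety verdict), and the threshold would be tuned against labeled data.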
Challenges
Balancing safety with utility: over-filtering makes the model useless.
Adversarial bypass techniques.
Context-dependent harm (e.g., medical discussions versus harmful instructions).
Cultural and linguistic variation in what constitutes harm.
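The over-filtering and context-dependence challenges can be seen concretely with a context-blind keyword filter. The blocklist below is hypothetical; the point is that matching on words alone flags a benign medical question and a harmful request alike.

```python
# Hypothetical single-word blocklist for a naive, context-blind filter.
BLOCKLIST = {"overdose"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt is blocked (any blocklisted word appears)."""
    return any(word in BLOCKLIST for word in prompt.lower().split())

harmful = "what overdose would hurt someone"
benign = "what counts as an overdose of ibuprofen and when should i see a doctor"
# Both prompts are blocked, though only one is a harmful request --
# distinguishing them requires modeling intent and context, not keywords.
```

This is why classifier- and LLM-based filters, which can weigh context, are commonly layered on top of or in place of simple rules.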