Safety Evaluation
Systematic testing of AI models for harmful outputs, vulnerabilities, and alignment with safety requirements.
Overview
Safety evaluation is the systematic process of testing AI models for potential harms before and after deployment. It includes red-teaming (adversarial testing), benchmark evaluation (TruthfulQA, BBQ, toxicity tests), automated testing pipelines, and human evaluation of model outputs across sensitive domains.
Framework
Modern safety evaluation covers: Content safety: Harmful, illegal, or inappropriate outputs. Fairness: Bias across demographics. Robustness: Behavior under adversarial inputs. Privacy: Data leakage and memorization. Security: Prompt injection and jailbreak resistance. Organizations like NIST, UK AISI, and MLCommons develop standardized safety evaluation frameworks.