Model Evaluation
The systematic process of assessing an AI model's performance using metrics, test sets, and human judgment to determine if it meets quality and safety standards.
Automated Metrics
Classification: accuracy, F1, AUC-ROC. Generation: BLEU, ROUGE, perplexity. Code: pass@k, functional correctness. Comprehensive benchmarks: MMLU, HumanEval, GSM8K, ARC.
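Of the code metrics listed, pass@k has a standard unbiased estimator (1 − C(n−c, k)/C(n, k), where n samples are generated per problem and c pass the tests). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples is correct, given n generations of which c pass the tests.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 generations per problem, 30 of which pass.
print(pass_at_k(200, 30, 1))   # 0.15 (equals the raw pass rate c/n)
```

Note that pass@1 reduces to the plain pass rate c/n; the estimator matters for k > 1, where naively subsampling k generations would give a biased, high-variance estimate.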
Human Evaluation
Side-by-side comparisons (which response is better?). Likert-scale ratings on quality dimensions. Red-teaming for safety issues. Domain-expert review for specialized applications. Chatbot Arena (Elo ratings derived from blind pairwise user preferences).
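The arena-style rating mentioned above can be sketched with the classic Elo update rule (a simplification; production leaderboards use more sophisticated statistical models, but the idea is the same: blind pairwise votes move scores toward observed win rates). The K-factor of 32 here is an illustrative choice:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a blind head-to-head comparison.
    score_a is 1.0 if model A was preferred, 0.0 if B, 0.5 for a tie."""
    # Expected win probability for A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two equally rated models; A wins the blind comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, so the scores converge toward the true preference ordering over many votes.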
Best Practices
Use held-out test sets never seen during training. Evaluate across diverse demographics and use cases. Monitor for data contamination (test data leaking into the training set). Track multiple metrics, since no single number captures full performance. Continuously evaluate models after deployment.
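One common (if crude) contamination check is n-gram overlap between test examples and the training corpus. A minimal sketch, with hypothetical inputs; the n-gram length of 8 is an illustrative default, and real pipelines normalize text and scale this with inverted indexes or Bloom filters rather than an in-memory set:

```python
def ngrams(text: str, n: int) -> set:
    """Set of word-level n-grams, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_examples, training_text, n: int = 8) -> float:
    """Fraction of test examples sharing at least one n-gram with the
    training text -- a rough proxy for train/test leakage."""
    train_grams = ngrams(training_text, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples)
```

Flagged examples are typically either removed from the test set or reported separately, so that headline scores are not inflated by memorization.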