AI Glossary

Model Evaluation

The systematic process of assessing an AI model's performance using metrics, test sets, and human judgment to determine if it meets quality and safety standards.

Automated Metrics

Classification: accuracy, F1, AUC-ROC. Generation: BLEU, ROUGE, perplexity. Code: pass@k, functional correctness. Benchmark suites: MMLU, HumanEval, GSM8K, ARC.
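The pass@k metric above is commonly computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k samples would pass. A minimal sketch (function name is illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c of which pass
    the unit tests), is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 5 of 20 generations pass, so pass@1 = 5/20 = 0.25
print(pass_at_k(n=20, c=5, k=1))
```

Averaging this estimate over all problems in the benchmark gives the reported pass@k score.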

Human Evaluation

Side-by-side comparisons (which response is better?). Likert-scale ratings on quality dimensions. Red-teaming for safety issues. Domain-expert review for specialized applications. Chatbot Arena (Elo ratings from blind user preferences).
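The Elo ratings behind arena-style leaderboards come from the standard chess update rule applied to each blind pairwise vote: compute each model's expected score from the rating gap, then move ratings toward the observed outcome. A minimal sketch (the K-factor of 32 is a conventional default, not a fixed standard):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a head-to-head comparison.
    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    # Expected score for A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equal-rated models: a win moves the winner up by k/2 = 16 points.
a, b = elo_update(1000.0, 1000.0, score_a=1.0)
print(a, b)  # 1016.0 984.0
```

Replaying thousands of such votes yields a ranking; upsets against higher-rated models move ratings more than expected wins.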

Best Practices

Use held-out test sets never seen during training. Evaluate across diverse demographics and use cases. Monitor for data contamination (test data in training set). Track multiple metrics — no single number captures full performance. Continuously evaluate deployed models.
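One common way to monitor for data contamination is word-level n-gram overlap between test examples and the training corpus. A minimal sketch, assuming simple whitespace tokenization; the function names and the n-gram size are illustrative, and production checks typically use longer n-grams and normalized text:

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercased word-level n-grams for overlap checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(test_examples: list, training_corpus: str, n: int = 8) -> float:
    """Fraction of test examples sharing at least one n-gram with the
    training corpus. A nonzero rate warrants manual investigation."""
    train_grams = ngrams(training_corpus, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples) if test_examples else 0.0
```

Flagged examples are then either removed from the test set or reported separately, so headline scores reflect genuinely unseen data.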


Last updated: March 5, 2026