Evaluation Metric
A quantitative measure used to assess how well a machine learning model performs on a given task, guiding model selection and improvement.
Common Metrics
Classification: accuracy, precision, recall, F1, AUC-ROC.
Regression: MSE, MAE, R-squared.
NLP: BLEU, ROUGE, BERTScore, perplexity.
LLMs: human preference (Elo), MMLU, HumanEval, task-specific benchmarks.
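As a concrete illustration of the regression metrics listed above, the following minimal sketch computes MSE, MAE, and R-squared from their standard definitions in plain Python. The function name `regression_metrics` and the sample values are hypothetical and chosen only for illustration; in practice a library implementation would normally be used.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R-squared for two equal-length sequences.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean squared error: average of squared residuals.
    mse = sum(e * e for e in errors) / n
    # Mean absolute error: average of absolute residuals.
    mae = sum(abs(e) for e in errors) / n

    # R-squared: 1 - (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R-squared is undefined when the targets have zero variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else math.nan

    return {"mse": mse, "mae": mae, "r2": r2}


# Made-up sample data, not taken from any real experiment.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MSE penalizes large errors more heavily than MAE because residuals are squared, while R-squared expresses the error relative to the variance of the targets.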
Choosing the Right Metric
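The metric must align with the business objective. Accuracy is misleading for imbalanced datasets: a model that always predicts the majority class can score high accuracy while never detecting the minority class. BLEU measures n-gram overlap with reference text, so it is a weak proxy for fluency and meaning and can penalize valid paraphrases. No single metric captures everything; use multiple metrics and human evaluation for a complete picture.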
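The point about accuracy on imbalanced data can be sketched with a small example. The code below assumes a made-up dataset with 95 negative and 5 positive labels and a trivial "model" that always predicts the negative class. The helper name `summarize_binary` is hypothetical, and precision, recall, and F1 are reported as 0.0 when their denominators are zero (one common convention, assumed here).

```python
def summarize_binary(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels.

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up imbalanced dataset: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial baseline that always predicts the majority (negative) class.
always_negative = [0] * 100

report = summarize_binary(y_true, always_negative)
print(report)
```

Here accuracy is 95% even though the baseline finds none of the positive cases, so recall and F1 are both zero. Reporting several metrics side by side makes this kind of failure visible.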