AI Glossary

Evaluation Metric

A quantitative measure used to assess how well a machine learning model performs on a given task, guiding model selection and improvement.

Common Metrics

Classification: accuracy, precision, recall, F1, AUC-ROC.

Regression: MSE, MAE, R-squared.

NLP: BLEU, ROUGE, BERTScore, perplexity.

LLMs: human preference (Elo), MMLU, HumanEval, task-specific benchmarks.
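To make the classification and regression metrics listed above concrete, here is a minimal sketch of how they can be computed from labels and predictions. The function names (`classification_metrics`, `regression_metrics`) are illustrative, not from any library, and the sketch assumes binary labels where 1 marks the positive class. AUC-ROC is omitted because it needs predicted scores rather than hard labels.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (an assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Mean squared error, mean absolute error and R-squared."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    # R-squared is undefined for constant targets; return 0.0 in that case (assumed convention).
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return {"mse": mse, "mae": mae, "r2": r2}


if __name__ == "__main__":
    print(classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0]))
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
```

In practice a maintained library such as scikit-learn is usually a better choice than hand-rolled functions. The point here is only to show what each number measures.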

Choosing the Right Metric

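The metric must align with the business objective. Accuracy is misleading for imbalanced datasets: a model that always predicts the majority class can score high while missing every minority-class example. BLEU measures n-gram overlap with reference texts, so it is a poor proxy for fluency and meaning and can penalize valid paraphrases. No single metric captures everything; use multiple metrics together with human evaluation for a complete picture.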
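The imbalance caveat is easy to see with a small synthetic example. The sketch below assumes a made-up dataset of 1,000 examples with 10 positives and a hypothetical baseline that always predicts the negative class. The data and function names are illustrative only.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true positives (label 1) that the model actually found."""
    positives = sum(1 for t in y_true if t == 1)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return found / positives if positives else 0.0


# Synthetic, illustrative data: 10 positives among 1,000 examples.
y_true = [1] * 10 + [0] * 990
# Hypothetical baseline that ignores its input and always predicts the majority class.
y_pred = [0] * 1000

if __name__ == "__main__":
    print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # high, despite a useless model
    print(f"recall   = {recall(y_true, y_pred):.2f}")    # no positive is ever detected
```

Here accuracy comes out at 0.99 while recall is 0.00, which is why imbalanced tasks are usually reported with precision, recall, or F1 alongside accuracy.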
