Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models on specific tasks.
Why Benchmarks Matter
Benchmarks provide a common yardstick for measuring progress. They enable fair comparison between models and track the state of the art over time. Without benchmarks, claims about model capability would be difficult to verify.
Key AI Benchmarks
MMLU (Massive Multitask Language Understanding): knowledge questions across 57 subjects.
HumanEval: code generation from function docstrings, scored by unit tests.
GPQA: graduate-level science questions.
ImageNet: image classification across 1,000 classes.
SWE-bench: real-world software engineering tasks drawn from GitHub issues.
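Code-generation benchmarks like HumanEval are commonly scored with pass@k: the probability that at least one of k sampled solutions passes the unit tests. A minimal sketch of the standard unbiased estimator, where n solutions are sampled per problem and c of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total solutions sampled, c: how many passed the tests,
    k: number of samples the metric imagines drawing.
    """
    # If fewer than k samples failed, any draw of k must include a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem estimates are then averaged over the benchmark to get the reported score; sampling more than k solutions per problem (n > k) reduces the variance of the estimate.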
Benchmark Saturation
When models approach or exceed human performance on a benchmark, it loses its discriminating power: score differences between top models shrink below measurement noise. The AI field continuously creates harder benchmarks as models improve. Benchmark gaming, whether by overfitting to test sets or by test data leaking into training corpora (contamination), is an ongoing concern.
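The loss of discriminating power near saturation can be made concrete with the binomial standard error of an accuracy estimate. A minimal sketch (the benchmark size and scores below are illustrative, not from any real leaderboard):

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n test items,
    treating each item as an independent Bernoulli trial."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical: two models scoring 97.0% and 97.5% on a 1,000-item benchmark.
se = accuracy_se(0.97, 1000)  # about 0.0054
gap = 0.975 - 0.97            # 0.005, smaller than one standard error
```

Here the half-point gap between the two models is within one standard error of a single measurement, so the benchmark can no longer reliably rank them; this is why saturated benchmarks get retired in favor of harder ones.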