Benchmark
A standardized test or dataset used to evaluate and compare the performance of AI models on specific tasks.
Why Benchmarks Matter
Benchmarks provide a common yardstick for measuring progress. They enable fair comparison between models and track the state of the art over time. Without benchmarks, claims about model capability would be difficult to verify.
Key AI Benchmarks
MMLU (Massive Multitask Language Understanding): knowledge questions across 57 subjects.
HumanEval: code generation from function docstrings, scored by unit tests.
GPQA: graduate-level science questions.
ImageNet: image classification across 1,000 classes.
SWE-bench: real-world software engineering tasks drawn from GitHub issues.
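Code-generation benchmarks like HumanEval are commonly scored with pass@k: the probability that at least one of k sampled solutions passes the unit tests. A minimal sketch of the standard unbiased estimator, where n solutions are sampled per problem and c of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total solutions sampled, c: how many passed the tests,
    k: number of samples the metric imagines drawing.
    """
    # If fewer than k samples failed, any draw of k must include a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem estimates are then averaged over the benchmark to get the reported score; sampling more than k solutions per problem (n > k) reduces the variance of the estimate.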
Benchmark Saturation
When models approach or exceed human performance on a benchmark, it loses its discriminating power: score differences between top models shrink below measurement noise. The AI field continuously creates harder benchmarks as models improve. Benchmark gaming, whether by overfitting to test sets or by test data leaking into training corpora (contamination), is an ongoing concern.
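The loss of discriminating power near saturation can be made concrete with the binomial standard error of an accuracy estimate. A minimal sketch (the benchmark size and scores below are illustrative, not from any real leaderboard):

```python
import math

def accuracy_se(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n test items,
    treating each item as an independent Bernoulli trial."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical: two models scoring 97.0% and 97.5% on a 1,000-item benchmark.
se = accuracy_se(0.97, 1000)  # about 0.0054
gap = 0.975 - 0.97            # 0.005, smaller than one standard error
```

Here the half-point gap between the two models is within one standard error of a single measurement, so the benchmark can no longer reliably rank them; this is why saturated benchmarks get retired in favor of harder ones.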