Every time a new large language model is released, it comes with a dazzling array of benchmark scores. GPT-4 scores 86.4% on MMLU. Claude achieves top marks on HumanEval. Gemini leads on MATH. But what do these numbers actually mean? How should you interpret them, and what are their limitations? This guide breaks down the most important LLM benchmarks and explains how the AI community measures -- and sometimes misjudges -- intelligence.
Why Benchmarks Matter (and Why They Fail)
Benchmarks serve as the common language for comparing language models. Without standardized tests, claims about model capabilities would be impossible to verify. Benchmarks provide reproducible, quantitative measures that allow researchers and practitioners to track progress and make informed decisions about which model to use.
However, benchmarks are inherently reductive. They compress the vast space of language understanding into a single number, inevitably missing important dimensions of capability. A model might score well on multiple-choice knowledge questions while failing at practical tasks like following complex instructions or maintaining consistency in long conversations.
The Major Benchmarks Explained
MMLU (Massive Multitask Language Understanding)
MMLU is perhaps the most widely cited LLM benchmark. It consists of 57 subjects ranging from elementary math to professional law, medicine, and philosophy, with approximately 14,000 multiple-choice questions. The test spans a broad swath of human knowledge at difficulty levels from elementary school up to professional expertise.
MMLU has become the default headline number for LLM capabilities, but it has significant limitations. The multiple-choice format does not test the ability to generate, explain, or reason through problems. And because the test questions are publicly available, there are concerns about benchmark contamination -- models may have seen the test questions during training.
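One detail worth knowing when reading MMLU numbers: scores are commonly reported as the mean of per-subject accuracies rather than a single pooled accuracy, so a 100-question subject counts as much as a 1,500-question one. A minimal sketch of that aggregation, using hypothetical results:

```python
# Sketch of MMLU-style scoring (hypothetical toy data).
# Scores are typically the macro-average: the mean of per-subject
# accuracies, so small subjects weigh as much as large ones.

def subject_accuracy(results):
    """results: list of (predicted_choice, correct_choice) pairs."""
    return sum(pred == gold for pred, gold in results) / len(results)

def mmlu_score(per_subject):
    """per_subject: dict mapping subject name -> list of (pred, gold)."""
    accuracies = [subject_accuracy(r) for r in per_subject.values()]
    return sum(accuracies) / len(accuracies)

# Two hypothetical subjects:
results = {
    "high_school_math": [("A", "A"), ("B", "C")],  # 50% accuracy
    "professional_law": [("D", "D"), ("C", "C")],  # 100% accuracy
}
print(mmlu_score(results))  # macro-average: 0.75
```

Note that pooling all answers together would give 3/4 here as well, but the two aggregations diverge as soon as subjects differ in size.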
HumanEval and MBPP
HumanEval, created by OpenAI, tests code generation ability. It presents 164 Python programming problems and evaluates whether the model can generate correct, runnable solutions. MBPP (Mostly Basic Python Problems) provides a larger set of 974 simpler programming tasks. These benchmarks measure a practical, verifiable capability: can the model write code that actually works?
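HumanEval results are conventionally reported as pass@k: the probability that at least one of k generated samples passes the problem's unit tests. The unbiased estimator from the original HumanEval (Codex) paper can be computed directly:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that passed the tests
    k: evaluation budget
    Returns the probability that at least one of k randomly
    drawn samples (out of the n generated) passes.
    """
    if n - c < k:
        return 1.0  # fewer failures than the budget: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 5 of 20 samples passing, pass@1 is simply c/n:
print(pass_at_k(n=20, c=5, k=1))  # 0.25
```

Scores for a full benchmark run are then averaged over all 164 problems.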
HellaSwag
HellaSwag tests commonsense reasoning by asking models to complete sentences in a way that makes physical and social sense. While it was challenging for models when introduced in 2019, most modern LLMs now achieve near-perfect scores, making it less useful for distinguishing between current frontier models.
GSM8K and MATH
These benchmarks focus on mathematical reasoning. GSM8K contains 8,500 grade school math word problems, while MATH presents 12,500 more challenging problems from mathematics competitions. These benchmarks test not just mathematical knowledge but the ability to reason through multi-step problems.
Arena Elo (Chatbot Arena)
Unlike automated benchmarks, Chatbot Arena uses human evaluation. Users interact with two anonymous models side by side and vote on which response they prefer. This produces an Elo-style rating similar to chess rankings. Arena ratings are widely considered one of the most reliable indicators of real-world model quality, since they capture dimensions of human preference that automated tests miss -- though they measure what raters prefer, which is not always what is most accurate.
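The Elo mechanism itself is simple. Below is a minimal sketch of the classic per-match update (production leaderboards typically fit a statistical model such as Bradley-Terry over all votes rather than applying sequential updates, but the intuition is the same):

```python
# Minimal sketch of a standard Elo update from one pairwise vote.

def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Return updated ratings after one comparison; k caps the step size."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two equally rated models: the winner gains exactly k/2 points.
print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```

The key property is that upsets move ratings more than expected wins: beating a much higher-rated model yields a large gain, while beating a much lower-rated one yields almost nothing.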
Key Takeaway
No single benchmark captures the full capabilities of an LLM. The most informative evaluation combines automated benchmarks across multiple domains with human evaluation in realistic settings.
The Benchmark Contamination Problem
One of the most serious issues in LLM evaluation is benchmark contamination: when test questions appear in the model's training data. Because LLMs are trained on vast internet corpora, and benchmark datasets are often publicly available, models may have memorized answers rather than learned to reason about them.
This problem is difficult to detect and even harder to prevent. Some approaches include:
- Dynamic benchmarks: Creating new test questions that could not have appeared in training data.
- Private test sets: Keeping a portion of the benchmark hidden from the public.
- Contamination analysis: Checking whether models perform suspiciously well on specific questions that appear verbatim in training data.
- Canary strings: Embedding unique identifiers in benchmark data to detect if models have seen it.
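Contamination analysis is often implemented as an n-gram-overlap check between benchmark items and the training corpus; the GPT-3 paper, for instance, used 13-gram matching. A simplified sketch, using a tiny n so the toy example is readable:

```python
# Simplified n-gram-overlap contamination check. Real analyses
# typically use longer windows (e.g. 13-grams in the GPT-3 paper)
# and normalized tokenization; this sketch uses plain whitespace splits.

def ngrams(text, n=13):
    """Return the set of word n-grams in a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question, corpus_ngrams, n=13):
    """True if the question shares any n-gram with the training corpus."""
    return bool(ngrams(question, n) & corpus_ngrams)

# Toy example with 3-grams:
corpus = ngrams("the quick brown fox jumps over the lazy dog", n=3)
print(is_contaminated("quick brown fox seen here", corpus, n=3))   # True
print(is_contaminated("a completely novel question", corpus, n=3))  # False
```

In practice the corpus side is far too large to hold in a set, so production checks use Bloom filters, suffix arrays, or similar index structures over the training data.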
Emerging Evaluation Approaches
The limitations of existing benchmarks have spurred development of new evaluation methods that better capture real-world model capabilities.
GPQA (Graduate-Level Google-Proof QA)
GPQA consists of questions written by domain experts that are specifically designed to be difficult even for highly educated non-experts with access to Google. Because the questions are newly written and cannot be answered by simple lookup, the benchmark is harder to contaminate and tests genuine expert-level reasoning.
SWE-bench
SWE-bench evaluates models on real software engineering tasks: resolving actual GitHub issues from popular open-source projects. This tests practical coding ability in a realistic setting, far beyond the toy problems in HumanEval.
LMSYS Chatbot Arena
The human-evaluation approach of Chatbot Arena has become increasingly influential. Its Elo ratings correlate well with user satisfaction in real deployments and capture dimensions of quality that automated benchmarks miss, including tone, helpfulness, and conversational ability.
How to Use Benchmarks Wisely
When evaluating LLMs for your own use case, keep these principles in mind:
- Look at task-specific benchmarks: If you need a model for coding, prioritize HumanEval and SWE-bench over MMLU. If you need mathematical reasoning, focus on GSM8K and MATH.
- Consider multiple benchmarks: A model that excels on one benchmark but underperforms on others may have been specifically optimized for that test.
- Value human evaluation: Arena Elo and user studies provide the most reliable signal about real-world performance.
- Run your own evaluations: The best benchmark for your use case is one built from your actual data and tasks. Generic benchmarks can guide initial model selection, but custom evaluation is essential for production deployment.
- Watch for contamination: Be skeptical of models that show large jumps on specific benchmarks without corresponding improvements elsewhere.
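A custom evaluation need not be elaborate: a loop over your own test cases with a task-specific grading function already beats reading leaderboards. A hypothetical minimal harness, where `call_model` stands in for whatever API or local model you actually use:

```python
# Minimal custom-eval harness. `call_model` is a hypothetical
# placeholder -- swap in your real API client or local model.

def call_model(prompt):
    """Placeholder model: always answers '42'. Replace with a real call."""
    return "42"

def run_eval(cases, grade):
    """cases: list of (prompt, expected); grade: task-specific checker.
    Returns the fraction of cases the model passes."""
    passed = sum(grade(call_model(prompt), expected) for prompt, expected in cases)
    return passed / len(cases)

# Exact-match grading on tasks drawn from your own data:
cases = [("What is 6 * 7?", "42"), ("Capital of France?", "Paris")]
accuracy = run_eval(cases, grade=lambda out, exp: out.strip() == exp)
print(f"accuracy: {accuracy:.2f}")  # 0.50 with this placeholder model
```

The grading function is where the real work lives: exact match for short answers, running unit tests for code, or an LLM judge for open-ended responses, depending on the task.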
Key Takeaway
Benchmarks are useful tools but imperfect measures. The best practice is to treat them as starting points for model selection, then validate with custom evaluations tailored to your specific use case.
The Future of LLM Evaluation
The field is moving toward more sophisticated evaluation that goes beyond simple accuracy scores. Future benchmarks will likely emphasize multi-step reasoning, long-context understanding, tool use, and the ability to interact effectively in complex, real-world scenarios. The goal is evaluation that truly measures whether models can be useful and reliable partners in the diverse range of tasks we need them for -- not just whether they can pass a multiple-choice exam.
