Perplexity
A metric for evaluating language models that measures how 'surprised' the model is by test data. Lower perplexity means the model predicts the text better.
The Math
Perplexity is the exponentiated average negative log-likelihood per token: PPL = exp(-(1/N) * sum_i log p(x_i | x_<i)), where N is the number of tokens and p(x_i | x_<i) is the probability the model assigns to token x_i given the preceding tokens. A perplexity of 10 means the model is, on average, as uncertain as choosing uniformly among 10 options for the next token.
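A minimal sketch of this computation in Python, assuming you already have the model's per-token log-probabilities (the function name and example values here are illustrative, not from any particular library):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    Computes exp of the average negative log-likelihood,
    matching the definition above.
    """
    n = len(token_logprobs)
    avg_nll = -sum(token_logprobs) / n
    return math.exp(avg_nll)

# A model that assigns probability 1/10 to each of four observed
# tokens has perplexity 10, matching the "uniform among 10 options"
# intuition:
logprobs = [math.log(0.1)] * 4
print(round(perplexity(logprobs), 6))  # 10.0
```

Note that the logs must all use the same base as the exponentiation; natural log with math.exp is the usual pairing.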
Interpretation
Lower is better. A perplexity of 1 means perfect prediction: the model assigns probability 1 to every observed token. State-of-the-art LLMs achieve perplexities in the single digits on standard benchmarks. Perplexity is the standard intrinsic evaluation metric for language models.
Limitations
Perplexity doesn't directly measure usefulness, safety, or instruction-following ability. A model with low perplexity might still generate harmful or unhelpful content. Perplexity values are also only comparable between models that share the same tokenizer and vocabulary, since the metric is computed per token. That's why modern LLM evaluation also uses human preference ratings and task-specific benchmarks.