AI Glossary

BLEU Score

Bilingual Evaluation Understudy -- a metric for evaluating machine translation quality by comparing generated text against reference translations using n-gram overlap.

How It Works

BLEU measures the precision of n-grams (contiguous sequences of 1 to 4 words) in the generated text against one or more reference texts, clipping each n-gram's count by its count in the reference. A brevity penalty discourages overly short translations. Scores range from 0 to 1 (often reported on a 0-100 scale).
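The computation above can be sketched in a few lines of Python. This is a minimal single-reference, sentence-level illustration (real implementations such as sacrebleu operate on corpora and add smoothing); the function and variable names are chosen for this sketch.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU for one candidate against one reference.

    Minimal, unsmoothed sketch: clipped n-gram precisions (n = 1..max_n),
    combined by geometric mean, times a brevity penalty.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate too short to form any n-gram
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / total)

    if min(precisions) == 0:
        return 0.0  # geometric mean is zero if any precision is zero

    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)

    # Brevity penalty: 1 unless the candidate is shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean
```

A perfect match scores 1.0; for example, `bleu("the cat sat on the mat".split(), "the cat sat on the mat".split())` returns 1.0, and any missing or extra words pull the score down.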

Limitations

BLEU correlates poorly with human judgments on many tasks beyond translation. It ignores semantic meaning, word order beyond the n-gram window, and fluency; a paraphrase that preserves the meaning but shares few words with the reference scores badly. Modern alternatives include BERTScore, COMET, and human evaluation.


Last updated: March 5, 2026