AI Glossary

Masked Language Modeling

A pre-training objective where random tokens in the input are replaced with a [MASK] token, and the model must predict the original tokens from context.

How BERT Uses It

During pre-training, 15% of input tokens are randomly selected. Of those, 80% are replaced with [MASK], 10% with a random token, and 10% are kept unchanged. The model learns to predict the original token using bidirectional context.
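The 80/10/10 selection procedure above can be sketched in plain Python. This is a minimal illustration, not BERT's actual tokenizer pipeline; the function name `mask_tokens` and the toy vocabulary are assumptions made for the example.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style corruption: select ~mask_prob of positions,
    then apply the 80% [MASK] / 10% random / 10% unchanged split."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = not selected, no loss computed here
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                        # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK                # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)   # 10%: replace with a random token
            # remaining 10%: keep the token unchanged
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged means the model cannot assume that every non-`[MASK]` token is correct, which encourages it to maintain useful representations for all positions.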

Why It Works

By masking tokens, the model is forced to build deep bidirectional representations: to recover each masked token, it must use context from both the left and the right. This produces embeddings that capture rich semantic information, which is why BERT-style models transfer well to tasks like classification and question answering.
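A key detail implied above is that the training loss is computed only at the masked positions; unselected positions contribute nothing. A minimal NumPy sketch of that, using the common convention of an ignore index of -100 for unselected positions (the function name `mlm_loss` and the toy shapes are assumptions for illustration):

```python
import numpy as np

def mlm_loss(logits, labels, ignore_index=-100):
    """Cross-entropy averaged over masked positions only.

    logits: (seq_len, vocab_size) array of raw scores.
    labels: length-seq_len list; ignore_index marks positions
            that were not selected for masking."""
    labels = np.asarray(labels)
    selected = labels != ignore_index
    # numerically stable log-softmax over the vocabulary dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # pick the log-probability of the correct token at each masked position
    picked = log_probs[selected, labels[selected]]
    return -picked.mean()
```

With uniform (all-zero) logits over a vocabulary of size V, the loss is exactly ln(V), a useful sanity check when implementing this from scratch.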

Comparison to Autoregressive

Masked LM (BERT-style) sees the full context but can only fill in blanks. Autoregressive LM (GPT-style) generates one token at a time but can produce arbitrary text. Each approach has strengths for different tasks.
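The difference between the two objectives shows up concretely in the attention mask: a masked LM lets every position attend to every other position, while an autoregressive LM restricts each position to itself and earlier positions. A small sketch, assuming NumPy (the function name `attention_mask` is made up for this example):

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Boolean matrix: entry [i, j] is True if position i may attend to j."""
    if causal:
        # GPT-style: lower-triangular, so position i sees only j <= i
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # BERT-style: full bidirectional attention
    return np.ones((seq_len, seq_len), dtype=bool)
```

The causal mask is what makes left-to-right generation possible (each token's prediction never depends on future tokens), while the full mask is what gives masked LMs their bidirectional context.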


Last updated: March 5, 2026