AI Glossary

Masked Language Modeling

A pre-training objective where random tokens in the input are replaced with a [MASK] token, and the model must predict the original tokens from context.

How BERT Uses It

During pre-training, 15% of input tokens are randomly selected. Of those, 80% are replaced with [MASK], 10% with a random token, and 10% are kept unchanged. The model learns to predict the original token using bidirectional context.
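The 80/10/10 selection procedure above can be sketched in plain Python. This is a minimal illustration, not BERT's actual tokenizer pipeline; the function name `mask_tokens` and the toy vocabulary are assumptions made for the example.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style corruption: select ~mask_prob of positions,
    then apply the 80% [MASK] / 10% random / 10% unchanged split."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None = not selected, no loss computed here
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                        # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK                # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)   # 10%: replace with a random token
            # remaining 10%: keep the token unchanged
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged means the model cannot assume that every non-`[MASK]` token is correct, which encourages it to maintain useful representations for all positions.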

Why It Works

By masking tokens, the model is forced to build deep bidirectional representations: to recover each masked token, it must use context from both the left and the right. This produces embeddings that capture rich semantic information, which is why BERT-style models transfer well to tasks like classification and question answering.
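A key detail implied above is that the training loss is computed only at the masked positions; unselected positions contribute nothing. A minimal NumPy sketch of that, using the common convention of an ignore index of -100 for unselected positions (the function name `mlm_loss` and the toy shapes are assumptions for illustration):

```python
import numpy as np

def mlm_loss(logits, labels, ignore_index=-100):
    """Cross-entropy averaged over masked positions only.

    logits: (seq_len, vocab_size) array of raw scores.
    labels: length-seq_len list; ignore_index marks positions
            that were not selected for masking."""
    labels = np.asarray(labels)
    selected = labels != ignore_index
    # numerically stable log-softmax over the vocabulary dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # pick the log-probability of the correct token at each masked position
    picked = log_probs[selected, labels[selected]]
    return -picked.mean()
```

With uniform (all-zero) logits over a vocabulary of size V, the loss is exactly ln(V), a useful sanity check when implementing this from scratch.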

Comparison to Autoregressive

Masked LM (BERT-style) sees the full context but can only fill in blanks. Autoregressive LM (GPT-style) generates one token at a time but can produce arbitrary text. Each approach has strengths for different tasks.
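The difference between the two objectives shows up concretely in the attention mask: a masked LM lets every position attend to every other position, while an autoregressive LM restricts each position to itself and earlier positions. A small sketch, assuming NumPy (the function name `attention_mask` is made up for this example):

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Boolean matrix: entry [i, j] is True if position i may attend to j."""
    if causal:
        # GPT-style: lower-triangular, so position i sees only j <= i
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # BERT-style: full bidirectional attention
    return np.ones((seq_len, seq_len), dtype=bool)
```

The causal mask is what makes left-to-right generation possible (each token's prediction never depends on future tokens), while the full mask is what gives masked LMs their bidirectional context.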


Last updated: March 5, 2026