Layer Normalization
A normalization technique that standardizes activations across features within a single example, widely used in transformers as an alternative to batch normalization.
How It Works
For each example independently, compute the mean and variance across all features in a layer, normalize to zero mean and unit variance (adding a small epsilon to the variance for numerical stability), then apply learnable per-feature scale (gamma) and shift (beta) parameters.
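The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the names `layer_norm`, `gamma`, and `beta` are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example (row) across its own features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)    # per-example mean over features
    var = x.var(axis=-1, keepdims=True)    # per-example variance over features
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

# One example with 4 features
x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma = np.ones(4)   # initialized to identity scale
beta = np.zeros(4)   # initialized to zero shift
y = layer_norm(x, gamma, beta)
```

With `gamma` at ones and `beta` at zeros, the output of each row has (approximately) zero mean and unit standard deviation; training then adjusts these parameters per feature.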
Why Transformers Use It
Unlike batch normalization, layer norm computes its statistics from a single example, so it doesn't depend on the rest of the batch. This makes it well suited to variable-length sequences and small batch sizes, both common in NLP. It also behaves identically at training and inference time, enabling efficient inference with batch size 1.
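The batch independence is easy to verify: an example produces the same normalized output whether it is processed alone or inside a larger batch. A small self-contained check (the helper `ln` is illustrative):

```python
import numpy as np

def ln(x, eps=1e-5):
    # Per-example statistics: each row is normalized using only its own values.
    mu = x.mean(axis=-1, keepdims=True)
    sd = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) / sd

rng = np.random.default_rng(0)
a = rng.normal(size=(1, 8))      # a "batch" containing one example
b = rng.normal(size=(7, 8))      # seven other examples
batched = np.vstack([a, b])      # the same example inside a batch of 8

# The first row's output is identical either way: no batch statistics involved.
print(np.allclose(ln(a)[0], ln(batched)[0]))  # True
```

Batch normalization fails this check, which is why it needs running statistics at inference time while layer norm does not.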
Placement
Pre-LayerNorm (normalizing before each attention/FFN sub-layer) is now standard in modern transformers, as it provides more stable training than post-LayerNorm (the original placement, which normalizes after the residual connection). RMSNorm simplifies LayerNorm by dropping the mean-centering step and the shift parameter, scaling activations by their root mean square alone.
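The RMSNorm variant described above can be sketched as follows; `rms_norm` and `gamma` are illustrative names, and the epsilon placement (inside the square root) follows common practice.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: divide by the root mean square; no mean subtraction, no shift."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

x = np.array([[3.0, -4.0]])  # RMS = sqrt((9 + 16) / 2) ≈ 3.5355
gamma = np.ones(2)
out = rms_norm(x, gamma)     # ≈ [[0.8485, -1.1314]]
```

Skipping the mean subtraction saves one reduction per call and, empirically, works about as well as full LayerNorm, which is why several recent large models use it.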