Layer Normalization
A normalization technique that standardizes activations across features within a single example, widely used in transformers as an alternative to batch normalization.
How It Works
For each example independently, compute the mean and variance across all features in a layer, normalize to zero mean and unit variance (adding a small epsilon to the variance for numerical stability), then apply learnable per-feature scale (gamma) and shift (beta) parameters.
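The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the names `layer_norm`, `gamma`, and `beta` are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example (row) across its own features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)    # per-example mean over features
    var = x.var(axis=-1, keepdims=True)    # per-example variance over features
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

# One example with 4 features
x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma = np.ones(4)   # initialized to identity scale
beta = np.zeros(4)   # initialized to zero shift
y = layer_norm(x, gamma, beta)
```

With `gamma` at ones and `beta` at zeros, the output of each row has (approximately) zero mean and unit standard deviation; training then adjusts these parameters per feature.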
Why Transformers Use It
Unlike batch normalization, layer norm computes its statistics from a single example, so it doesn't depend on the rest of the batch. This makes it well suited to variable-length sequences and small batch sizes, both common in NLP. It also behaves identically at training and inference time, enabling efficient inference with batch size 1.
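The batch independence is easy to verify: an example produces the same normalized output whether it is processed alone or inside a larger batch. A small self-contained check (the helper `ln` is illustrative):

```python
import numpy as np

def ln(x, eps=1e-5):
    # Per-example statistics: each row is normalized using only its own values.
    mu = x.mean(axis=-1, keepdims=True)
    sd = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) / sd

rng = np.random.default_rng(0)
a = rng.normal(size=(1, 8))      # a "batch" containing one example
b = rng.normal(size=(7, 8))      # seven other examples
batched = np.vstack([a, b])      # the same example inside a batch of 8

# The first row's output is identical either way: no batch statistics involved.
print(np.allclose(ln(a)[0], ln(batched)[0]))  # True
```

Batch normalization fails this check, which is why it needs running statistics at inference time while layer norm does not.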
Placement
Pre-LayerNorm (normalizing before each attention/FFN sub-layer) is now standard in modern transformers, as it provides more stable training than post-LayerNorm (the original placement, which normalizes after the residual connection). RMSNorm simplifies LayerNorm by dropping the mean-centering step and the shift parameter, scaling activations by their root mean square alone.
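The RMSNorm variant described above can be sketched as follows; `rms_norm` and `gamma` are illustrative names, and the epsilon placement (inside the square root) follows common practice.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: divide by the root mean square; no mean subtraction, no shift."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

x = np.array([[3.0, -4.0]])  # RMS = sqrt((9 + 16) / 2) ≈ 3.5355
gamma = np.ones(2)
out = rms_norm(x, gamma)     # ≈ [[0.8485, -1.1314]]
```

Skipping the mean subtraction saves one reduction per call and, empirically, works about as well as full LayerNorm, which is why several recent large models use it.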