Batch Normalization
A technique that normalizes layer inputs by adjusting and scaling activations across a mini-batch, stabilizing and accelerating neural network training.
How It Works
For each mini-batch, compute the mean and variance of activations, normalize to zero mean and unit variance, then apply learnable scale and shift parameters. This is done independently for each feature/channel. At inference time, per-batch statistics are replaced by running estimates of the mean and variance accumulated during training, so the output no longer depends on batch composition.
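The per-batch computation above can be sketched in NumPy; the function name, shapes, and the epsilon value are illustrative, and running statistics for inference are omitted for brevity:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature across the batch, then scale and shift.

    x: array of shape (batch, features)
    gamma, beta: learnable per-feature scale and shift, shape (features,)
    """
    mean = x.mean(axis=0)            # per-feature mean over the batch
    var = x.var(axis=0)              # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta      # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# with gamma=1, beta=0, each feature of y has ~zero mean and ~unit variance
```

Note that the normalization axis is the batch axis (axis=0): statistics are shared across samples but computed separately for each feature.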
Benefits
Enables higher learning rates (faster training), reduces sensitivity to weight initialization, acts as a mild regularizer (reducing overfitting), and helps mitigate the vanishing/exploding gradient problem.
Alternatives
Layer Normalization: Normalizes across features instead of across the batch, so each sample is normalized independently. Used in transformers because it works with variable batch sizes and sequence lengths. RMSNorm: A simplified layer norm that skips the mean-centering step and rescales by the root mean square alone; used in LLaMA and other modern LLMs.
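The difference between these alternatives and batch norm is mainly which axis the statistics are computed over. A minimal sketch of both (without the learnable scale parameters, which both typically also carry; epsilon values are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across features (last axis), independently per sample,
    # so the result does not depend on the batch at all.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: skip mean subtraction, divide by the root mean square.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms
```

Because both normalize along the feature axis, they behave identically at batch size 1, which is what makes them practical for autoregressive inference.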