Weight Decay
A regularization technique that shrinks model weights toward zero during training, either by adding a penalty proportional to the magnitude of the weights to the loss function or by decaying the weights directly, discouraging them from growing too large.
How It Works
Each weight update includes a term that shrinks weights toward zero: w = w - lr * (gradient + lambda * w), where lr is the learning rate and lambda is the decay coefficient. This discourages the model from relying too heavily on any single feature and reduces overfitting.
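The update rule above can be sketched directly; this is a minimal illustration (the function name `sgd_step` and the toy values are my own), showing that with zero gradient the decay term alone multiplies the weights by (1 - lr * lambda) each step:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD update with the weight-decay term folded into the gradient:
    w <- w - lr * (grad + weight_decay * w)."""
    return w - lr * (grad + weight_decay * w)

# With zero gradient, only the decay term acts: each step scales the
# weights by (1 - lr * weight_decay) = 0.999, pulling them toward zero.
w = np.array([1.0, -2.0])
for _ in range(3):
    w = sgd_step(w, np.zeros_like(w))
print(w)
```

After three steps the weights are scaled by 0.999**3, so large weights are steadily penalized unless the data gradient pushes back.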
L2 vs Decoupled
Traditional L2 regularization adds the penalty to the loss, so the decay term flows through the gradient computation. Decoupled weight decay (AdamW) applies the penalty directly to the weights, separate from the gradient. For plain SGD the two are equivalent, but for adaptive optimizers like Adam the L2 penalty gets rescaled by the adaptive per-parameter step sizes, so weights with a large gradient history are decayed less than intended. Decoupling avoids this, which is why it is preferred for Adam-based optimizers.
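The distinction can be seen by placing the two variants side by side. This is a simplified single-step sketch, not the full optimizer implementations (function names and hyperparameter defaults are my own); the only difference is where the `wd * w` term enters:

```python
import numpy as np

def adam_l2_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=0.01, t=1):
    """L2 regularization: the decay term is added to the gradient, so it
    is rescaled by Adam's adaptive denominator along with everything else."""
    g = grad + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01, t=1):
    """Decoupled (AdamW-style): the moment estimates see only the raw
    gradient; the decay is subtracted from the weights directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w, m, v

w0 = np.array([1.0])
g0 = np.array([0.5])
zero = np.zeros_like(w0)
w_l2, _, _ = adam_l2_step(w0, g0, zero, zero)
w_aw, _, _ = adamw_step(w0, g0, zero, zero)
print(w_l2, w_aw)  # the two updates differ even on the first step
```

In the L2 version the decay is divided by sqrt(v_hat), so parameters with large gradient magnitudes are effectively decayed less; in the decoupled version every weight is shrunk by the same factor lr * wd regardless of its gradient history.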