Weight Decay
A regularization technique that shrinks model weights toward zero during training, either by adding a penalty proportional to the magnitude of the weights to the loss function or by decaying the weights directly, discouraging them from growing too large.
How It Works
Each weight update includes a term that shrinks weights toward zero: w = w - lr * (gradient + lambda * w), where lr is the learning rate and lambda is the decay coefficient. This discourages the model from relying too heavily on any single feature and reduces overfitting.
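The update rule above can be sketched directly; this is a minimal illustration (the function name `sgd_step` and the toy values are my own), showing that with zero gradient the decay term alone multiplies the weights by (1 - lr * lambda) each step:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD update with the weight-decay term folded into the gradient:
    w <- w - lr * (grad + weight_decay * w)."""
    return w - lr * (grad + weight_decay * w)

# With zero gradient, only the decay term acts: each step scales the
# weights by (1 - lr * weight_decay) = 0.999, pulling them toward zero.
w = np.array([1.0, -2.0])
for _ in range(3):
    w = sgd_step(w, np.zeros_like(w))
print(w)
```

After three steps the weights are scaled by 0.999**3, so large weights are steadily penalized unless the data gradient pushes back.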
L2 vs Decoupled
Traditional L2 regularization adds the penalty to the loss, so the decay term flows through the gradient computation. Decoupled weight decay (AdamW) applies the penalty directly to the weights, separate from the gradient. For plain SGD the two are equivalent, but for adaptive optimizers like Adam the L2 penalty gets rescaled by the adaptive per-parameter step sizes, so weights with a large gradient history are decayed less than intended. Decoupling avoids this, which is why it is preferred for Adam-based optimizers.
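The distinction can be seen by placing the two variants side by side. This is a simplified single-step sketch, not the full optimizer implementations (function names and hyperparameter defaults are my own); the only difference is where the `wd * w` term enters:

```python
import numpy as np

def adam_l2_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=0.01, t=1):
    """L2 regularization: the decay term is added to the gradient, so it
    is rescaled by Adam's adaptive denominator along with everything else."""
    g = grad + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01, t=1):
    """Decoupled (AdamW-style): the moment estimates see only the raw
    gradient; the decay is subtracted from the weights directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w, m, v

w0 = np.array([1.0])
g0 = np.array([0.5])
zero = np.zeros_like(w0)
w_l2, _, _ = adam_l2_step(w0, g0, zero, zero)
w_aw, _, _ = adamw_step(w0, g0, zero, zero)
print(w_l2, w_aw)  # the two updates differ even on the first step
```

In the L2 version the decay is divided by sqrt(v_hat), so parameters with large gradient magnitudes are effectively decayed less; in the decoupled version every weight is shrunk by the same factor lr * wd regardless of its gradient history.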