AI Glossary

Vanishing Gradient Problem

A training difficulty where gradients become extremely small as they propagate backward through many layers, preventing early layers from learning effectively.

Causes

When using activation functions like sigmoid or tanh, each layer multiplies the backpropagated gradient by the activation's local derivative, which is at most 0.25 for sigmoid and at most 1 for tanh (and usually much smaller away from zero). After many layers, the product of these small factors approaches zero, so early layers receive essentially no learning signal.
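The shrinking product can be seen directly with a small numerical sketch. The code below (my illustration, not from the source; the pre-activations are made-up random values) multiplies a gradient by the sigmoid's local derivative across 30 layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_layers = 30
grad = np.ones(10)  # start with a unit gradient for 10 hypothetical units

for _ in range(n_layers):
    pre = rng.normal(size=10)               # hypothetical pre-activations at this layer
    s = sigmoid(pre)
    grad = grad * s * (1.0 - s)             # sigmoid's local derivative is at most 0.25

print(np.max(np.abs(grad)))  # vanishingly small: bounded above by 0.25**30
```

Since every factor is at most 0.25, the gradient after 30 layers is bounded by 0.25^30 ≈ 10^-18, far below anything a learning rate can compensate for.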

Solutions

- ReLU activation (gradient is 1 for positive inputs)
- Skip/residual connections
- Batch/layer normalization
- Better weight initialization (He, Xavier)
- LSTM gates (for sequence models)

Together, these solutions enabled training networks with hundreds of layers.
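Skip connections help because they change the multiplicative structure of backpropagation: the gradient through a residual block y = x + f(x) is 1 + f'(x), so the identity path keeps each factor near 1. A minimal sketch (my illustration, assuming hypothetical small branch derivatives in place of real networks):

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers = 100
local = rng.uniform(-0.1, 0.1, size=n_layers)  # hypothetical per-layer branch derivatives f'(x)

# Plain stacking: the gradient is a product of small local derivatives.
plain = np.prod(local)

# Residual stacking: each factor is 1 + f'(x), so the product stays near 1.
residual = np.prod(1.0 + local)

print(abs(plain))   # vanishingly small
print(residual)     # order 1: the identity path preserves the gradient
```

The same effect explains LSTM gates: the cell-state path is close to an identity connection through time, so gradients survive across many steps.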


Last updated: March 5, 2026