AI Glossary

Gradient Descent

The core optimization algorithm used to train neural networks. It iteratively adjusts model weights in the direction of steepest loss decrease, i.e. opposite the gradient of the loss function.

The Intuition

Imagine standing on a hilly landscape in fog. You want to reach the lowest valley. Gradient descent computes the slope (gradient) at your current position and takes a step downhill. Repeat until you reach a minimum.
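That loop can be sketched in a few lines of Python. This is a minimal illustration, not a production optimizer: the quadratic f(x) = (x − 3)² is a toy stand-in for a real loss, and the learning rate and step count are arbitrary choices.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping opposite its gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # take a step downhill
    return x

# Toy loss f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
# The minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The learning rate `lr` controls the step size: too small and convergence crawls, too large and the iterates overshoot the valley and diverge.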

Key Variants

Batch GD: computes the gradient over the entire dataset. Accurate but slow.

Stochastic GD (SGD): computes the gradient on a single example. Fast but noisy.

Mini-batch GD: computes the gradient on a small batch. The practical standard.
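The mini-batch variant can be sketched as follows, fitting a one-parameter linear model y ≈ w·x by squared error. The data, learning rate, batch size, and epoch count here are all illustrative assumptions, not recommendations:

```python
import random

def minibatch_sgd(xs, ys, lr=0.01, epochs=200, batch_size=4):
    """Fit y ~ w*x with mini-batch gradient descent on squared error."""
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)  # a fresh pass over the data each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Average gradient of (w*x - y)^2 over the batch: 2*(w*x - y)*x
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x for x in xs]  # synthetic data with true weight 2
w = minibatch_sgd(xs, ys)
```

Setting `batch_size=len(xs)` recovers batch GD, and `batch_size=1` recovers pure SGD; mini-batching trades off the accuracy of the former against the speed and gradient noise of the latter.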

Modern Optimizers

Adam (the most popular), AdamW (Adam with decoupled weight decay), SGD with momentum, and Adafactor. These augment basic gradient descent with momentum and, in the Adam family, per-parameter adaptive learning rates, typically converging faster and more reliably.
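A single Adam update can be sketched as below. This is a simplified scalar version for illustration, using the defaults from the original Adam paper (β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the toy quadratic loss and learning rate are assumptions for the demo.

```python
def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus adaptive scaling from the
    running average of squared gradients (v)."""
    m = b1 * m + (1 - b1) * grad          # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction: the running
    v_hat = v / (1 - b2 ** t)             # averages start at zero
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Minimize the toy loss f(w) = (w - 3)^2 with Adam.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, lr=0.01)
```

Dividing by the square root of the second-moment estimate gives each parameter its own effective step size, which is what "adaptive learning rate" refers to.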


Last updated: March 5, 2026