Gradient Descent
The core optimization algorithm used to train neural networks: it iteratively adjusts model weights in the direction that most steeply reduces the loss function, with the step size controlled by a learning rate.
The Intuition
Imagine standing on a hilly landscape in fog. You want to reach the lowest valley. Gradient descent computes the slope (gradient) at your current position and takes a step downhill. Repeat until you reach a minimum, which may be a local minimum rather than the global one.
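The loop above can be sketched in a few lines. This is a toy example, not from the source: it minimizes f(w) = (w - 3)^2, whose gradient is 2(w - 3), with an illustrative learning rate of 0.1.

```python
# Toy gradient descent on f(w) = (w - 3)^2; gradient is f'(w) = 2*(w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0             # starting position on the "landscape"
lr = 0.1            # learning rate: how big a step to take downhill
for _ in range(100):
    w -= lr * grad(w)   # step opposite the gradient, reducing f

print(round(w, 4))  # approaches the minimum at w = 3
```

Each iteration shrinks the distance to the minimum by a constant factor here, so 100 steps land essentially on w = 3.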
Key Variants
Batch GD: Computes the gradient over the entire dataset. Accurate but slow.
Stochastic GD (SGD): Computes the gradient on a single example. Fast but noisy.
Mini-batch GD: Computes the gradient on a small batch. The practical standard.
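The three variants differ only in how many examples feed each gradient estimate. A minimal sketch, using a hypothetical dataset from y = 2x and a squared-error loss (both illustrative assumptions, not from the source):

```python
import random

# Hypothetical data from the true model y = 2x.
# Per-example loss: (w*x - y)^2, so the gradient is 2*x*(w*x - y).
data = [(x, 2.0 * x) for x in range(1, 9)]

def grad_estimate(w, examples):
    """Average gradient over whatever subset of examples we are given."""
    return sum(2 * x * (w * x - y) for x, y in examples) / len(examples)

w = 0.0
g_batch = grad_estimate(w, data)                   # batch GD: all 8 examples, exact
g_sgd   = grad_estimate(w, [random.choice(data)])  # SGD: 1 example, noisy
g_mini  = grad_estimate(w, random.sample(data, 4)) # mini-batch GD: 4 examples
```

All three estimates point the same way on average; the smaller the sample, the noisier the individual estimate, which is exactly the batch-size trade-off described above.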
Modern Optimizers
Adam (most popular), AdamW (with decoupled weight decay), SGD with momentum, and Adafactor. These add momentum and/or per-parameter adaptive learning rates to basic gradient descent, typically converging faster and more reliably.
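To make the momentum idea concrete, here is a minimal sketch of SGD with momentum on the same toy objective f(w) = (w - 3)^2; the hyperparameter values are illustrative assumptions, not prescriptions from the source:

```python
# SGD with momentum on the toy objective f(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
lr, beta = 0.1, 0.9       # learning rate and momentum coefficient (illustrative)
for _ in range(300):
    v = beta * v + grad(w)    # velocity: exponentially averaged past gradients
    w -= lr * v               # step along the smoothed direction
```

The velocity term damps the zig-zag of raw gradient steps and accelerates movement along consistent directions; adaptive methods like Adam additionally rescale the step per parameter.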