Backpropagation tells each weight how to change. The optimizer decides how much. This seemingly simple decision of how to update weights given their gradients has profound effects on training speed, stability, and final model quality. The right optimizer can mean the difference between a model that converges in hours and one that never converges at all.
Stochastic Gradient Descent (SGD)
SGD is the simplest optimizer. It updates each weight by subtracting the gradient scaled by a learning rate:
w = w - lr * gradient
In practice, SGD computes gradients on mini-batches rather than the full dataset, introducing beneficial noise that helps escape local minima. However, vanilla SGD has significant drawbacks: it oscillates in steep dimensions, moves slowly in flat dimensions, and is highly sensitive to the learning rate.
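The mini-batch loop can be sketched in a few lines of NumPy; this is a minimal illustration on a toy linear-regression problem (all names, sizes, and constants here are chosen for the example, not taken from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: recover w = 3 from noiseless data y = 3x.
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0]

w = np.zeros(1)
lr = 0.1
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        pred = X[batch] @ w
        # Gradient of mean squared error over this mini-batch only
        grad = 2.0 * X[batch].T @ (pred - y[batch]) / len(batch)
        w = w - lr * grad                        # the SGD update
```

Each update sees a different 32-example slice of the data, which is where the gradient noise comes from.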
SGD with Momentum
Momentum adds a velocity term that accumulates past gradients, smoothing out oscillations and accelerating movement in consistent directions:
v = beta * v - lr * gradient
w = w + v
The momentum coefficient (beta, typically 0.9) controls how much history to remember. Momentum helps SGD navigate ravines in the loss landscape where the surface curves steeply in one direction and gently in another.
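The two update lines above can be run on a toy "ravine": a quadratic that is 100 times steeper in one direction than the other. The loss function and constants below are illustrative:

```python
import numpy as np

# f(w) = 0.5 * (100 * w0^2 + w1^2): steep along w0, flat along w1
def grad(w):
    return np.array([100.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
v = np.zeros(2)
lr, beta = 0.01, 0.9

for _ in range(200):
    v = beta * v - lr * grad(w)   # velocity accumulates past gradients
    w = w + v                     # step along the velocity
```

At this learning rate, plain SGD would shrink the flat coordinate by only a factor of 0.99 per step; the accumulated velocity drives it toward zero much faster without destabilizing the steep coordinate.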
"SGD with momentum is like a ball rolling downhill. It accelerates in directions of consistent gradient and dampens oscillations. This simple physical intuition makes optimization dramatically more efficient."
Nesterov Accelerated Gradient (NAG)
NAG improves on momentum by computing the gradient at the anticipated future position rather than the current position. This "look-ahead" provides a corrective factor that reduces overshooting. In practice, NAG converges slightly faster than standard momentum on many problems.
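The only change from the momentum update is where the gradient is evaluated. A sketch on the same kind of illustrative quadratic:

```python
import numpy as np

def grad(w):
    return np.array([100.0 * w[0], w[1]])  # toy ill-conditioned quadratic

w = np.array([1.0, 1.0])
v = np.zeros(2)
lr, beta = 0.01, 0.9

for _ in range(200):
    lookahead = w + beta * v              # anticipated future position
    v = beta * v - lr * grad(lookahead)   # gradient at the look-ahead point
    w = w + v
```

Because the gradient already "sees" where momentum is about to carry the weights, the velocity is corrected one step earlier than in standard momentum.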
Adagrad
Adagrad adapts the learning rate for each parameter individually based on the sum of all past squared gradients. Parameters that receive large gradients get smaller learning rates; parameters that receive small gradients get larger learning rates. This is useful for sparse data where some features appear rarely.
The downside: the accumulated squared gradients grow monotonically, eventually shrinking the learning rate to near zero and stalling training.
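Both the per-parameter adaptation and the monotonic shrinkage are visible in a minimal sketch (toy loss, illustrative constants):

```python
import numpy as np

# f(w) = sum(w^2), so the gradient is 2*w elementwise.
def grad(w):
    return 2.0 * w

w = np.array([1.0, 1.0])
G = np.zeros(2)            # sum of squared gradients; only ever grows
lr, eps = 0.5, 1e-8

for _ in range(100):
    g = grad(w)
    G += g ** 2
    w = w - lr * g / (np.sqrt(G) + eps)

# The effective per-parameter rate is lr / sqrt(G), which only shrinks.
```

Because `G` never decreases, the effective step size can only go down over the course of training, which is exactly the stalling problem described above.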
RMSProp
RMSProp fixes Adagrad's decaying learning rate by using an exponentially decaying average of squared gradients instead of the full sum:
v = decay * v + (1 - decay) * gradient^2
w = w - lr * gradient / sqrt(v + epsilon)
The decay rate (typically 0.9) controls how quickly old gradients are forgotten. RMSProp adapts learning rates per-parameter while maintaining the ability to continue learning. It was proposed by Geoffrey Hinton in a lecture and has never been formally published, yet it remains widely used.
Key Takeaway
RMSProp normalizes the gradient by its recent magnitude. If a parameter's gradient has been large, the effective learning rate is reduced. If it has been small, the effective learning rate is increased. This adaptive behavior makes it much less sensitive to the initial learning rate choice.
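Running the two-line update from the text on an illustrative ill-conditioned quadratic shows this equalizing effect:

```python
import numpy as np

def grad(w):
    return np.array([100.0 * w[0], w[1]])  # 100x steeper in w0

w = np.array([1.0, 1.0])
v = np.zeros(2)
lr, decay, eps = 0.01, 0.9, 1e-8

for _ in range(500):
    g = grad(w)
    v = decay * v + (1 - decay) * g ** 2
    w = w - lr * g / np.sqrt(v + eps)
```

Despite the 100x difference in gradient scale, both coordinates shrink at a similar pace: once `v` tracks the squared gradient, each step is roughly `lr` in size, so near the minimum the iterates hover in a band on the order of the learning rate.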
Adam (Adaptive Moment Estimation)
Adam is the most popular optimizer in deep learning. It combines the best ideas from momentum and RMSProp:
- First moment (mean): An exponentially decaying average of past gradients, similar to momentum.
- Second moment (variance): An exponentially decaying average of past squared gradients, similar to RMSProp.
- Bias correction: Corrects the initial bias toward zero that occurs when the running averages start from zero.
m = beta1 * m + (1 - beta1) * gradient # momentum
v = beta2 * v + (1 - beta2) * gradient^2 # RMSProp
m_hat = m / (1 - beta1^t) # bias correction
v_hat = v / (1 - beta2^t) # bias correction
w = w - lr * m_hat / (sqrt(v_hat) + epsilon) # update
Default hyperparameters (beta1=0.9, beta2=0.999, epsilon=1e-8) work well for most problems, making Adam an excellent default choice.
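The five lines above translate directly into NumPy; here is a sketch on the same kind of illustrative ill-conditioned quadratic used earlier:

```python
import numpy as np

def grad(w):
    return np.array([100.0 * w[0], w[1]])  # toy ravine

w = np.array([1.0, 1.0])
m = np.zeros(2)            # first moment: running mean of gradients
v = np.zeros(2)            # second moment: running mean of squared gradients
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 1001):   # t starts at 1 for the bias correction
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)       # undo the zero-initialization bias
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
```

The bias correction matters most early on: at t = 1, `m` is only `(1 - beta1) * g = 0.1 * g`, and dividing by `(1 - beta1**1) = 0.1` restores the full gradient.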
AdamW (Adam with Decoupled Weight Decay)
AdamW is a corrected version of Adam that properly implements weight decay regularization. In standard Adam, L2 regularization is added to the gradient, but this interacts poorly with Adam's adaptive learning rates. AdamW decouples weight decay from the gradient update, applying it directly to the weights:
w = w - lr * (adam_update + weight_decay * w)
AdamW has become the default optimizer for Transformer training and is recommended over standard Adam whenever weight decay is used.
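A single AdamW step can be sketched so that the decoupling is visible in the final line. The function name and defaults below are illustrative; in practice this logic lives inside the framework's optimizer classes:

```python
import numpy as np

def adamw_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_update = m_hat / (np.sqrt(v_hat) + eps)
    # Decay is applied directly to the weights, never mixed into the
    # gradient, so it is NOT rescaled by the adaptive denominator.
    w = w - lr * (adam_update + weight_decay * w)
    return w, m, v
```

With plain Adam plus L2 regularization, the decay term would pass through the division by `sqrt(v_hat)`, so weights with large recent gradients would be regularized less; decoupling avoids that interaction.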
Choosing the Right Optimizer
- Default choice: Start with Adam or AdamW (with weight decay). They work well across most tasks with minimal tuning.
- Best final performance (computer vision): SGD with momentum often achieves slightly better final accuracy than Adam on image classification, but requires more careful learning rate tuning and scheduling.
- Transformers and NLP: AdamW is the standard choice. Combined with learning rate warmup and cosine decay, it reliably trains large language models.
- Sparse data or embeddings: Adam or Adagrad. Their per-parameter learning rates handle sparse gradients well.
- When in doubt: Adam with default parameters is rarely a bad choice.
Key Takeaway
Adam converges faster but SGD with momentum can generalize better with proper tuning. For quick experimentation, use Adam. For squeezing out the last bit of performance on a well-understood problem, tune SGD with momentum and a learning rate schedule.
Learning Rate Scheduling
The learning rate is the most important hyperparameter. Too high and training diverges; too low and training is painfully slow. Learning rate schedulers adjust the rate during training:
- Step decay: Reduce the learning rate by a factor at predetermined epochs (e.g., divide by 10 at epoch 30 and 60).
- Cosine annealing: Smoothly decrease the learning rate following a cosine curve. Popular for Transformers.
- Warmup: Start with a very small learning rate and linearly increase it for the first few thousand steps. Prevents early instability, especially with Adam.
- One-cycle policy: Increase the learning rate to a maximum, then decrease it below the initial rate. Often achieves fast convergence and good generalization.
- Reduce on plateau: Monitor validation loss and reduce the learning rate when it stops improving.
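Warmup and cosine decay are often combined into a single schedule. Here is a minimal sketch; the step counts and base rate are illustrative defaults, not values from any particular framework:

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps              # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The schedule starts at zero, reaches the full base rate at the end of warmup, and decays smoothly to (numerically) zero by the final step.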
Practical Tips
- Start with Adam, lr=3e-4 for most deep learning experiments. This is a reliable default.
- Use learning rate warmup for Transformers and when training with large batch sizes.
- Monitor training curves. If the loss oscillates wildly, reduce the learning rate. If it plateaus early, the learning rate may be too low or the model too small.
- Use weight decay (0.01 is a common default) with AdamW for regularization.
- Try a learning rate finder. Sweep the learning rate from very small to very large over one epoch and plot loss vs. learning rate. The optimal initial learning rate is just before the loss starts increasing.
- Do not over-tune the optimizer. Spending time on data quality, augmentation, and architecture usually yields bigger gains than optimizing optimizer hyperparameters.
The optimizer is the engine that drives training. Understanding how SGD, Adam, and their variants navigate the loss landscape gives you the intuition to debug training issues and make informed decisions about your training configuration. While the choice of optimizer matters, remember that data, architecture, and regularization typically have a larger impact on final performance.
