What is an Optimizer?
Picture a ball placed on the side of a hilly landscape in the dark. The ball wants to roll to the lowest point, the valley floor, but it cannot see the whole terrain. All it can feel is the slope directly beneath it, the direction that goes downhill from where it currently sits. An optimizer is the algorithm that decides how this ball moves: how big a step to take, in what direction, and how to adjust its strategy as the terrain changes.
In machine learning, an optimizer is the algorithm responsible for updating a model's parameters (its weights and biases) during training. The loss function tells the model how wrong its predictions are. The optimizer takes that feedback and determines exactly how to adjust each of the model's millions of parameters to reduce that error. It is the mechanism that turns the abstract concept of "learning" into concrete numerical updates.
Without an optimizer, a neural network is just a fixed mathematical function. With one, it becomes a system capable of improving itself. The choice of optimizer can dramatically affect how quickly a model trains, whether it converges at all, and the quality of the final result. It is one of the most critical decisions in any machine learning project.
Gradient Descent: The Foundation
Nearly every modern optimizer is built on the same fundamental idea: gradient descent. The concept is beautifully simple. The gradient of the loss function with respect to a parameter tells you the direction and rate of steepest increase in loss. To decrease the loss, you step in the opposite direction.
Mathematically, the update rule is straightforward: new weight equals old weight minus the learning rate times the gradient. The learning rate is a small positive number that controls the size of each step. Multiply the gradient (direction) by the learning rate (step size), subtract it from the current weight, and you get a new weight that should produce a slightly lower loss.
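The update rule described above can be sketched in a few lines of Python. This is a toy illustration, not a real training loop: the loss, its gradient, and the learning rate are all chosen for clarity.

```python
# A minimal sketch of the gradient descent update rule on a toy
# one-dimensional problem: minimize loss(w) = (w - 3)^2.
# The gradient of this loss is 2 * (w - 3), so the minimum sits at w = 3.

def gradient(w):
    return 2.0 * (w - 3.0)

learning_rate = 0.1
w = 0.0  # arbitrary starting weight

for step in range(100):
    # new weight = old weight - learning rate * gradient
    w = w - learning_rate * gradient(w)

print(round(w, 4))  # prints 3.0: the weight has converged to the minimum
```

Each pass through the loop nudges the weight a little further downhill, exactly as the prose describes: the gradient supplies the direction, the learning rate scales the step.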
In batch gradient descent, you compute the gradient using the entire training dataset at once. This gives you the most accurate estimate of the true gradient, but it is extremely slow for large datasets because you must process every single example before taking one step.
Stochastic Gradient Descent (SGD) takes the opposite approach: it computes the gradient from a single randomly chosen training example. This is very fast but very noisy; each step is based on just one example, so the direction can vary wildly from step to step. Paradoxically, this noise can be beneficial because it helps the optimizer escape local minima and saddle points.
Mini-batch gradient descent is the practical compromise used by virtually everyone. It computes the gradient from a small random subset (typically 32 to 512 examples) of the training data. This balances the accuracy of batch gradient descent with the speed and beneficial noise of SGD. When people say "SGD" in practice, they usually mean mini-batch gradient descent.
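A mini-batch loop can be sketched as follows. Everything here is illustrative: the synthetic data (a one-parameter model with true slope 2.0), the batch size of 32, and the learning rate are arbitrary choices made for the example.

```python
import random

# Sketch of mini-batch gradient descent fitting a one-parameter model
# y = w * x to synthetic data generated with true slope w = 2.0.

random.seed(0)
data = [(x, 2.0 * x) for x in [random.uniform(-1, 1) for _ in range(1000)]]

w = 0.0
learning_rate = 0.1
batch_size = 32

for epoch in range(20):
    random.shuffle(data)  # fresh random batches each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of the squared error 0.5 * (w*x - y)^2 with respect
        # to w is (w*x - y) * x, averaged over the mini-batch.
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad

print(round(w, 3))  # prints 2.0 with this seed: the true slope is recovered
```

Each step uses only 32 examples, so individual gradients are noisy estimates, yet the averaged noise washes out over many updates and the fit converges.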
Popular Optimizers: SGD, Adam, and Beyond
Vanilla SGD works, but it has significant limitations. It treats every parameter the same way, uses a fixed step size, and can oscillate in ravines where the surface curves much more steeply in one direction than another. Over the years, researchers have developed more sophisticated optimizers to address these problems.
SGD with Momentum adds a "memory" of previous gradient directions. Instead of just following the current gradient, the optimizer maintains a running average of past gradients and uses this to determine its direction. Think of a ball rolling downhill: momentum keeps it moving even through small bumps and flat spots, and it accelerates in consistent downhill directions. This dramatically reduces oscillation and speeds up convergence in many cases.
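The momentum update can be sketched on the same toy quadratic. The coefficient beta = 0.9 is a conventional default, not a requirement, and the step count and learning rate are illustrative.

```python
# Sketch of SGD with momentum on the toy loss(w) = (w - 3)^2.
# The velocity term accumulates a decaying sum of past gradients.

def gradient(w):
    return 2.0 * (w - 3.0)

w, velocity = 0.0, 0.0
learning_rate, beta = 0.05, 0.9

for step in range(300):
    velocity = beta * velocity + gradient(w)  # remember past directions
    w = w - learning_rate * velocity          # step along the smoothed direction

print(round(w, 4))  # prints 3.0: converged to the minimum
```

The key difference from plain gradient descent is that the step depends on the velocity, a mix of the current gradient and all previous ones, so consistent downhill directions compound while alternating directions partially cancel.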
RMSProp (Root Mean Square Propagation) addresses a different problem: it adapts the learning rate for each parameter individually. Parameters with large, frequent gradients get smaller learning rates, while parameters with small, infrequent gradients get larger ones. This is especially useful for problems with sparse features, where some parameters need to be updated much more aggressively than others.
Adam (Adaptive Moment Estimation) combines the best ideas from momentum and RMSProp. It maintains both a running average of gradients (like momentum) and a running average of squared gradients (like RMSProp), using both to adaptively set the learning rate and direction for each parameter. Adam has become the default optimizer for most deep learning projects because it works well out of the box across a wide range of problems with minimal hyperparameter tuning.
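The Adam update combines both running averages, as sketched below on the same toy quadratic. The hyperparameters beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the published defaults; the learning rate and step count are illustrative.

```python
import math

# Sketch of the Adam update on the toy loss(w) = (w - 3)^2.

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0
m, v = 0.0, 0.0                      # first and second moment estimates
learning_rate = 0.01
beta1, beta2, eps = 0.9, 0.999, 1e-8

for t in range(1, 5001):
    g = gradient(w)
    m = beta1 * m + (1 - beta1) * g        # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * g * g    # RMSProp-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction: both averages start
    v_hat = v / (1 - beta2 ** t)           # at zero and need rescaling early on
    w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)

print(round(w, 2))  # settles close to the minimum at 3.0
```

The bias-correction terms are the one piece Adam adds beyond momentum and RMSProp: because both moment estimates are initialized at zero, the raw averages are biased toward zero in the first steps, and dividing by (1 - beta^t) compensates.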
AdamW is a variant that decouples weight decay (a regularization technique that prevents weights from growing too large) from the adaptive gradient update, applying the decay directly to the weights instead of folding it into the gradient, so the regularization behaves consistently regardless of gradient scale. It has become the preferred optimizer for training large language models and is used in models like GPT, BERT, and many other modern architectures. Other notable optimizers include AdaGrad, Nadam, LAMB (used for large-batch training), and more recent innovations like Lion and Sophia.
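The decoupling can be sketched as follows. The function below is an illustrative single AdamW-style step, not a library API; the hyperparameter values are arbitrary, and the zero-gradient loop exists only to isolate the decay term.

```python
import math

# Sketch of one AdamW-style parameter update. The key line is the last
# one: weight decay is applied directly to the weight, outside the
# adaptive update, rather than being added to the gradient.

def adamw_step(w, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    # Moment estimates see only the raw gradient; decay never enters them.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the weight itself, so the decay
    # strength does not get rescaled by the adaptive denominator.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, m, v = adamw_step(w, g=0.0, m=m, v=v, t=t)  # zero gradient: only decay acts
print(round(w, 3))  # prints 0.366: the weight shrank by 1% per step, 100 times
```

In classic Adam, decay added to the gradient gets divided by the adaptive denominator, so heavily-updated weights are regularized less than intended; decoupling removes that interaction.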
The Learning Rate Connection
The optimizer and the learning rate are deeply intertwined. The learning rate controls the size of each step the optimizer takes. No matter how sophisticated the optimizer, if the learning rate is wrong, training will suffer.
A learning rate that is too large causes the optimizer to take steps that are too big, overshooting the minimum and potentially causing the loss to diverge (increase without bound). The ball rolls down the hill so fast that it flies right past the valley and up the other side, bouncing wildly back and forth. In extreme cases, the loss explodes to infinity and training crashes.
A learning rate that is too small makes the optimizer painfully slow. Each step barely moves the parameters, and training takes an impractical amount of time. Worse, a tiny learning rate can trap the optimizer in local minima or saddle points that it would easily escape with larger steps.
Modern practice almost always uses learning rate schedules that change the learning rate over the course of training. A common strategy is to start with a higher learning rate for rapid initial progress, then gradually reduce it as training progresses to allow fine-grained convergence. Warmup schedules start with a very low rate and ramp up, which helps stabilize training in the early stages when gradients can be erratic.
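A warmup-then-decay schedule of the kind described above can be sketched as a pure function of the step number. The specific step counts and rates here are invented for illustration; real values depend on the model and dataset.

```python
import math

# Sketch of a common learning rate schedule: linear warmup for the
# first steps, then cosine decay down to a small floor.

def lr_at(step, total_steps=1000, warmup_steps=100,
          base_lr=0.001, min_lr=0.0001):
    if step < warmup_steps:
        # Linear ramp from near zero up to the base learning rate.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0))    # tiny rate at the very start of warmup
print(lr_at(99))   # full base rate once warmup completes
print(lr_at(999))  # near the floor at the end of training
```

Because the schedule is just a function of the step count, it composes with any optimizer: each training step simply looks up the current rate before applying the update.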
Adaptive optimizers like Adam partially address the learning rate problem by maintaining per-parameter learning rates. But even with Adam, the base learning rate is a critical hyperparameter. A well-tuned learning rate can mean the difference between a model that converges in hours and one that never converges at all.
Key Takeaway
The optimizer is the navigator of the training process. While the loss function defines the destination (minimum loss) and the gradients point the direction, the optimizer decides exactly how to move through the loss landscape: how fast, how adaptively, and with what memory of past steps.
The key concepts to remember are as follows. Gradient descent is the foundation: move opposite to the gradient to reduce loss. SGD, momentum, and Adam represent an evolution of increasingly sophisticated optimization strategies. Adaptive optimizers like Adam adjust learning rates per-parameter, making them more robust. The learning rate is the most important hyperparameter, and it should typically be scheduled rather than fixed. And the choice of optimizer can significantly impact training speed, stability, and final model quality.
Every AI model you interact with was trained by an optimizer. The chatbot that answers your questions, the recommendation system that suggests your next video, the image generator that creates art from text, all of these were shaped by billions of tiny parameter updates, each one carefully computed by an optimizer working tirelessly to find the valley in a landscape of unimaginable complexity.
Next: What is Learning Rate? →