What is the Learning Rate?

Imagine you are trying to find the bottom of a dark valley by taking steps downhill. You can feel which direction slopes down, but you have to decide how big each step should be. Take steps that are too large and you might leap right over the lowest point and end up climbing the other side. Take steps that are too tiny and you will still be walking at sunrise without making meaningful progress. The learning rate is this step size.

The learning rate is a hyperparameter, a value set before training begins, that controls how much the model's parameters change in response to the estimated error each time they are updated. Mathematically, it is a small positive number (typically between 0.0001 and 0.1) that multiplies the gradient before it is subtracted from the current weights.
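To make the update rule concrete, here is a minimal sketch in pure Python. The toy loss, its gradient, and the starting weight are illustrative choices, not part of any particular framework:

```python
# One gradient-descent update on a single parameter.
# Toy loss: L(w) = (w - 3)^2, whose minimum is at w = 3.

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2 * (w - 3)

learning_rate = 0.1  # the hyperparameter this article is about
w = 0.0              # current weight (arbitrary starting point)

# Core update rule: subtract the gradient scaled by the learning rate.
w = w - learning_rate * gradient(w)
print(w)  # the weight moves from 0.0 toward the minimum at 3.0
```

Run repeatedly, this loop is all that gradient descent does; the learning rate is the only knob controlling how far each update travels.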

Despite being just a single number, the learning rate is widely considered the most important hyperparameter in deep learning. It has a more profound effect on training dynamics than almost any other setting. A well-chosen learning rate can mean the difference between a model that converges beautifully in a few hours and one that either explodes or crawls so slowly that it never finishes training.

The learning rate controls the fundamental trade-off between speed and precision. Large steps cover ground quickly but sacrifice accuracy. Small steps are precise but painfully slow. Finding the right balance is both an art and a science, and it is a challenge that every machine learning practitioner faces.

Too High vs. Too Low

Understanding the extremes of learning rate is the best way to build intuition for why it matters so much.

When the learning rate is too high, the optimizer takes enormous steps through the loss landscape. Instead of smoothly descending toward the minimum, the model's parameters swing wildly from one extreme to another. The loss does not decrease; instead, it oscillates erratically or even increases over time. In severe cases, the loss can explode to infinity, producing NaN (Not a Number) values in the model's weights and effectively destroying the training process. This is called divergence.

You can visualize this like a ball bouncing back and forth across a valley, each bounce taking it higher than the last. The ball has so much energy that it can never settle down. In a training log, you would see the loss jumping up and down unpredictably, sometimes accompanied by sudden spikes. If you encounter this behavior, reducing the learning rate is usually the first and most effective fix.
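The bouncing-ball picture is easy to reproduce on a toy problem. In the sketch below, an arbitrary quadratic loss is used; for L(w) = w² any learning rate above 1.0 is unstable, so each update overshoots and the weight grows instead of settling:

```python
# Divergence on the toy loss L(w) = w**2 (gradient = 2*w).
# For this loss, any learning rate above 1.0 makes |w| grow every step.

def step(w, lr):
    return w - lr * 2 * w

w = 1.0
for _ in range(10):
    w = step(w, lr=1.1)  # too high: each step overshoots the minimum

print(abs(w))  # |w| has grown instead of shrinking toward 0
```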

When the learning rate is too low, the opposite problem occurs. The optimizer inches forward in microscopic steps. Training progresses, but at a glacially slow pace. A model that should converge in hours might take days or weeks. Worse, an extremely low learning rate can cause the model to get trapped in a local minimum or saddle point: the updates are too small to carry the model out of these suboptimal regions of the loss landscape.

A too-low learning rate also means the model might converge to a sharp minimum rather than a flat one. Research has shown that sharp minima tend to generalize poorly to new data, while flatter minima, often reached with slightly larger learning rates, tend to produce models that perform better on unseen examples. So being too cautious with the learning rate does not just waste time; it can actually hurt the final model quality.
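The contrast between a too-low and a reasonable learning rate shows up even on the same toy quadratic as before. The specific values below are arbitrary illustrations:

```python
# Comparing convergence speed on the toy loss L(w) = w**2.

def run(lr, steps=100, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w  # gradient of w**2 is 2*w
    return w

slow = run(lr=0.001)  # too low: after 100 steps, w has barely moved
good = run(lr=0.1)    # reasonable: w is essentially at the minimum (0)

print(slow, good)
```

With the same step budget, the low learning rate leaves most of the loss on the table while the reasonable one has already converged.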

Finding the Sweet Spot

The ideal learning rate is one that allows the model to converge quickly and reliably to a good solution. But finding it is not always straightforward, because the optimal value depends on the model architecture, the dataset, the optimizer, the batch size, and many other factors.

One of the most popular techniques for finding a good learning rate is the learning rate range test, introduced by Leslie Smith. The idea is simple: start with a very small learning rate and gradually increase it over a short training run, logging the loss at each step. Plot the loss against the learning rate and look for the steepest downward slope. The learning rate at that steepest point is often a good starting value. If the loss starts increasing, you have gone too far.
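A minimal sketch of the range test on a toy quadratic loss is shown below. Real implementations (for example in fastai or PyTorch Lightning) run this over mini-batches of an actual model; everything here, including the growth factor and step count, is an illustrative stand-in:

```python
# Learning rate range test: grow the lr geometrically during a short run,
# log the loss at each step, and pick the lr with the steepest drop.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0
lr = 1e-4
lrs, losses = [], []
for _ in range(60):
    lrs.append(lr)
    losses.append(loss(w))
    w = w - lr * grad(w)
    lr *= 1.3  # exponentially increase the learning rate each step

# Steepest downward slope = largest loss decrease between consecutive steps.
drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
best_lr = lrs[drops.index(max(drops))]
print(best_lr)
```

By the end of the run the learning rate is far too large and the loss has blown up; the candidate value is the one logged where the loss was falling fastest.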

Common starting points provide useful guidelines. For SGD, a learning rate of 0.01 to 0.1 is typical. For Adam, 0.001 (1e-3) is the default, and values between 1e-4 and 3e-4 are common in practice. For fine-tuning pretrained models, much smaller learning rates like 1e-5 or 2e-5 are standard because the model's weights are already in a good region and you do not want to disturb them too aggressively.

The batch size also interacts with the learning rate. A common rule of thumb, known as the linear scaling rule, suggests that when you double the batch size, you should also double the learning rate. This is because larger batches provide more accurate gradient estimates, allowing the optimizer to safely take larger steps. However, this rule breaks down at very large batch sizes, and more sophisticated techniques are needed.
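The linear scaling rule fits in a one-line helper. The base values below are illustrative; the rule itself is just the proportionality described above:

```python
# Linear scaling rule: scale the learning rate with the batch size.

def scaled_lr(base_lr, base_batch, new_batch):
    # Double the batch size -> double the learning rate.
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.1, 256, 512))   # twice the base learning rate
print(scaled_lr(0.1, 256, 1024))  # four times the base learning rate
```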

Ultimately, finding the right learning rate often involves some experimentation. Most practitioners run a few short training trials with different learning rates, observe the loss curves, and choose the value that produces the fastest, smoothest convergence. This process, combined with learning rate schedules, is one of the most impactful optimizations you can make.

Learning Rate Schedules

In modern deep learning, it is rare to use a fixed learning rate throughout training. Instead, practitioners use learning rate schedules (also called learning rate policies or decay strategies) that systematically change the learning rate over the course of training.

The intuition is compelling. At the beginning of training, the model's parameters are randomly initialized and far from any good solution. Large learning rates are helpful here because they allow rapid exploration of the loss landscape. As training progresses and the model approaches a good minimum, smaller learning rates enable precise fine-tuning without overshooting.

Step decay is the simplest schedule: reduce the learning rate by a fixed factor (like dividing by 10) at predetermined epochs. For example, you might start with a learning rate of 0.1, reduce it to 0.01 at epoch 30, and to 0.001 at epoch 60. This approach is straightforward but requires manually choosing the decay points.
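A step-decay schedule matching that example can be written as a small function; the boundaries and factor are the same illustrative choices as above:

```python
# Step decay: start at 0.1, divide by 10 at epochs 30 and 60.

def step_decay(epoch, base_lr=0.1, drop_epochs=(30, 60), factor=0.1):
    lr = base_lr
    for boundary in drop_epochs:
        if epoch >= boundary:
            lr *= factor  # apply the decay at each boundary already passed
    return lr

print(step_decay(0), step_decay(30), step_decay(60))  # lr at epochs 0, 30, 60
```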

Cosine annealing smoothly decreases the learning rate following a cosine curve from its initial value to near zero over the training period. This produces a gradual, natural-feeling decay that many practitioners prefer over abrupt step changes. Cosine annealing with restarts periodically resets the learning rate to its initial value, allowing the model to escape local minima and explore new regions of the loss landscape.
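The cosine schedule can be written directly from its formula. The max_lr and min_lr values below are illustrative:

```python
import math

# Cosine annealing from max_lr down to min_lr over total_steps.
def cosine_lr(step, total_steps, max_lr=0.1, min_lr=0.0):
    progress = step / total_steps  # fraction of training completed
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Starts at max_lr, passes the midpoint halfway through, ends at min_lr.
print(cosine_lr(0, 100), cosine_lr(50, 100), cosine_lr(100, 100))
```

Restart variants simply reset `step` to zero at each restart boundary, snapping the learning rate back to max_lr.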

Warmup is a technique where the learning rate starts very small and gradually increases to its target value over the first few hundred or thousand training steps. This is especially important for large models and large batch sizes, where the initial gradients can be very noisy and large learning rates would cause instability. The combination of warmup followed by cosine decay is the standard schedule for training modern large language models. Models like GPT, LLaMA, and many others use a warmup period of a few thousand steps followed by cosine decay to a minimum learning rate.
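Warmup followed by cosine decay composes the two pieces above. The step counts and learning rate values below are illustrative, not taken from any particular model:

```python
import math

# Linear warmup from 0 to max_lr, then cosine decay down to min_lr.
def warmup_cosine(step, warmup_steps=1000, total_steps=10000,
                  max_lr=3e-4, min_lr=3e-5):
    if step < warmup_steps:
        # Warmup phase: ramp up linearly.
        return max_lr * step / warmup_steps
    # Decay phase: cosine curve over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# lr at the start, mid-warmup, end of warmup, and end of training.
print(warmup_cosine(0), warmup_cosine(500),
      warmup_cosine(1000), warmup_cosine(10000))
```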

Cyclical learning rates oscillate between a minimum and maximum value throughout training. The idea, pioneered by Leslie Smith, is that periodically increasing the learning rate helps the model escape sharp minima and find flatter, more generalizable solutions. The "one-cycle" policy, where the learning rate rises from a low value to a maximum and then decreases again over a single training run, has proven remarkably effective for many tasks.
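A sketch of the triangular cyclical schedule is below; the one-cycle policy is essentially a single such cycle spanning the whole run. The min_lr, max_lr, and cycle_steps values are illustrative:

```python
# Triangular cyclical learning rate: rise linearly for half a cycle,
# fall linearly for the other half, then repeat.

def triangular_lr(step, min_lr=0.001, max_lr=0.01, cycle_steps=2000):
    half = cycle_steps / 2
    pos = step % cycle_steps       # position within the current cycle
    if pos < half:
        frac = pos / half          # rising half of the cycle
    else:
        frac = (cycle_steps - pos) / half  # falling half of the cycle
    return min_lr + (max_lr - min_lr) * frac

# lr at the start, peak, and end of one cycle.
print(triangular_lr(0), triangular_lr(1000), triangular_lr(2000))
```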

Key Takeaway

The learning rate is the single most influential knob you can turn when training a neural network. It controls the step size of the optimizer, directly determining how fast the model learns and whether it converges to a good solution.

The essential principles are clear. Too high and the model diverges, bouncing wildly across the loss landscape. Too low and the model crawls, potentially getting trapped in suboptimal regions. The sweet spot depends on the model, data, optimizer, and batch size, and can be found through range tests and experimentation. Learning rate schedules, especially warmup combined with cosine decay, are standard practice in modern deep learning and consistently produce better results than fixed learning rates.

When you hear about a breakthrough AI model that achieves remarkable performance, remember that behind the headlines, someone carefully tuned the learning rate. They ran experiments, plotted loss curves, and crafted a schedule that allowed the model to learn efficiently from billions of examples. The learning rate may be just a number, but it is the number that makes deep learning work.

Next: What is Backpropagation? →