The learning rate is arguably the single most important hyperparameter in deep learning. Too high, and training diverges. Too low, and training takes forever -- or gets stuck in a poor local minimum. But the optimal learning rate is not a fixed value; it changes as training progresses. Learning rate scheduling dynamically adjusts the learning rate during training, and getting it right can dramatically improve both training speed and final model quality.
Why a Fixed Learning Rate Falls Short
At the beginning of training, the model's weights are far from optimal, and the loss landscape has steep gradients. A relatively large learning rate helps the optimizer take big steps and make rapid progress. But as training continues and the model approaches a good solution, those same large steps cause the optimizer to overshoot and oscillate around the minimum, never converging precisely.
A small fixed learning rate avoids the oscillation problem but wastes time in the early phase, creeping slowly toward the general region of the minimum when it could be taking larger steps.
The ideal strategy: start with a learning rate large enough for fast progress, then gradually reduce it to allow fine-grained convergence. Learning rate scheduling automates this intuition.
Common Scheduling Strategies
Step Decay
The simplest scheduling approach: multiply the learning rate by a constant factor (e.g., 0.1) at predetermined epochs. For example, start at 0.1, drop to 0.01 at epoch 30, then to 0.001 at epoch 60.
Step decay was the standard in early deep learning and remains popular for its simplicity. However, the abrupt drops in learning rate can briefly destabilize training, and choosing the right milestone epochs requires experimentation.
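As a concrete sketch, step decay reduces to a small function of the epoch. The base rate, factor, and milestones below are the illustrative values from the example above, not universal defaults:

```python
def step_decay_lr(epoch, base_lr=0.1, factor=0.1, milestones=(30, 60)):
    """Return the step-decayed learning rate for a given epoch.

    The rate is multiplied by `factor` once for every milestone
    epoch that has already been reached.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

# Epochs 0-29 train at 0.1, epochs 30-59 at 0.01, epochs 60+ at 0.001.
```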
Exponential Decay
Instead of sudden drops, exponential decay reduces the learning rate smoothly:
lr_t = lr_0 * gamma^t
where gamma is a decay factor (e.g., 0.95) and t is the epoch. This provides smoother transitions than step decay but can reduce the learning rate too aggressively if gamma is set too low.
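The formula translates directly into code; the base rate and gamma below are example values, not recommendations:

```python
def exp_decay_lr(epoch, base_lr=0.1, gamma=0.95):
    """Exponential decay: lr_t = lr_0 * gamma^t."""
    return base_lr * gamma ** epoch
```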
Cosine Annealing
Cosine annealing decreases the learning rate following a cosine curve:
lr_t = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
where T is the total number of training steps. The learning rate starts at lr_max, decreases slowly at first, then more rapidly in the middle, and slowly again near the end. This smooth curve is one of the most widely used schedules in modern deep learning.
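The cosine formula above can be sketched directly; the peak and minimum rates below are placeholders:

```python
import math

def cosine_annealing_lr(t, T, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing from lr_max (at t = 0) down to lr_min (at t = T)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))
```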
A variant called cosine annealing with warm restarts (SGDR) periodically resets the learning rate to its maximum value, creating multiple cosine cycles. This can help the optimizer escape poor local minima.
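A minimal sketch of warm restarts simply wraps the cosine schedule around a cycle length. Note this is a simplification: the original SGDR paper also allows each successive cycle to be longer than the last, which this fixed-cycle version omits:

```python
import math

def sgdr_lr(step, cycle_len, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing with warm restarts, using fixed-length cycles.

    The learning rate follows a cosine decay within each cycle and
    jumps back to lr_max at the start of the next one.
    """
    t = step % cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))
```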
Key Takeaway
Cosine annealing has become the default learning rate schedule for many modern architectures. Its smooth decay profile avoids the instabilities of step decay while providing effective convergence.
Warmup: Starting Gently
Learning rate warmup starts training with a very small learning rate and gradually increases it to the target value over a number of initial steps. This technique is essential for training transformers and large models.
Why is warmup needed? At the very beginning of training, the model's parameters are random, and the gradients can be noisy and unreliable. A large learning rate combined with unstable gradients can send the model to a bad region of the parameter space from which it never recovers. Warmup gives the optimizer time to calibrate its gradient estimates (especially for adaptive methods like Adam) before taking large steps.
The original transformer paper used a warmup schedule that increased linearly for the first 4,000 steps, then decayed proportionally to the inverse square root of the step number. Most modern LLM training uses linear warmup followed by cosine decay.
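The transformer paper's schedule can be written compactly; d_model = 512 and 4,000 warmup steps below match that paper's base configuration:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup then inverse-square-root decay, as in the
    original transformer paper:

        lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)

    The linear branch dominates until `warmup_steps`, after which
    the inverse-square-root branch takes over.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```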
Warmup with Cosine Decay
The most common schedule for training LLMs and transformers today:
- Linear warmup: Learning rate increases linearly from 0 (or near-zero) to the peak learning rate over a fixed number of warmup steps
- Cosine decay: After warmup, the learning rate follows a cosine curve down to a minimum value (often 10% of the peak)
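Put together, a minimal sketch of this schedule looks like the following; the default 10%-of-peak floor mirrors the common choice mentioned above, and any specific peak rate or warmup length you pass in is up to you:

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=None):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if min_lr is None:
        min_lr = 0.1 * peak_lr  # the common "10% of peak" floor
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```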
One-Cycle Policy
Proposed by Leslie Smith, the one-cycle policy is a two-phase schedule:
- First half: Learning rate increases from a low value to a maximum
- Second half: Learning rate decreases from the maximum to a value much lower than the starting point
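A sketch of the two phases with linear ramps (Smith's original formulation used linear ramps; some libraries substitute cosine ramps). The divisors used for the starting and final rates below are illustrative assumptions, not values from the paper:

```python
def one_cycle_lr(step, total_steps, max_lr, start_lr=None, final_lr=None):
    """Two-phase one-cycle schedule with linear ramps:
    up from start_lr to max_lr, then down to final_lr,
    which is far below the starting point.
    """
    if start_lr is None:
        start_lr = max_lr / 25      # illustrative divisor
    if final_lr is None:
        final_lr = max_lr / 10000   # illustrative divisor
    half = total_steps / 2
    if step <= half:
        return start_lr + (max_lr - start_lr) * step / half
    return max_lr - (max_lr - final_lr) * (step - half) / (total_steps - half)
```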
The counter-intuitive insight is that increasing the learning rate in the first half of training acts as a form of regularization. High learning rates force the model to find wider, flatter minima in the loss landscape, which tend to generalize better than sharp minima.
The one-cycle policy often reaches the same accuracy as other schedules in fewer epochs, a property Smith called super-convergence.
Key Takeaway
For LLM training, use linear warmup followed by cosine decay. For computer vision and smaller models, the one-cycle policy often provides the fastest convergence. Always include warmup when training transformers.
Finding the Right Peak Learning Rate
The learning rate range test, also from Leslie Smith, helps find a good peak learning rate. The procedure is simple: train for a few hundred iterations while exponentially increasing the learning rate from a very small value to a very large one, recording the loss at each step.
Plot loss versus learning rate. The optimal peak learning rate is typically found in the region where the loss is decreasing most rapidly -- before it starts to diverge. This test takes only minutes and can save hours of hyperparameter search.
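The procedure can be sketched as follows. Here `train_step` is a caller-supplied function (an assumption of this sketch) that performs one optimization step at the given rate and returns the loss; the 4x divergence threshold is likewise a judgment call, not part of Smith's prescription. A toy quadratic stands in for a real model:

```python
def lr_range_test(train_step, lr_start=1e-7, lr_end=10.0, num_iters=100):
    """Sweep the learning rate exponentially, recording (lr, loss) pairs.

    Stops early once the loss clearly diverges -- here, when it exceeds
    4x the best loss seen so far (the exact threshold is arbitrary).
    """
    mult = (lr_end / lr_start) ** (1.0 / (num_iters - 1))
    lr, history, best = lr_start, [], float("inf")
    for _ in range(num_iters):
        loss = train_step(lr)  # one optimizer step at this rate
        history.append((lr, loss))
        best = min(best, loss)
        if loss > 4 * best:
            break
        lr *= mult
    return history

# Toy stand-in for a model: minimize w^2 by gradient descent.
w = [5.0]
def toy_step(lr):
    w[0] -= lr * 2 * w[0]  # gradient of w^2 is 2w
    return w[0] ** 2

history = lr_range_test(toy_step)
```

Plotting `history` (loss versus learning rate on a log axis) reproduces the curve described above: flat at tiny rates, falling in the useful range, then diverging.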
For Adam and similar adaptive optimizers, common peak learning rates range from 1e-4 to 3e-4 for LLM pre-training. For SGD with momentum on image classification, peak rates of 0.1 to 0.5 are typical with the one-cycle policy.
Learning rate scheduling is not just an optimization trick. It is a fundamental part of training neural networks well. The right schedule can make your model train faster, generalize better, and converge to a superior solution. It is one of the first things to tune when a model is not training as expected.
