Gradient Accumulation
Simulating larger batch sizes by accumulating gradients across multiple forward-backward passes before updating.
Overview
Gradient accumulation is a technique that simulates training with a large batch size when GPU memory is insufficient. Instead of updating model weights after every mini-batch, gradients are accumulated over N forward-backward passes and a single weight update is applied afterward, giving an effective batch size of N times the mini-batch size. To match true large-batch training, the loss (or gradient) of each mini-batch is typically scaled by 1/N so that the accumulated sum equals the mean over the full effective batch.
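The update schedule described above can be sketched as a plain training-loop skeleton. The helper names below (forward_backward, apply_update) are hypothetical stand-ins for a real framework's calls, used only to show the control flow:

```python
# Sketch of the control flow, assuming a framework where forward_backward()
# adds gradients into an accumulator and apply_update() consumes and clears
# them. All names here are hypothetical stand-ins, not a real API.

def run_training(batches, accum_steps, forward_backward, apply_update):
    """Apply one optimizer update per accum_steps micro-batches."""
    for step, batch in enumerate(batches, start=1):
        forward_backward(batch)       # gradients accumulate; no update yet
        if step % accum_steps == 0:   # every N micro-batches...
            apply_update()            # ...apply one weight update

# Toy usage: count updates for 8 micro-batches with accumulation over N=4.
updates = []
run_training(range(8), 4, lambda b: None, lambda: updates.append(1))
print(len(updates))  # 2 updates: one per 4 accumulated micro-batches
```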
Key Details
This is essential for training large models when the desired batch size does not fit in GPU memory. For example, if your GPU can only fit batch size 4 but you want an effective batch size of 32, you accumulate gradients over 8 steps. With properly scaled losses, the result is mathematically equivalent to true large-batch training for most layers; batch normalization is a notable exception, since its statistics are computed per micro-batch rather than over the full effective batch. The technique is widely used in large language model training.
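The equivalence claim can be checked directly with no framework at all. The sketch below (plain Python, hypothetical toy model) computes the gradient of a mean-squared-error loss for a linear model y = w * x over a full batch of 8 examples, then recomputes it by accumulating 1/N-scaled gradients over 4 micro-batches of 2, and confirms the two match:

```python
# Toy demonstration that accumulating 1/N-scaled micro-batch gradients
# reproduces the full-batch gradient of a mean loss. Pure Python; the
# model and data are illustrative, not from any real training setup.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to scalar weight w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# True large-batch gradient: one pass over all 8 examples.
full_grad = grad_mse(w, xs, ys)

# Gradient accumulation: micro-batches of 2, accumulated over 4 steps.
accum_steps = 4
micro = 2
acc = 0.0
for i in range(accum_steps):
    batch_x = xs[i * micro:(i + 1) * micro]
    batch_y = ys[i * micro:(i + 1) * micro]
    # Scale each micro-batch gradient by 1/accum_steps so the sum
    # equals the mean over the full effective batch.
    acc += grad_mse(w, batch_x, batch_y) / accum_steps

print(abs(acc - full_grad) < 1e-9)  # True: the two gradients match
```

The 1/accum_steps scaling is the step that makes the sums line up; omitting it yields a gradient N times too large, which in practice shows up as an implicitly scaled learning rate.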