Gradient Accumulation
Simulating larger batch sizes by accumulating gradients across multiple forward-backward passes before updating.
Overview
Gradient accumulation is a technique that simulates training with a large batch size when GPU memory is insufficient. Instead of updating model weights after every mini-batch, gradients are accumulated over N forward-backward passes and a single weight update is applied afterward, giving an effective batch size of N times the mini-batch size. To match true large-batch training, the loss (or gradient) of each mini-batch is typically scaled by 1/N so that the accumulated sum equals the mean over the full effective batch.
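The update schedule described above can be sketched as a plain training-loop skeleton. The helper names below (forward_backward, apply_update) are hypothetical stand-ins for a real framework's calls, used only to show the control flow:

```python
# Sketch of the control flow, assuming a framework where forward_backward()
# adds gradients into an accumulator and apply_update() consumes and clears
# them. All names here are hypothetical stand-ins, not a real API.

def run_training(batches, accum_steps, forward_backward, apply_update):
    """Apply one optimizer update per accum_steps micro-batches."""
    for step, batch in enumerate(batches, start=1):
        forward_backward(batch)       # gradients accumulate; no update yet
        if step % accum_steps == 0:   # every N micro-batches...
            apply_update()            # ...apply one weight update

# Toy usage: count updates for 8 micro-batches with accumulation over N=4.
updates = []
run_training(range(8), 4, lambda b: None, lambda: updates.append(1))
print(len(updates))  # 2 updates: one per 4 accumulated micro-batches
```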
Key Details
This is essential for training large models when the desired batch size does not fit in GPU memory. For example, if your GPU can only fit batch size 4 but you want an effective batch size of 32, you accumulate gradients over 8 steps. With properly scaled losses, the result is mathematically equivalent to true large-batch training for most layers; batch normalization is a notable exception, since its statistics are computed per micro-batch rather than over the full effective batch. The technique is widely used in large language model training.
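The equivalence claim can be checked directly with no framework at all. The sketch below (plain Python, hypothetical toy model) computes the gradient of a mean-squared-error loss for a linear model y = w * x over a full batch of 8 examples, then recomputes it by accumulating 1/N-scaled gradients over 4 micro-batches of 2, and confirms the two match:

```python
# Toy demonstration that accumulating 1/N-scaled micro-batch gradients
# reproduces the full-batch gradient of a mean loss. Pure Python; the
# model and data are illustrative, not from any real training setup.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to scalar weight w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# True large-batch gradient: one pass over all 8 examples.
full_grad = grad_mse(w, xs, ys)

# Gradient accumulation: micro-batches of 2, accumulated over 4 steps.
accum_steps = 4
micro = 2
acc = 0.0
for i in range(accum_steps):
    batch_x = xs[i * micro:(i + 1) * micro]
    batch_y = ys[i * micro:(i + 1) * micro]
    # Scale each micro-batch gradient by 1/accum_steps so the sum
    # equals the mean over the full effective batch.
    acc += grad_mse(w, batch_x, batch_y) / accum_steps

print(abs(acc - full_grad) < 1e-9)  # True: the two gradients match
```

The 1/accum_steps scaling is the step that makes the sums line up; omitting it yields a gradient N times too large, which in practice shows up as an implicitly scaled learning rate.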