Gradient Checkpointing
Trading compute for memory by recomputing intermediate activations during backpropagation instead of storing them.
Overview
Gradient checkpointing is a memory optimization technique that reduces the memory footprint of training deep neural networks by storing only a subset of intermediate activations during the forward pass. During backpropagation, the discarded activations are recomputed from the nearest checkpoint.
Key Details
This trades additional computation (roughly one extra forward pass, commonly cited as ~30% overhead) for a dramatic reduction in activation memory: with checkpoints placed every √n layers, peak activation storage grows as O(√n) in the number of layers n rather than O(n). It enables training much deeper models, or using larger batch sizes, on the same hardware. Gradient checkpointing is widely used when training large transformers and is supported natively in PyTorch (torch.utils.checkpoint) and TensorFlow (tf.recompute_grad).
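To make the recompute-from-checkpoint idea concrete, here is a minimal framework-free sketch. The `layer` function, the segment size, and the toy arithmetic are all invented for illustration; in PyTorch the same mechanism is exposed via torch.utils.checkpoint.checkpoint, which handles the recomputation automatically inside autograd.

```python
def layer(x, i):
    # Stand-in for layer i's forward computation (toy arithmetic).
    return x * 2 + i

def forward_with_checkpoints(x, n_layers, segment):
    """Run n_layers, storing only every `segment`-th activation.

    Returns the final output and the dict of stored checkpoints
    (layer index -> activation entering that layer).
    """
    checkpoints = {0: x}
    for i in range(n_layers):
        x = layer(x, i)
        if (i + 1) % segment == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def recompute(checkpoints, target, segment):
    """Recompute the activation entering layer `target` by replaying
    the forward pass from the nearest checkpoint at or before it,
    as the backward pass would during gradient checkpointing."""
    start = (target // segment) * segment
    x = checkpoints[start]
    for i in range(start, target):
        x = layer(x, i)
    return x

# With 8 layers and a segment size of 4, only 3 activations are
# stored instead of 9; asking for layer 6's input replays just
# layers 4 and 5 from the checkpoint at layer 4.
out, cps = forward_with_checkpoints(1.0, n_layers=8, segment=4)
```

The √n memory result mentioned above corresponds to choosing `segment` ≈ √n: the stored checkpoints and the activations recomputed within one segment each number about √n.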
Related Concepts
backpropagation • distributed training • mixed precision training