Optimizer State
The additional memory used by optimization algorithms to track momentum, adaptive learning rates, and other per-parameter statistics beyond the model weights themselves.
Memory Impact
The Adam optimizer stores two additional values per parameter: the first- and second-moment estimates. For a 7B-parameter model with FP32 optimizer state, that is 56 GB for optimizer state alone (2 values × 7B parameters × 4 bytes), on top of the weights and gradients. This is why training requires much more memory than inference.
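The arithmetic above can be checked with a short sketch (the function name and defaults are illustrative, not from any library):

```python
def optimizer_state_bytes(num_params, states_per_param=2, bytes_per_value=4):
    """Memory for optimizer state alone.

    Adam keeps two extra values per parameter (first and second
    moment estimates); in FP32 each value takes 4 bytes.
    """
    return num_params * states_per_param * bytes_per_value

# 7B-parameter model, Adam with FP32 state:
adam_fp32 = optimizer_state_bytes(7e9)
print(f"{adam_fp32 / 1e9:.0f} GB")  # 2 * 7e9 * 4 bytes = 56 GB
```

The same function covers other optimizers: plain SGD with momentum keeps one value per parameter, so `states_per_param=1` halves the figure.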
Optimization
Adafactor reduces optimizer memory by storing a factorized approximation of the second moments instead of one value per parameter. 8-bit Adam quantizes optimizer states to 8 bits. ZeRO (used in DeepSpeed) shards optimizer state across GPUs, so each GPU holds only a fraction of it.
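A rough sketch of how two of these options change the footprint for the 7B example, using illustrative per-parameter byte counts (Adafactor is omitted because its factorized storage depends on layer shapes; 8-bit Adam's small per-block scaling factors are also ignored):

```python
NUM_PARAMS = 7e9
GB = 1e9

# Illustrative optimizer-state sizes, in bytes per parameter.
state_bytes = {
    "Adam (FP32 moments)": 2 * 4,  # two FP32 moments per parameter
    "8-bit Adam": 2 * 1,           # two quantized (INT8) moments per parameter
}

for name, bpp in state_bytes.items():
    total_gb = NUM_PARAMS * bpp / GB
    print(f"{name}: {total_gb:.0f} GB total")
    # ZeRO shards optimizer state, so the per-GPU share drops linearly.
    for n_gpus in (2, 8):
        print(f"  sharded across {n_gpus} GPUs: {total_gb / n_gpus:.2f} GB each")
```

Combining techniques compounds the savings: 8-bit Adam sharded across 8 GPUs brings the 56 GB figure down to under 2 GB per device.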