AI Glossary

Optimizer State

The additional memory used by optimization algorithms to track momentum, adaptive learning rates, and other per-parameter statistics beyond the model weights themselves.

Memory Impact

The Adam optimizer stores 2 additional values per parameter (the first and second moment estimates). For a 7B-parameter model with FP32 optimizer state, that is 56 GB for optimizer state alone (2 * 7B * 4 bytes), on top of the 28 GB of weights. This is why training requires far more memory than inference.
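The arithmetic above can be sketched as a back-of-the-envelope calculation (the function name is illustrative):

```python
def adam_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Adam keeps two extra values (first and second moment) per parameter."""
    return 2 * num_params * bytes_per_value

# 7B parameters, FP32 moments (4 bytes each):
gb = adam_state_bytes(7_000_000_000) / 1e9
print(f"{gb:.0f} GB")  # 56 GB
```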

Optimization

Adafactor reduces optimizer memory by factorizing the second-moment matrix into row and column statistics. 8-bit Adam quantizes optimizer states from 32 bits to 8 bits per value. ZeRO stage 1 (used in DeepSpeed) shards optimizer state across data-parallel GPUs, so each GPU holds only a fraction of it.
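A rough comparison of these techniques for the same 7B-parameter model (the matrix shape and GPU count are hypothetical, chosen only to make the numbers concrete):

```python
n = 7_000_000_000  # parameters

adam_fp32 = 2 * n * 4  # two FP32 moments per parameter
adam_8bit = 2 * n * 1  # same moments quantized to 1 byte each

# Adafactor stores the second moment as row + column statistics.
# Illustrative: treat all parameters as a single rows x cols matrix.
rows, cols = 70_000, 100_000  # rows * cols == n (hypothetical shape)
adafactor_2nd_moment = (rows + cols) * 4

zero1_per_gpu = adam_fp32 / 8  # ZeRO-1 shard across 8 GPUs (hypothetical)

for name, b in [("Adam FP32", adam_fp32), ("Adam 8-bit", adam_8bit),
                ("Adafactor 2nd moment", adafactor_2nd_moment),
                ("ZeRO-1 per GPU", zero1_per_gpu)]:
    print(f"{name}: {b / 1e9:.4f} GB")
```

The factored second moment shrinks from O(rows * cols) to O(rows + cols) values, which is why Adafactor's savings dwarf simple quantization.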

Last updated: March 5, 2026