ZeRO Optimizer
A memory optimization that partitions optimizer states, gradients, and parameters across GPUs.
Overview
ZeRO (Zero Redundancy Optimizer), developed by Microsoft as part of the DeepSpeed library, dramatically reduces memory redundancy in data-parallel training. Standard data parallelism replicates the full model state on every GPU; ZeRO instead partitions that state across GPUs in three cumulative stages: optimizer states (ZeRO-1), gradients as well (ZeRO-2), and model parameters as well (ZeRO-3), communicating shards only when they are needed.
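The partitioning idea can be illustrated with a minimal single-process sketch of ZeRO-1. This is a hypothetical toy simulation, not the DeepSpeed API: the names (`world_size`, `local_step`, the momentum-SGD optimizer) are illustrative assumptions, and the point is only that each rank stores optimizer state for 1/world_size of the parameters.

```python
# Toy sketch of ZeRO-1 sharding (hypothetical, not the DeepSpeed API):
# the optimizer state (here, SGD momentum) is partitioned so that each of
# `world_size` simulated ranks stores only its own 1/world_size shard.
world_size = 4
num_params = 8
params = [0.0] * num_params
grads = [1.0] * num_params          # stand-in for already all-reduced gradients
shard = num_params // world_size    # parameters owned by each rank

# Each rank keeps momentum only for its shard: this is the ZeRO-1 memory saving.
momenta = [[0.0] * shard for _ in range(world_size)]

def local_step(rank, lr=0.1, beta=0.9):
    """A rank updates only the parameter shard whose optimizer state it owns."""
    lo = rank * shard
    for i in range(shard):
        momenta[rank][i] = beta * momenta[rank][i] + grads[lo + i]
        params[lo + i] -= lr * momenta[rank][i]

# Simulate every rank taking its step; in a real run an all-gather would then
# broadcast each updated parameter shard back to all ranks.
for r in range(world_size):
    local_step(r)
```

With stage 2 the gradients would also be sharded after a reduce-scatter, and with stage 3 the parameters themselves, gathered on demand during forward and backward passes.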
Key Details
ZeRO-3 has been used to train models with over a trillion parameters across hundreds of GPUs with near-linear scaling. ZeRO-Offload extends this by moving optimizer states and the optimizer update computation to CPU memory, and ZeRO-Infinity further adds NVMe SSD offloading. ZeRO is a core component of Microsoft's DeepSpeed library and is widely used for training large language models.
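In DeepSpeed these features are enabled through the configuration file rather than code changes. The sketch below shows a ZeRO-style config as a Python dict; the field names follow the DeepSpeed JSON schema as commonly documented, but exact keys and accepted values vary by version, so treat this as an assumption to check against the DeepSpeed docs (the `nvme_path` value is a placeholder).

```python
# Hedged example of a DeepSpeed-style ZeRO config; verify key names against
# your installed DeepSpeed version's configuration reference.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,                                 # ZeRO-3: shard params too
        "offload_optimizer": {"device": "cpu"},     # ZeRO-Offload to CPU memory
        "offload_param": {
            "device": "nvme",                       # ZeRO-Infinity to NVMe SSD
            "nvme_path": "/local_nvme",             # placeholder path
        },
    },
}
```

Such a dict (or the equivalent JSON file) is typically passed to `deepspeed.initialize`, which wraps the model and optimizer with the requested ZeRO stage.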
Related Concepts
data parallelism • distributed training • pipeline parallelism