AI Glossary

ZeRO Optimizer

A memory optimization that partitions optimizer states, gradients, and parameters across GPUs.

Overview

ZeRO (Zero Redundancy Optimizer), developed by Microsoft for DeepSpeed, dramatically reduces memory redundancy in data-parallel training. Standard data parallelism replicates the full model state on every GPU; ZeRO instead partitions the optimizer states (ZeRO-1), the gradients as well (ZeRO-2), and finally the model parameters themselves (ZeRO-3) across GPUs, communicating shards only when a computation actually needs them.
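The partitioning idea behind ZeRO-1 can be illustrated with a minimal single-process simulation. This is a hypothetical sketch, not DeepSpeed's implementation: the `world_size`, `zero1_step`, and shard names are invented for illustration, and real ZeRO uses reduce-scatter and all-gather collectives instead of in-process slicing.

```python
import numpy as np

# Hypothetical single-process simulation of ZeRO-1 (optimizer-state
# partitioning). In standard data parallelism, EVERY rank would hold
# Adam's m and v buffers for all parameters; under ZeRO-1 each rank
# holds them only for its own 1/world_size shard.
world_size = 4
params = np.zeros(16)                      # full model, replicated on each rank
shard_size = params.size // world_size     # 4 parameters per rank

m_shards = [np.zeros(shard_size) for _ in range(world_size)]
v_shards = [np.zeros(shard_size) for _ in range(world_size)]

def zero1_step(grads, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step where each rank updates only its parameter shard."""
    global params
    new_shards = []
    for rank in range(world_size):
        s = slice(rank * shard_size, (rank + 1) * shard_size)
        g = grads[s]  # in real ZeRO this shard arrives via reduce-scatter
        m_shards[rank] = b1 * m_shards[rank] + (1 - b1) * g
        v_shards[rank] = b2 * v_shards[rank] + (1 - b2) * g * g
        new_shards.append(
            params[s] - lr * m_shards[rank] / (np.sqrt(v_shards[rank]) + eps)
        )
    # "all-gather": every rank reassembles the full updated parameter vector
    params = np.concatenate(new_shards)

zero1_step(np.ones(16))
```

The memory saving is the point: per rank, Adam state shrinks from `2 * 16` floats to `2 * 4`, and the factor grows linearly with the number of GPUs.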

Key Details

ZeRO-3 can train models with trillions of parameters across hundreds of GPUs with near-linear scaling. ZeRO-Offload extends this by offloading optimizer states and optimizer computation to CPU memory, and ZeRO-Infinity adds NVMe SSD offloading on top of that. ZeRO is a core component of Microsoft's DeepSpeed library and is widely used for training large language models.
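In DeepSpeed, these features are enabled through the `zero_optimization` section of the JSON training config. The snippet below is an illustrative sketch with made-up batch size and paths, combining ZeRO-3 with CPU optimizer offload and NVMe parameter offload (ZeRO-Infinity); consult the DeepSpeed configuration reference for the full set of options.

```json
{
  "train_batch_size": 256,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```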

Related Concepts

data parallelism, distributed training, pipeline parallelism

Last updated: March 5, 2026