AI Glossary

Adam Optimizer

An adaptive learning rate optimization algorithm that combines momentum and RMSProp, widely used as the default optimizer for training neural networks.

How It Works

Adam maintains per-parameter learning rates adapted using exponentially decaying estimates of the first moment (mean) and second moment (uncentered variance) of the gradients. Because each parameter's step is divided by the square root of its second-moment estimate, parameters with consistently small or infrequent gradients take larger effective steps, while parameters with large, frequent gradients take smaller ones.
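The update described above can be sketched in a few lines of NumPy. This is a minimal illustration with the standard default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8); the function name and signature are illustrative, not from any particular library.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count; m, v are running moments."""
    # Exponentially decaying estimates of the first moment (mean)
    # and second moment (uncentered variance) of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: counteracts m and v being initialized at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step: a small second moment means a larger effective step
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

For example, minimizing f(x) = x^2 by repeatedly calling adam_step with grad = 2 * theta drives theta toward zero, with each coordinate's step size adapting to its own gradient history.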

Variants

AdamW: Decouples weight decay from the adaptive gradient update. In the original Adam, L2 regularization is folded into the gradient and therefore gets rescaled by the second-moment estimate, weakening the decay exactly where gradients are large; AdamW applies the decay directly to the weights instead. Now the standard for transformer training.

Adafactor: Memory-efficient variant that replaces the full per-parameter second-moment estimate with factored row and column statistics, substantially reducing optimizer memory for large weight matrices.
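The AdamW change is small enough to show directly: the decay term is added to the parameter update itself rather than to the gradient, so it is never rescaled by the second-moment estimate. A minimal sketch, with an illustrative function name and signature rather than any library's API:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: Adam's adaptive step plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay: weight_decay * theta is applied to the weights directly,
    # NOT mixed into grad, so the adaptive scaling never touches it
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

One consequence of the decoupling: even with a zero gradient, the weights still shrink by lr * weight_decay per step, whereas L2-through-the-gradient decay would be distorted by the moment estimates.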


Last updated: March 5, 2026