Checkpoint
A saved snapshot of a model's weights and training state at a specific point during training, enabling recovery from failures and selection of the best-performing version.
Why Checkpoints Matter
Training large models can take days or weeks. Checkpoints protect against hardware failures, allow resuming interrupted training, and enable selecting the best model from different points in training (e.g., before overfitting began).
What's Saved
Model weights, optimizer state (momentum buffers, adaptive learning-rate statistics), current epoch/step, learning rate scheduler state, and training configuration. Together these allow training to resume exactly from any checkpoint, as if it had never been interrupted.
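A minimal sketch of this idea, using a plain Python dictionary and pickle in place of a real framework's serialization (in PyTorch, for example, the entries would come from model.state_dict() and optimizer.state_dict(); all names and values here are hypothetical placeholders):

```python
import pickle

# Hypothetical training state; in a real framework these entries
# would be tensors and framework-specific state dicts.
checkpoint = {
    "model_state": {"layer1.weight": [0.1, -0.2], "layer1.bias": [0.0]},
    "optimizer_state": {"momentum": {"layer1.weight": [0.01, 0.02]}},
    "scheduler_state": {"last_lr": 3e-4},
    "epoch": 12,
    "global_step": 48_000,
    "config": {"batch_size": 32, "lr": 3e-4},
}

# Save the full training state, not just the weights.
with open("checkpoint_step48000.pkl", "wb") as f:
    pickle.dump(checkpoint, f)

# Resume: restore everything so training continues where it stopped.
with open("checkpoint_step48000.pkl", "rb") as f:
    restored = pickle.load(f)

start_epoch = restored["epoch"] + 1
```

The key point is that the optimizer and scheduler state travel with the weights; restoring weights alone would reset momentum and the learning-rate schedule, changing the training trajectory.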
Best Practices
Save checkpoints at regular intervals and whenever validation performance improves. Keep the best N checkpoints by validation metric and delete the rest to bound disk usage. For large models, checkpoint sharding splits the save across multiple files, often one shard per device or worker, so no single write becomes a bottleneck.
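The keep-best-N policy can be sketched with a small helper class (a hypothetical illustration; libraries such as PyTorch Lightning provide equivalents like ModelCheckpoint):

```python
import os
import pickle
import tempfile

class CheckpointManager:
    """Keep only the best `keep_n` checkpoints by validation metric."""

    def __init__(self, directory, keep_n=3, higher_is_better=True):
        self.directory = directory
        self.keep_n = keep_n
        self.higher_is_better = higher_is_better
        self.saved = []  # list of (metric, path) pairs currently on disk

    def save(self, step, metric, state):
        path = os.path.join(self.directory, f"ckpt_step{step}.pkl")
        with open(path, "wb") as f:
            pickle.dump(state, f)
        self.saved.append((metric, path))
        # Sort best-first, then delete any checkpoints beyond keep_n.
        self.saved.sort(key=lambda t: t[0], reverse=self.higher_is_better)
        for _, stale_path in self.saved[self.keep_n:]:
            os.remove(stale_path)
        self.saved = self.saved[:self.keep_n]

# Usage with made-up validation accuracies:
tmp_dir = tempfile.mkdtemp()
mgr = CheckpointManager(tmp_dir, keep_n=2)
for step, acc in [(100, 0.71), (200, 0.78), (300, 0.75), (400, 0.81)]:
    mgr.save(step, acc, {"step": step})
# Only the two best checkpoints (accuracy 0.81 and 0.78) remain on disk.
```

Pruning immediately after each save keeps disk usage bounded even over very long runs.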