Checkpoint
A saved snapshot of a model's weights and training state at a specific point during training, enabling recovery from failures and selection of the best-performing version.
Why Checkpoints Matter
Training large models can take days or weeks. Checkpoints protect against hardware failures, allow resuming interrupted training, and enable selecting the best model from different points in training (e.g., before overfitting began).
What's Saved
Model weights, optimizer state (momentum buffers, adaptive learning-rate statistics), current epoch/step, learning rate scheduler state, and training configuration. Together these allow training to resume exactly from any checkpoint, as if it had never been interrupted.
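A minimal sketch of this idea, using a plain Python dictionary and pickle in place of a real framework's serialization (in PyTorch, for example, the entries would come from model.state_dict() and optimizer.state_dict(); all names and values here are hypothetical placeholders):

```python
import pickle

# Hypothetical training state; in a real framework these entries
# would be tensors and framework-specific state dicts.
checkpoint = {
    "model_state": {"layer1.weight": [0.1, -0.2], "layer1.bias": [0.0]},
    "optimizer_state": {"momentum": {"layer1.weight": [0.01, 0.02]}},
    "scheduler_state": {"last_lr": 3e-4},
    "epoch": 12,
    "global_step": 48_000,
    "config": {"batch_size": 32, "lr": 3e-4},
}

# Save the full training state, not just the weights.
with open("checkpoint_step48000.pkl", "wb") as f:
    pickle.dump(checkpoint, f)

# Resume: restore everything so training continues where it stopped.
with open("checkpoint_step48000.pkl", "rb") as f:
    restored = pickle.load(f)

start_epoch = restored["epoch"] + 1
```

The key point is that the optimizer and scheduler state travel with the weights; restoring weights alone would reset momentum and the learning-rate schedule, changing the training trajectory.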
Best Practices
Save checkpoints at regular intervals and whenever validation performance improves. Keep the best N checkpoints by validation metric and delete the rest to bound disk usage. For large models, checkpoint sharding splits the save across multiple files, often one shard per device or worker, so no single write becomes a bottleneck.
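The keep-best-N policy can be sketched with a small helper class (a hypothetical illustration; libraries such as PyTorch Lightning provide equivalents like ModelCheckpoint):

```python
import os
import pickle
import tempfile

class CheckpointManager:
    """Keep only the best `keep_n` checkpoints by validation metric."""

    def __init__(self, directory, keep_n=3, higher_is_better=True):
        self.directory = directory
        self.keep_n = keep_n
        self.higher_is_better = higher_is_better
        self.saved = []  # list of (metric, path) pairs currently on disk

    def save(self, step, metric, state):
        path = os.path.join(self.directory, f"ckpt_step{step}.pkl")
        with open(path, "wb") as f:
            pickle.dump(state, f)
        self.saved.append((metric, path))
        # Sort best-first, then delete any checkpoints beyond keep_n.
        self.saved.sort(key=lambda t: t[0], reverse=self.higher_is_better)
        for _, stale_path in self.saved[self.keep_n:]:
            os.remove(stale_path)
        self.saved = self.saved[:self.keep_n]

# Usage with made-up validation accuracies:
tmp_dir = tempfile.mkdtemp()
mgr = CheckpointManager(tmp_dir, keep_n=2)
for step, acc in [(100, 0.71), (200, 0.78), (300, 0.75), (400, 0.81)]:
    mgr.save(step, acc, {"step": step})
# Only the two best checkpoints (accuracy 0.81 and 0.78) remain on disk.
```

Pruning immediately after each save keeps disk usage bounded even over very long runs.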