Batch Size
The number of training examples processed together in one forward/backward pass during neural network training.
Why It Matters
Batch size affects training speed, memory usage, and model quality. Larger batches improve GPU utilization and yield lower-variance gradient estimates, but require more memory. Smaller batches introduce gradient noise that can act as an implicit regularizer and improve generalization.
Common Choices
Typical batch sizes range from 16 to 512 for most tasks. LLM training uses much larger batches (thousands to millions of tokens). The optimal batch size depends on the model, dataset, and available hardware.
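One practical consequence of the choice: for a fixed dataset, a larger batch size means fewer optimizer steps per epoch. A quick sketch (the dataset size of 50,000 is an illustrative assumption, roughly the CIFAR-10 training set):

```python
import math

# Hypothetical dataset size, chosen for illustration.
dataset_size = 50_000

# Steps per epoch shrinks as batch size grows; the last batch
# may be smaller than batch_size, hence the ceiling division.
for batch_size in (16, 64, 256, 512):
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    print(f"batch_size={batch_size:>4} -> {steps_per_epoch} steps/epoch")
```

Fewer steps per epoch is why large-batch training often pairs a bigger batch with a proportionally higher learning rate to keep total progress per epoch comparable.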
Mini-Batch Gradient Descent
In practice, almost all training uses mini-batches -- a compromise between stochastic gradient descent (updating on one example at a time) and full-batch gradient descent (computing the gradient over the entire dataset at once). Mini-batches balance computational efficiency with the regularizing effect of gradient noise.
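The mini-batch loop above can be sketched in pure Python with a toy 1-D linear regression. Everything here (the synthetic data, the learning rate, the batch size of 16) is an illustrative assumption, not a recipe:

```python
import random

random.seed(0)

# Synthetic data from y = 3x + noise; the goal is to recover w ~= 3.
data = [(x, 3.0 * x + random.gauss(0, 0.1))
        for x in [i / 50 for i in range(100)]]

def minibatches(dataset, batch_size):
    """Shuffle the dataset and yield it in chunks of batch_size."""
    shuffled = dataset[:]
    random.shuffle(shuffled)
    for i in range(0, len(shuffled), batch_size):
        yield shuffled[i:i + batch_size]

w = 0.0           # single weight to learn
lr = 0.1          # learning rate (illustrative choice)
batch_size = 16   # mini-batch size (illustrative choice)

for epoch in range(50):
    for batch in minibatches(data, batch_size):
        # Gradient of mean squared error over just this mini-batch:
        # each update uses a noisy estimate of the full-dataset gradient.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad

print(f"learned w = {w:.2f}")
```

Setting `batch_size = 1` makes this stochastic gradient descent, and `batch_size = len(data)` makes it full-batch gradient descent; the mini-batch setting sits between the two, trading gradient noise for per-step cost.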