Scaling Law
Empirical relationships showing that model loss improves predictably, following a power law in compute, dataset size, and parameter count.
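The power-law form can be sketched as a one-line function. The constants below are illustrative values of the kind reported for parameter-count scaling; treat them as placeholders for whatever a particular fit produces.

```python
# Illustrative scaling-law curve: L(N) = (N_c / N) ** alpha,
# where N is parameter count. N_c and alpha here are hypothetical
# constants chosen only to show the shape of the relationship.
def loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha
```

The key property is monotone improvement: each order of magnitude of scale cuts the loss by a roughly constant factor, which is a straight line on a log-log plot.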
Key Findings
Kaplan et al. (2020) showed that loss falls as a power law in model size, dataset size, and training compute. Chinchilla (Hoffmann et al., 2022) showed that most large models were undertrained: compute-optimal training scales data and parameters roughly equally, at about 20 training tokens per parameter.
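The Chinchilla rule of thumb can be turned into a small allocation calculator. It assumes the standard approximation that training cost is C ≈ 6·N·D FLOPs and the roughly 20-tokens-per-parameter ratio; both are approximations from the paper, not exact constants.

```python
import math

def chinchilla_allocation(flops):
    # Compute-optimal split under C ~= 6 * N * D with D ~= 20 * N:
    # C = 6 * N * (20 * N) = 120 * N^2, so N = sqrt(C / 120).
    n_params = math.sqrt(flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens
```

Plugging in Chinchilla's own budget of about 5.76e23 FLOPs gives roughly 70B parameters and 1.4T tokens, matching the model actually trained.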
Implications
Scaling laws let practitioners predict the performance of larger models before training them, and they guide how a fixed compute budget is split between parameters and data. The 'Bitter Lesson' (Rich Sutton, 2019) argues that general methods that scale with compute consistently outperform hand-engineered approaches.
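Predicting a larger model's performance amounts to fitting a line in log-log space on small-model runs and extrapolating. A minimal sketch, using synthetic data points generated from an assumed power law purely to illustrate the procedure:

```python
import math

def fit_power_law(sizes, losses):
    # Least-squares line in log-log space: log L = intercept + slope * log N.
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

def predict_loss(n_params, slope, intercept):
    # Extrapolate the fitted law: L(N) = exp(intercept) * N ** slope.
    return math.exp(intercept + slope * math.log(n_params))
```

In practice the fit is done on real training runs spanning a couple of orders of magnitude, and the extrapolation is what makes compute planning for the next, larger model possible.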