Scaling Laws for Neural Language Models
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.
¶
Power laws over hype
¶
Kaplan et al. (2020) measured the cross-entropy loss of language models across many scales and found smooth power-law relationships. Larger models, more data, and more compute each reduce loss predictably, as long as the other two factors are not the bottleneck.
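A power law is a straight line on log-log axes, so its exponent can be recovered with ordinary least squares on the logarithms. The sketch below is only an illustration of that idea: the function name `fit_power_law`, the model sizes, and the constants 5.0 and 0.07 are made up for the example and are not values from the paper.

```python
import math


def fit_power_law(xs, losses):
    """Fit loss = a * x**(-alpha) by least squares on (log x, log loss).

    Returns the pair (a, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line in log-log space.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A line log(loss) = intercept + slope * log(x) is the power law
    # loss = exp(intercept) * x**slope, so alpha = -slope.
    return math.exp(intercept), -slope


# Synthetic, noise-free "measurements" (hypothetical, not from the paper).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.07 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

With real measurements the points will not sit exactly on a line, and the fit only holds inside the range where the power law itself holds.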
¶
Implications for training
¶
If loss is predictable, so is the budget. By fitting the curve on small runs and extrapolating, you can estimate how much compute a target loss will require before spending it. OpenAI and others used these curves to justify training runs that would have looked irrational judged from smaller-scale results alone.
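As a rough sketch of that budgeting step, a fitted curve of the form loss = (c_scale / compute) ** alpha can be inverted to give the compute needed for a target loss. The functional form follows the power-law shape described above. The helper names and the constants `C_SCALE` and `ALPHA` are placeholders chosen for illustration, not the paper's fitted values.

```python
def loss_at_compute(compute, c_scale, alpha):
    """Predicted loss under the assumed form loss = (c_scale / compute) ** alpha."""
    return (c_scale / compute) ** alpha


def compute_for_target_loss(target_loss, c_scale, alpha):
    """Invert the power law: compute = c_scale * target_loss ** (-1 / alpha)."""
    if target_loss <= 0 or c_scale <= 0 or alpha <= 0:
        raise ValueError("target_loss, c_scale and alpha must all be positive")
    return c_scale * target_loss ** (-1.0 / alpha)


# Placeholder constants in arbitrary compute units (assumptions, not fitted values).
C_SCALE = 1.0e8
ALPHA = 0.05

needed = compute_for_target_loss(2.0, C_SCALE, ALPHA)
needed_lower = compute_for_target_loss(1.8, C_SCALE, ALPHA)
print(f"compute for loss 2.0: {needed:.1f}")
print(f"extra factor to reach loss 1.8: {needed_lower / needed:.1f}x")
```

Because the exponent is small, its reciprocal is large: with these placeholder numbers, lowering the target loss from 2.0 to 1.8 multiplies the required compute by about 8. That sensitivity is why the estimate is worth making before the run rather than after.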
¶
Limits of the law
¶
Scaling laws describe average loss, not every capability. Some behaviors can appear abruptly even while the loss curve stays smooth. Data quality, architecture, and inference cost matter too. Still, this paper is the spreadsheet behind the scaling era.