Papyros

Scaling Laws for Neural Language Models

The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.

Power laws over hype

Kaplan et al. measured language-model loss across a wide range of model sizes, dataset sizes, and compute budgets, and found smooth power-law relationships. Bigger models, more data, and more compute each reduce loss predictably, as long as the other two factors are not the bottleneck.
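To make "power law" concrete, here is a minimal sketch assuming the single-variable form L(x) = (x_c / x)^alpha, where x stands for model size, dataset size, or compute. The constants X_C and ALPHA are illustrative placeholders, not the paper's fitted values, and the function names are hypothetical.

```python
import math


def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Loss predicted by the assumed form L(x) = (x_c / x) ** alpha."""
    if x <= 0 or x_c <= 0:
        raise ValueError("x and x_c must be positive")
    return (x_c / x) ** alpha


def loss_ratio_per_doubling(alpha: float) -> float:
    """Factor by which loss is multiplied each time x doubles: 2 ** -alpha."""
    return 2.0 ** -alpha


# Placeholder constants for illustration only (NOT taken from the paper).
X_C = 1e14
ALPHA = 0.08

if __name__ == "__main__":
    for x in (1e6, 1e7, 1e8, 1e9):
        print(f"x={x:.0e}  loss={power_law_loss(x, X_C, ALPHA):.3f}")
    ratio = loss_ratio_per_doubling(ALPHA)
    print(f"each doubling of x multiplies loss by {ratio:.4f}")
```

The point of the sketch is the shape, not the numbers: under a power law every doubling of x buys the same fractional reduction in loss, which is why the curves look like straight lines on log-log axes.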

Implications for training

If loss is predictable, so is the budget: you can estimate how much compute a target loss requires before spending it. OpenAI and others used these curves to justify training runs that would have looked irrational judged from small-scale results alone.
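The budgeting argument can be sketched as follows: fit a straight line to a few small pilot runs in log-log space, then invert the fit to get the compute a target loss would need. This is a sketch under stated assumptions, not the paper's procedure: it assumes a pure power law with no irreducible-loss offset, the pilot data is synthetic, and fit_power_law and compute_for_target are hypothetical helper names.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = intercept - alpha * log(x).

    Returns (alpha, intercept).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept


def compute_for_target(target_loss, alpha, intercept):
    """Invert log(L) = intercept - alpha * log(C) to solve for C."""
    return math.exp((intercept - math.log(target_loss)) / alpha)


# Synthetic pilot runs generated from a made-up law L = 4.0 * C ** -0.05.
# Units of compute are arbitrary; none of these numbers come from the paper.
pilot_compute = [1e-3, 1e-2, 1e-1, 1.0]
pilot_loss = [4.0 * c ** -0.05 for c in pilot_compute]

alpha, intercept = fit_power_law(pilot_compute, pilot_loss)
needed = compute_for_target(3.0, alpha, intercept)

if __name__ == "__main__":
    print(f"fitted alpha = {alpha:.4f}")
    print(f"compute needed for loss 3.0 = {needed:.1f} (arbitrary units)")
```

The estimate is an extrapolation beyond the fitted range, so it is only as trustworthy as the assumption that the same power law keeps holding at the larger scale.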

Limits of the law

Scaling laws describe average loss, not individual capabilities: a specific behavior can appear abruptly even while the loss curve stays smooth. Data quality, architecture, and inference cost matter too. Still, this paper is the spreadsheet behind the scaling era.