Language Models are Few-Shot Learners

Scale as capability

Brown et al. (2020) trained GPT-3, an autoregressive language model with 175 billion parameters, and found that performance on many tasks improves smoothly as model size grows. The model does not need to be fine-tuned for each new benchmark; a short prompt containing a few examples is often enough.

In-context learning

The surprising behavior is few-shot learning: the model reads a task description and a handful of input-output pairs in its context window, then completes the next item. No weight update occurs. The model picks up the task from the prompt alone, at inference time.
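
To make the mechanics concrete, here is a minimal sketch of how such a prompt can be assembled as plain text. The helper name build_few_shot_prompt, the " => " separator, and the one-pair-per-line layout are assumptions made for illustration, loosely modeled on the English-to-French example used in the paper; this is not the authors' code, and no model is called here.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    task_description: str,
    examples: Sequence[Tuple[str, str]],
    query: str,
    separator: str = " => ",  # assumed delimiter, chosen for illustration
) -> str:
    """Assemble a few-shot prompt as plain text (hypothetical helper).

    The result is a task description, one line per demonstration pair,
    and a final line holding the query with its answer left blank.
    """
    lines = [task_description]
    for source, target in examples:
        # Each demonstration shows an input and its expected output.
        lines.append(f"{source}{separator}{target}")
    # The last line ends right after the separator, so the model's
    # continuation is the answer. Trailing whitespace is stripped.
    lines.append(f"{query}{separator}".rstrip())
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
print(prompt)
```

All of the "learning" lives in that string: the same frozen model receives a different task simply by receiving different text. Passing an empty list of examples gives the zero-shot variant, and a single pair gives the one-shot variant, two settings the paper evaluates alongside few-shot.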

What it changed

GPT-3 made scale feel like a product strategy. It also exposed the limits of prompt-only learning: strong on many benchmarks, brittle on others, and expensive to run. Every later frontier model inherits this paper's question: what does scale buy you, and what still requires training?