Sparks of Artificial General Intelligence: Early experiments with GPT-4

Beyond benchmarks

Bubeck et al. did not rely on a single score. They probed GPT-4 with tasks requiring planning, abstraction, and tool use. The paper documents behaviors that look like reasoning even though the underlying mechanism remains opaque.

The AGI framing

The title provoked debate. Critics said the evaluation was anecdotal; supporters said standard benchmarks had already saturated. The paper's value is descriptive: it catalogues what an early version of GPT-4 could do, based on experiments run before the model's public release in March 2023.

What remains open

"Sparks" is the careful word. The model still hallucinates, forgets, and fails on adversarial prompts. The paper asks whether its capabilities cluster into something like general intelligence or amount to a mosaic of narrow tricks. The question outlived the hype.