Attention Is All You Need

The argument

For most of the 2010s, sequence modeling meant recurrence. To read a sentence, a network read it the way we imagine we read, one token after another, carrying a hidden state forward like a candle through a tunnel. The candle was the bottleneck. Recurrence forced computation to be sequential, and sequential computation does not parallelize.

The Transformer makes a different wager: that the relationships between positions in a sequence matter more than the order in which a machine visits them. Replace the candle with a room full of lamps, all lit at once. Every token can attend to every other token directly, in a single step.

Self-attention, plainly

Each token emits three vectors, a query, a key, and a value. The query asks a question; keys answer how relevant each other token is; values carry the content to be mixed in. The relevance scores are a softmax over scaled dot products:

Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V

The √d_k scaling is not cosmetic. Without it, dot products in high dimensions grow large, the softmax saturates, and gradients vanish. A small denominator keeps the room of lamps from blowing out.

Why it mattered

Parallelism. Attention over a sequence is a matrix multiply. GPUs were waiting for exactly this shape of problem.
Distance is free. A recurrent network pays a cost proportional to the gap between two related words. Attention pays the same cost for neighbors and for words a thousand tokens apart.
It composed. Stack the blocks, add more data, and the curve kept bending. The architecture turned out to be a substrate, not a model.

The title was a joke that became a prophecy. Attention was, in fact, very nearly all you needed.

What to read it against

Read this beside Bahdanau's earlier work on attention for translation, and then forward into the scaling-law literature that treated the Transformer as a given. The interesting question is no longer how the mechanism works, it is why so much turned out to be expressible in it.