Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
The argument
For most of the 2010s, sequence modeling meant recurrence. To read a sentence, a network read it the way we imagine we read, one token after another, carrying a hidden state forward like a candle through a tunnel. The candle was the bottleneck. Recurrence forced computation to be sequential, and sequential computation does not parallelize.
The Transformer makes a different wager: that the relationships between positions in a sequence matter more than the order in which a machine visits them. Replace the candle with a room full of lamps, all lit at once. Every token can attend to every other token directly, in a single step.
Self-attention, plainly
Each token emits three vectors: a query, a key, and a value. The query asks a question; each key, compared against that query, says how relevant its token is; the values carry the content to be mixed in. The mixing weights are a softmax over scaled dot products:
Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V
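As a minimal sketch of the formula above, here is scaled dot-product attention in plain Python on lists of row vectors. The function names `softmax` and `attention` and the toy inputs are illustrative assumptions, not code from the paper; a practical implementation would use batched matrix multiplies instead of Python loops.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of rows).
    Returns (outputs, weights): one output row and one weight row per query.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    outputs, all_weights = [], []
    for q in Q:
        # One score per key: the query-key dot product divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Output is the weighted average of the value rows.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs (hypothetical): one query that matches the first of two keys.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
```

In this toy case the query aligns with the first key, so the first value receives the larger weight, and each row of weights sums to 1.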
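The √d_k scaling is not cosmetic. When query and key components have roughly unit variance, the variance of their dot product grows with the dimension d_k. Without the scaling, the scores grow large in magnitude, the softmax saturates, and gradients vanish. Dividing by √d_k keeps the room of lamps from blowing out.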
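The effect of the scaling can be checked numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution; the dimension `d_k = 64`, the sample count, and the seed are arbitrary choices for illustration. Under that assumption the raw dot products should have variance near d_k, and the scaled ones variance near 1.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_vector(d, rng):
    """Vector of d independent standard-normal components."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(0)  # fixed seed so the run is repeatable
d_k = 64                # assumed dimension, chosen for illustration

# Raw query-key dot products, and the same scores divided by sqrt(d_k).
raw = [dot(random_vector(d_k, rng), random_vector(d_k, rng)) for _ in range(1000)]
scaled = [s / math.sqrt(d_k) for s in raw]

raw_var = sample_variance(raw)
scaled_var = sample_variance(scaled)

# Largest softmax weight over the same eight scores, before and after scaling.
peak_raw = max(softmax(raw[:8]))
peak_scaled = max(softmax(scaled[:8]))
```

Dividing the scores by a constant greater than 1 can only lower the largest softmax weight, so `peak_scaled` never exceeds `peak_raw`. A peak close to 1 means the softmax has saturated onto a single token.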
Why it mattered
- Parallelism. Attention over a sequence reduces to a few matrix multiplies, computed for every position at once. GPUs were waiting for exactly this shape of problem.
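- Distance is free. A recurrent network pays a cost proportional to the gap between two related words, because information must survive one sequential step per intervening token. Attention connects any two positions in a single step, whether they are neighbors or a thousand tokens apart.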
- It composed. Stack the blocks, add more data, and the curve kept bending. The architecture turned out to be a substrate, not a model.
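The point about distance can be made concrete with a toy comparison. This is a sketch under stated assumptions, not a model of any real network: the recurrence is a scalar linear update with a hypothetical decay of 0.9, the keys and query are hand-picked, and positional encodings (which the Transformer adds to its inputs) are left out. All function names are illustrative.

```python
import math

def recurrent_path(gap, decay=0.9):
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t with later inputs zero.

    A unit input needs `gap` sequential updates to reach the state `gap`
    positions later, and arrives scaled by decay**gap.
    """
    h = 1.0
    steps = 0
    for _ in range(gap):
        h = decay * h
        steps += 1
    return steps, h

def attention_weight_on(query, keys, index):
    """Softmax weight that `query` places on keys[index] (scaled dot products)."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return exps[index] / sum(exps)

def sequence_with_match(length, position):
    """`length` filler keys with a single matching key placed at `position`."""
    keys = [[0.0, 1.0] for _ in range(length)]
    keys[position] = [4.0, 0.0]
    return keys

query = [4.0, 0.0]

# Attention: the weight on the matching key does not depend on where it sits.
near = attention_weight_on(query, sequence_with_match(1000, 1), 1)
far = attention_weight_on(query, sequence_with_match(1000, 999), 999)

# Recurrence: steps grow with the gap and the carried signal shrinks.
steps_near, signal_near = recurrent_path(1)
steps_far, signal_far = recurrent_path(999)
```

In this sketch `near` and `far` agree up to floating-point rounding, while the recurrent path needs 999 sequential updates to cover the longer gap and delivers a far weaker signal. The single-step access still has a price: scores are computed for every pair of positions, so total work grows with the square of the sequence length.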
The title was a joke that became a prophecy. Attention was, in fact, very nearly all you needed.
What to read it against
Read this beside Bahdanau's earlier work on attention for translation, and then forward into the scaling-law literature that treated the Transformer as a given. The interesting question is no longer how the mechanism works; it is why so much turned out to be expressible in it.
A working note tracing the idea of attention from early seq2seq work through the Transformer, connecting to information-theoretic roots.