On Attention Mechanisms

The core intuition

Attention is a learned, differentiable key-value lookup. Given a query vector, you compare it against a set of keys, weight the corresponding values by similarity, and take the weighted sum. That is the whole idea.
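
The lookup can be sketched in a few lines of plain Python for a single query. This is a minimal illustration, not a reference implementation: dot-product similarity and softmax normalization are assumed here (the description above does not commit to a similarity function), the names attend and softmax are made up for the example, and the "learned" part (the projections that would produce the query, keys and values) is left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Soft lookup: weight each value by how well its key matches the query."""
    # Assumption: similarity is a plain (unscaled) dot product.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # The result is the weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Toy memory with three key/value pairs (hypothetical data).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend([4.0, 0.0], keys, values)
```

In this toy memory the first and third keys match the query equally well, so they share almost all of the weight and the second value barely contributes to the output.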

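The power comes from what you do with it: if every position in a sequence can attend to every other position directly, you break the information bottleneck that recurrence imposes, where everything the model remembers has to pass through a fixed-size hidden state one step at a time.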

The information-theoretic reading

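There is an interpretation through Shannon's entropy: the softmax over attention scores is a probability distribution, so the attention mechanism is choosing a distribution over memories rather than a single memory address. The entropy of that distribution measures how diffuse the read is: zero for a hard lookup of one address, maximal for a uniform average over all of them.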

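Dividing the scores by √d_k keeps this distribution from collapsing. If the components of the query and key are independent with zero mean and unit variance, their dot product has variance d_k, so unscaled logits spread out as the dimension grows (high logit variance → near-deterministic softmax → vanishing gradients). It is a numerically motivated trick that happens to correspond to keeping the entropy of the attention distribution from collapsing to zero.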
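
A small experiment makes the entropy claim concrete. The sketch below is only an illustration under stated assumptions: queries and keys are drawn as independent standard-normal vectors (real learned projections need not look like this), and the helper names, d_k = 256, 64 keys and 20 trials are arbitrary choices for the example. It compares the average entropy of the attention distribution with and without the 1/√d_k factor.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_attention_entropy(d_k, n_keys, scale, trials=20, seed=0):
    """Average entropy of softmax(scale * q.k) over random queries and keys."""
    rng = random.Random(seed)  # same seed, so both runs see the same vectors
    total = 0.0
    for _ in range(trials):
        query = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        keys = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(n_keys)]
        logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        total += entropy(softmax(logits))
    return total / trials

d_k, n_keys = 256, 64
unscaled = mean_attention_entropy(d_k, n_keys, scale=1.0)
scaled = mean_attention_entropy(d_k, n_keys, scale=1.0 / math.sqrt(d_k))
max_entropy = math.log(n_keys)  # entropy of a uniform distribution over the keys

print(f"unscaled: {unscaled:.3f} nats")
print(f"scaled:   {scaled:.3f} nats")
print(f"uniform:  {max_entropy:.3f} nats")
```

Under these assumptions the unscaled logits have a standard deviation of about √d_k = 16, so the softmax puts nearly all of its mass on one key, while the scaled logits have a standard deviation of about 1 and leave the distribution much closer to uniform.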

The paper that made it central

"Attention Is All You Need" (Vaswani et al., 2017) went further than any predecessor by removing recurrence entirely and showing that attention alone was enough. Not an augmentation of a sequential model but a replacement.

The insight was partly engineering (parallelism) and partly conceptual: the sequence positions matter, but the order in which you visit them does not. Encode position, then attend.
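
"Encode position, then attend" can be checked directly. The sketch below is a toy under stated assumptions: self-attention uses identity projections (each row serves as its own query, key and value, with no learned weights), the tokens are hypothetical one-hot vectors, and the sinusoidal encoding is one possible position scheme, written in the style of the original Transformer paper. Without position information, shuffling the input only shuffles the output; once positions are added, the order changes the result.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Self-attention with identity projections (an assumption of this sketch)."""
    d = len(xs[0])
    out = []
    for q in xs:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        w = softmax(logits)
        out.append([sum(wi * v[j] for wi, v in zip(w, xs)) for j in range(d)])
    return out

def sinusoidal_position(pos, d):
    """Sine on even dimensions, cosine on odd ones, at geometrically spaced frequencies."""
    return [
        math.sin(pos / 10000 ** (i / d)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d))
        for i in range(d)
    ]

def add_positions(xs):
    """Add the encoding of each row's index to that row."""
    d = len(xs[0])
    return [
        [x + p for x, p in zip(row, sinusoidal_position(pos, d))]
        for pos, row in enumerate(xs)
    ]

def max_row_gap(a, b):
    """Largest absolute elementwise difference between two lists of rows."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
perm = [2, 0, 1]
shuffled = [tokens[p] for p in perm]

# No position information: permuting the input just permutes the output rows.
out = self_attention(tokens)
out_shuffled = self_attention(shuffled)
gap_plain = max_row_gap(out_shuffled, [out[p] for p in perm])

# With positions added, the same tokens in a different order give different outputs.
out_pos = self_attention(add_positions(tokens))
out_pos_shuffled = self_attention(add_positions(shuffled))
gap_pos = max_row_gap(out_pos_shuffled, [out_pos[p] for p in perm])
```

Here gap_plain is zero up to floating-point rounding, while gap_pos is clearly nonzero: the attention step itself never looks at order, so all order sensitivity comes from what was encoded into the inputs.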

Open questions worth tracing

Why do attention heads in large Transformers come to track syntactic roles that no one explicitly trained them for?
Is the multi-head architecture fundamental, or does it exist because a single high-dimensional head is harder to optimize?
What is the Bayesian interpretation of the attention weight distribution over context?