Attention
Motivation
When a model is processing one part of a sequence, it usually needs information from particular other parts of the sequence — and which parts depend on the input. A translator generating the verb of an English sentence may need information from the corresponding verb of the French source; a summarizer producing a sentence about a person may need that person’s name from earlier in the document.
Architectures that fixed which positions could exchange information could not flexibly handle long-range dependencies: an RNN passes information only between adjacent time steps, and a CNN only within a fixed local window. Attention allows a model to dynamically retrieve information from any position of a source by computing a weighted average over source positions, with the weights determined by the data.
Attention started as a fix for the encoder bottleneck in sequence-to-sequence models. It then displaced recurrence and convolution as the dominant sequence-mixing primitive, becoming the foundation of the Transformer and through it modern language, vision, and multimodal models.
The Attention Operation
Attention is a soft, differentiable retrieval. Given:
- A query \(q \in \mathbb{R}^{d_k}\),
- A set of keys \(k_1, \ldots, k_n \in \mathbb{R}^{d_k}\),
- Corresponding values \(v_1, \ldots, v_n \in \mathbb{R}^{d_v}\),
attention computes
\[ \operatorname{attn}(q, K, V) = \sum_{i=1}^{n} \alpha_i\, v_i, \qquad \alpha_i = \frac{\exp(\operatorname{score}(q, k_i))}{\sum_{j=1}^{n} \exp(\operatorname{score}(q, k_j))}. \]
The output is a weighted average of the values, with weights given by softmax over a similarity score between the query and each key. Where a hard dictionary lookup would return \(V[\arg\max_i \operatorname{score}(q, k_i)]\), attention returns a soft, differentiable approximation that allows gradients to flow back to the query and to every key and value.
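To make the soft-vs-hard contrast concrete, here is a minimal NumPy sketch (names and shapes are illustrative, not taken from any particular library); it uses a plain dot-product score:

```python
import numpy as np

def soft_attention(q, K, V):
    """Soft retrieval: softmax-weighted average of the values."""
    scores = K @ q                       # score(q, k_i) = q . k_i (dot product)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # alpha_i: a distribution over positions
    return weights @ V                   # sum_i alpha_i * v_i

def hard_lookup(q, K, V):
    """Hard retrieval: return the value whose key scores highest."""
    return V[np.argmax(K @ q)]

rng = np.random.default_rng(0)
q = rng.normal(size=4)        # one query, d_k = 4
K = rng.normal(size=(6, 4))   # six keys
V = rng.normal(size=(6, 8))   # six values, d_v = 8

print(soft_attention(q, K, V))  # differentiable blend of all six values
print(hard_lookup(q, K, V))     # exactly one value, no gradient to the rest
```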
Scoring Functions
- Dot product. \(\operatorname{score}(q, k) = q^\top k\). Cheap; works when \(q\) and \(k\) have similar magnitudes.
- Scaled dot product (Vaswani et al. 2017). \(\operatorname{score}(q, k) = q^\top k / \sqrt{d_k}\). The standard choice in modern Transformers; the \(\sqrt{d_k}\) factor prevents the softmax from saturating as \(d_k\) grows.
- Additive (Bahdanau) (Bahdanau et al. 2015). \(\operatorname{score}(q, k) = w^\top \tanh(W_1 q + W_2 k)\). The original attention formulation; more parameters, slightly more expressive, but less GPU-friendly than dot product.
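As a rough illustration of the three scores in code (the parameters `W1`, `W2`, and `w` below are random placeholders standing in for learned weights):

```python
import numpy as np

d_k = 4
rng = np.random.default_rng(0)
q, k = rng.normal(size=d_k), rng.normal(size=d_k)

# Dot product: cheap, assumes q and k have comparable magnitudes.
dot = q @ k

# Scaled dot product: divide by sqrt(d_k) so the score's variance stays
# roughly constant as d_k grows, keeping the softmax away from saturation.
scaled = q @ k / np.sqrt(d_k)

# Additive (Bahdanau): a small feed-forward net scores the (q, k) pair.
W1 = rng.normal(size=(8, d_k))   # hypothetical learned parameters
W2 = rng.normal(size=(8, d_k))
w = rng.normal(size=8)
additive = w @ np.tanh(W1 @ q + W2 @ k)

print(dot, scaled, additive)
```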
In matrix form, with rows of \(Q \in \mathbb{R}^{m \times d_k}\), \(K \in \mathbb{R}^{n \times d_k}\), \(V \in \mathbb{R}^{n \times d_v}\):
\[ \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V. \]
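A direct NumPy transcription of this matrix form, as a minimal sketch (the function name is illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (m, n) score matrix
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # (m, d_v)

rng = np.random.default_rng(0)
m, n, d_k, d_v = 3, 5, 8, 16
Q = rng.normal(size=(m, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 16): one output per query
```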
Diagram: a single query attends to four keys; the softmax over the scores yields weights, which are used to take a weighted average of the corresponding values into one output.
Cross-Attention vs. Self-Attention
Cross-attention. Queries come from one sequence (e.g., decoder states), keys and values from another (e.g., encoder states). This is the original Bahdanau et al. (2015) setup: a decoder query retrieves information from the encoder representation.
Self-attention. Queries, keys, and values all come from the same sequence — typically computed as different linear projections of one input matrix \(X\):
\[ Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V. \]
Each position attends to every other position of the same sequence. Self-attention is the foundation of the Transformer; it lets a model exchange information between any pair of positions in a single layer, regardless of distance.
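A minimal self-attention sketch in the same style, with random matrices standing in for the learned projections \(W^Q, W^K, W^V\):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(n, d_model))          # one input sequence of n positions

# Placeholder projections; in a real model these are trained parameters.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # queries, keys, values from the same X
out = softmax(Q @ K.T / np.sqrt(d_k)) @ V  # every position attends to every position
print(out.shape)                           # (5, 8)
```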
Multi-Head Attention
A single attention operation captures one type of relationship between positions. Multi-head attention runs \(H\) attention operations in parallel, each with its own learned projections \(\{W^Q_h, W^K_h, W^V_h\}\), then concatenates the outputs and projects back:
\[ \operatorname{MHA}(X) = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_H)\, W^O, \qquad \text{head}_h = \operatorname{Attention}(X W^Q_h, X W^K_h, X W^V_h). \]
Each head can specialize in a different aspect of the input — for example, one might track syntactic dependencies and another semantic relations. With per-head dimension \(d_k / H\), the total parameter count of multi-head attention is the same as one attention layer of dimension \(d_k\), while expressiveness improves.
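A plausible NumPy sketch of multi-head attention under these definitions (the per-head projections and \(W^O\) are random placeholders for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: lists of per-head projections; W_O: output projection."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_h = Q.shape[-1]                            # per-head dimension
        heads.append(softmax(Q @ K.T / np.sqrt(d_h)) @ V)
    return np.concatenate(heads, axis=-1) @ W_O      # concat heads, project back

rng = np.random.default_rng(0)
n, d_model, H = 5, 16, 4
d_h = d_model // H                                   # each head gets d_model / H dims
W_Q = [rng.normal(size=(d_model, d_h)) for _ in range(H)]
W_K = [rng.normal(size=(d_model, d_h)) for _ in range(H)]
W_V = [rng.normal(size=(d_model, d_h)) for _ in range(H)]
W_O = rng.normal(size=(d_model, d_model))
X = rng.normal(size=(n, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```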
Properties
Permutation equivariance. Attention has no built-in notion of position: permuting the key–value pairs leaves each query's output unchanged, and permuting the inputs of self-attention simply permutes the outputs in the same way. Models that need to know which position they are attending to (almost always, for sequences) must add explicit positional encodings to the inputs: sinusoidal, learned, or rotary.
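A quick numerical check of this property, as a toy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))

perm = rng.permutation(5)                 # shuffle the key/value positions together
out = attention(Q, K, V)
out_perm = attention(Q, K[perm], V[perm])
print(np.allclose(out, out_perm))         # True: the output does not change
```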
Quadratic cost. Self-attention computes scores between every pair of positions: \(O(n^2 d)\) time and \(O(n^2)\) memory in sequence length \(n\). This is the dominant bottleneck for long contexts. Many “efficient attention” variants (Linformer, Performer, sliding window, sparse attention, Longformer) trade exactness for sub-quadratic scaling. FlashAttention is an algorithmic improvement that computes exact attention with reduced memory traffic, retaining quadratic compute but making it practical for much longer sequences.
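A back-of-the-envelope illustration of the quadratic memory cost, assuming the \(n \times n\) score matrix is materialized in fp16 for a single head of a single layer:

```python
# 2 bytes per fp16 entry; one n x n score matrix per head per layer.
for n in (1_024, 8_192, 65_536):
    mib = n * n * 2 / 2**20
    print(f"n = {n:>6}: {mib:>8.0f} MiB")
# n =   1024:        2 MiB
# n =   8192:      128 MiB
# n =  65536:     8192 MiB
```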
Soft retrieval. Attention weights are differentiable, so the entire model trains end-to-end. The model learns which positions to attend to as part of optimization, with no hard constraint on the connectivity pattern.
Variable input length. Attention adapts trivially to inputs of different lengths — no architectural change needed.
Beyond Sequences
Attention is a general “soft dictionary lookup” primitive, not specific to sequences. It now underpins:
- Vision Transformers. Patches of an image play the role of tokens; self-attention replaces convolution.
- Graph attention networks. Each node attends to its neighbors with learned weights.
- Memory-augmented networks. Attend to an external memory bank.
- Retrieval-augmented generation. Attend to retrieved documents as additional context.
- Multi-modal models. Cross-attention between modalities (text, image, audio) is the standard interface.
The shift from architecture-specific connectivity (recurrence, convolution) to learned, data-dependent connectivity (attention) is one of the central architectural moves of modern deep learning.