Transformer
Motivation
RNN-based sequence-to-sequence models — even with attention — process tokens sequentially. This sequential structure has two costs:
- No parallelism along the sequence axis. Step \(t\) cannot begin until step \(t - 1\) finishes, severely limiting GPU/TPU utilization during training.
- Long-range gradient propagation. Information must flow through \(O(n)\) recurrent steps to connect positions \(n\) apart, where vanishing/exploding gradients dominate.
The Transformer (Vaswani et al. 2017) drops recurrence entirely and uses self-attention as the only sequence-mixing primitive. The result: every position can interact with every other in a single layer, and all positions of a sequence can be processed in parallel during training. This combination of expressivity and hardware-friendliness made the Transformer the dominant architecture across natural language, vision, audio, and multi-modal modeling within a few years.
Architecture
A Transformer consists of an embedding stage and a stack of identical blocks.
Token Embedding
Each input token \(x_i\) is mapped to a vector \(e_i \in \mathbb{R}^d\) via an embedding table. The full input is a matrix \(E \in \mathbb{R}^{n \times d}\).
Positional Encoding
Self-attention is permutation equivariant, so the model has no notion of order without explicit positional information. The original Transformer adds fixed sinusoidal positional encodings \(P \in \mathbb{R}^{n \times d}\) to the embeddings: \(X = E + P\). Modern variants use learned positional embeddings, rotary positional embeddings (RoPE), which encode relative position via rotations applied to the query and key vectors, or ALiBi, which biases the attention scores directly by relative distance.
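A minimal sketch of the sinusoidal scheme in PyTorch (the sequence length, model width, and variable names are illustrative, not taken from the paper): even dimensions get sines and odd dimensions get cosines at geometrically spaced frequencies.

```python
import torch

def sinusoidal_positional_encoding(n: int, d: int, base: float = 10000.0) -> torch.Tensor:
    """Fixed sinusoidal encoding P of shape (n, d):
    P[pos, 2k] = sin(pos / base^(2k/d)), P[pos, 2k+1] = cos(pos / base^(2k/d))."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # (n, 1) positions
    i = torch.arange(0, d, 2, dtype=torch.float32)            # even dimension indices 2k
    angles = pos / base ** (i / d)                            # (n, d/2) angles
    P = torch.zeros(n, d)
    P[:, 0::2] = torch.sin(angles)   # even dimensions: sine
    P[:, 1::2] = torch.cos(angles)   # odd dimensions: cosine
    return P

E = torch.randn(128, 512)                          # token embeddings, n=128, d=512
X = E + sinusoidal_positional_encoding(128, 512)   # X = E + P
```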
Transformer Block
The core unit. Each block has two sublayers, each wrapped in a residual connection and a layer norm (a minimal code sketch follows the component list below):
\[ y = \operatorname{LayerNorm}(x + \operatorname{MHA}(x)), \qquad z = \operatorname{LayerNorm}(y + \operatorname{FFN}(y)). \]
- Multi-head self-attention (MHA). Computes self-attention with \(H\) heads operating in parallel on different learned projections of \(x\).
- Position-wise feedforward network (FFN). A two-layer MLP applied independently to each position: \(\operatorname{FFN}(z) = \sigma(z W_1 + b_1) W_2 + b_2\), with hidden width typically \(4d\).
- Residual connections. Every sublayer is wrapped in \(x + \operatorname{Sublayer}(x)\), allowing gradients to flow through the depth and letting the model learn small refinements.
- Layer normalization. Stabilizes training. Modern variants use pre-norm (\(x + \operatorname{Sublayer}(\operatorname{LN}(x))\)) for better training stability at scale.
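The following sketch implements a single post-norm block exactly as in the equations above, using PyTorch's built-in multi-head attention; the width \(d = 512\) and \(H = 8\) heads are illustrative choices, not prescriptions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm block: y = LN(x + MHA(x)), z = LN(y + FFN(y))."""

    def __init__(self, d: int = 512, n_heads: int = 8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(        # position-wise FFN with the usual 4d hidden width
            nn.Linear(d, 4 * d),
            nn.ReLU(),
            nn.Linear(4 * d, d),
        )
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d). Self-attention: queries, keys, and values all come from x.
        attn_out, _ = self.mha(x, x, x, need_weights=False)
        y = self.ln1(x + attn_out)       # first residual connection + layer norm
        z = self.ln2(y + self.ffn(y))    # second residual connection + layer norm
        return z

x = torch.randn(2, 128, 512)    # (batch, sequence length, model width)
z = TransformerBlock()(x)       # output has the same shape: (2, 128, 512)
```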
Encoder vs. Decoder
The original Transformer is encoder-decoder, designed for translation:
- Encoder. Stack of blocks with bidirectional self-attention (every position can attend to every other).
- Decoder. Stack of blocks with two attention sublayers per block: masked self-attention (each position can only attend to itself and earlier positions) and cross-attention (decoder queries attend to encoder keys and values).
The masking in the decoder ensures the model cannot “cheat” by looking at future tokens during training. At inference time the decoder generates tokens one at a time autoregressively.
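A single-head sketch of the causal mask (illustrative names; the multi-head projections and cross-attention are omitted): entries above the diagonal of the score matrix are set to \(-\infty\) before the softmax, so each query only weights itself and earlier positions.

```python
import torch

def causal_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with a causal mask: position i attends only to j <= i.
    Q, K, V: (n, d_k) for a single head."""
    n, d_k = Q.shape
    scores = Q @ K.T / d_k ** 0.5                       # (n, n) attention logits
    future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))  # block attention to future positions
    return torch.softmax(scores, dim=-1) @ V            # weights sum to 1 over the allowed prefix
```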
Output Head
A linear projection from the final hidden state to vocabulary logits. At generation time a softmax turns the logits into a next-token distribution; during training they feed a cross-entropy loss. The projection often shares weights with the input embedding (tied embeddings).
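A short sketch of the head with tied weights (the vocabulary size and width are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d = 32000, 512                       # illustrative sizes
embedding = nn.Embedding(vocab_size, d)          # input embedding table, shape (vocab_size, d)
lm_head = nn.Linear(d, vocab_size, bias=False)   # projection to vocabulary logits
lm_head.weight = embedding.weight                # tied embeddings: the two share one matrix

hidden = torch.randn(1, 10, d)                   # final hidden states from the block stack
logits = lm_head(hidden)                         # (1, 10, vocab_size)
probs = torch.softmax(logits, dim=-1)            # next-token distribution at each position
```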
Diagram: encoder-decoder block stack
Two encoder blocks process the source; two decoder blocks produce the target one position at a time. Decoder blocks add a cross-attention sublayer that queries the encoder output.
Three Common Configurations
Encoder-only (BERT, RoBERTa). Only the encoder stack, with bidirectional self-attention. Used for classification, embedding, retrieval, and tasks where every position needs context from every other. Pretrained with masked language modeling: randomly mask tokens and predict them from context.
Decoder-only (GPT, Claude, Llama, Gemini). Only the decoder stack, with causal-masked self-attention; no cross-attention because there is no separate encoder. Pretrained with next-token prediction. The dominant architecture for general-purpose language models, including instruction-tuned chat assistants.
Encoder-decoder (T5, mBART). Both stacks. Used for translation, summarization, and seq2seq tasks where input and output have a clean separation.
The decoder-only configuration has largely won out: pretraining is simpler (one objective, plain text), inference is uniform across tasks (everything is text in / text out), and scaling laws have proven especially favorable.
Why Transformers Work
- Constant path length. Every token can interact with every other in \(O(1)\) layers, without the gradient pathology of deep RNNs.
- Massive parallelism. Training computes attention over all positions in one matrix multiplication; full sequence parallelization on GPU/TPU.
- Soft retrieval inductive bias. Self-attention is an excellent match for language, where understanding any one word depends on a learned, dynamic, content-dependent set of others — exactly what attention computes.
- Scalability. Empirical scaling laws (Kaplan et al., Hoffmann et al.) show test loss decreases as a power law in parameters, training data, and compute. There is no apparent saturation up to \(10^{12}\) parameters and \(10^{13}\) tokens. Modern frontier LLMs are decoder-only Transformers near these scales.
Training and Inference Considerations
- Training cost. Dominated by attention’s \(O(n^2 d)\) term and the \(4d\)-wide FFN’s \(O(n d^2)\) term; for long contexts, the quadratic attention term dominates.
- Inference. Autoregressive generation is sequential — one token per step. Each new step only needs attention between the newest query and the existing prefix; KV caching stores the key and value tensors of all previous tokens, so each step costs \(O(n d)\), not \(O(n^2 d)\) (see the sketch after this list).
- Long contexts. Quadratic attention is the central scaling pain point. FlashAttention improves the constant factors and memory profile dramatically without changing the algorithm. Sparse, sliding-window, and linear-attention variants change the algorithm, trading exactness for sub-quadratic scaling.
- Mixed precision and quantization. Modern training and inference use fp16/bf16 with selective fp32; inference frequently uses int8 or 4-bit quantization, often with negligible quality loss.
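A minimal single-head sketch of one cached decoding step (the names, the single-head simplification, and the absence of batching are all illustrative):

```python
import torch

def generate_step(x_new: torch.Tensor, cache_K: torch.Tensor, cache_V: torch.Tensor,
                  W_q: torch.Tensor, W_k: torch.Tensor, W_v: torch.Tensor):
    """One autoregressive step with a KV cache.
    x_new: (d,) hidden state of the newest token; cache_K, cache_V: (t, d_k) for the t
    previous tokens; W_q, W_k, W_v: (d, d_k) projection matrices."""
    q = x_new @ W_q                                    # only the new token needs a query
    k, v = x_new @ W_k, x_new @ W_v                    # project the new token's key/value once...
    cache_K = torch.cat([cache_K, k.unsqueeze(0)])     # ...and append them to the cache
    cache_V = torch.cat([cache_V, v.unsqueeze(0)])
    scores = cache_K @ q / cache_K.shape[-1] ** 0.5    # (t+1,) logits over the cached prefix
    weights = torch.softmax(scores, dim=-1)
    out = weights @ cache_V                            # (d_k,) attention output for this step
    return out, cache_K, cache_V                       # the grown cache is reused next step
```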
Limitations
- Quadratic attention. Long-context training and inference are expensive, and the architecture has no built-in mechanism for handling very long sequences cheaply.
- No structural inductive bias. Transformers are general-purpose to a fault: they have no built-in notion of tree, grid, or graph structure, and learn such structure (when needed) only from data.
- Position encoding fragility. Generalizing to context lengths longer than those seen in training has been a persistent challenge; positional schemes (RoPE, ALiBi) are continually being revised.
- Hallucination, calibration, reasoning. These are characteristic problems of trained Transformers, not architectural ones — but the architecture sets the ceiling on what training can fix.