Sequence-to-Sequence Models
Motivation
Many problems map a variable-length input sequence to a variable-length output sequence: machine translation, summarization, speech recognition, code generation, dialogue. Models for these problems must
- Process inputs of arbitrary length without a fixed-size feature vector,
- Produce outputs of arbitrary length,
- Allow the lengths and the alignment between inputs and outputs to vary per example.
A sequence-to-sequence (seq2seq) model factorizes the problem into an encoder that consumes the input and a decoder that produces the output, joined by some shared representation. The framework (Sutskever et al. 2014; Cho et al. 2014) unified what had been a zoo of task-specific architectures and is the conceptual ancestor of every modern translation, summarization, and dialogue model.
The Original RNN Encoder-Decoder
Both encoder and decoder are recurrent neural networks (multi-layer LSTMs in Sutskever et al. 2014; gated units, the precursor of the GRU, in Cho et al. 2014).
Encoder. Read the input tokens \(x_1, \ldots, x_n\) one at a time, updating a hidden state \(h_t = \text{RNN}(h_{t-1}, x_t)\). After the full input has been consumed, the final state \(h_n\) is a fixed-size vector intended to summarize the entire input sequence.
Decoder. Initialize a decoder RNN with \(h_n\). At each step \(t\), take the previously generated token \(y_{t-1}\) as input, update the decoder hidden state, and predict \(y_t\) from a softmax over the output vocabulary. Generate until an <eos> token is produced.
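The encode-then-decode loop is short enough to write out. Below is a minimal PyTorch sketch; the `Seq2Seq` class, its dimensions, and the choice of `GRUCell` are illustrative assumptions, not the original configuration (the 2014 paper used deep multi-layer LSTMs).

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal RNN encoder-decoder sketch (illustrative, not the 2014 setup)."""
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.enc = nn.GRUCell(d, d)  # assumption: a single GRU cell for brevity
        self.dec = nn.GRUCell(d, d)
        self.out = nn.Linear(d, tgt_vocab)

    def encode(self, src_ids):                  # src_ids: (n,) token ids
        h = torch.zeros(1, self.enc.hidden_size)
        for x in self.src_emb(src_ids):         # h_t = RNN(h_{t-1}, x_t)
            h = self.enc(x.unsqueeze(0), h)
        return h                                # h_n: fixed-size summary

    @torch.no_grad()
    def greedy_decode(self, src_ids, bos, eos, max_len=50):
        h, y, out = self.encode(src_ids), torch.tensor([bos]), []
        for _ in range(max_len):                # decoder is initialized with h_n
            h = self.dec(self.tgt_emb(y), h)
            y = self.out(h).argmax(dim=-1)      # most likely next token
            if y.item() == eos:
                break
            out.append(y.item())
        return out
```

Note that `encode` returns only the final state: everything the decoder knows about the source must fit in that one vector, which is exactly the bottleneck discussed below.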
Training. Maximize the conditional log-likelihood
\[ \log P(y_1, \ldots, y_m \mid x_1, \ldots, x_n) = \sum_{t=1}^{m} \log P(y_t \mid y_{<t}, x_{1:n}). \]
During training, the ground-truth previous token \(y_{t-1}\) is fed in at each step (teacher forcing); during inference, the model’s own previous prediction is used instead.
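A sketch of one teacher-forced training step, reusing the hypothetical `Seq2Seq` class from the sketch above; the decoder is driven by the ground-truth prefix, and the loss is the mean per-token negative log-likelihood.

```python
import torch
import torch.nn.functional as F

def teacher_forced_loss(model, src_ids, tgt_ids, bos):
    """Mean -log P(y_t | y_<t, x) for one (src, tgt) pair.

    tgt_ids ends in <eos>; the decoder input at step t is the ground-truth
    token y_{t-1} (teacher forcing), never the model's own sample.
    """
    h = model.encode(src_ids)
    inputs = torch.cat([torch.tensor([bos]), tgt_ids[:-1]])  # shift right
    loss = 0.0
    for y_prev, y_true in zip(inputs, tgt_ids):
        h = model.dec(model.tgt_emb(y_prev.unsqueeze(0)), h)
        logits = model.out(h)                                # (1, tgt_vocab)
        loss = loss + F.cross_entropy(logits, y_true.unsqueeze(0))
    return loss / len(tgt_ids)
```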
Inference. Greedy decoding selects \(\arg\max_{y} P(y \mid y_{<t}, x_{1:n})\) at each step, but its locally optimal choices are often globally suboptimal. Beam search, the standard improvement, instead keeps the \(k\) highest-scoring partial hypotheses at each step.
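A compact beam-search sketch against the same hypothetical decoder interface; scores are summed log-probabilities, and the length normalization commonly used in practice is omitted for brevity.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, src_ids, bos, eos, k=4, max_len=50):
    """Keep the k highest-scoring partial hypotheses at each step."""
    beams = [([bos], 0.0, model.encode(src_ids))]  # (tokens, sum log P, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for toks, score, h in beams:
            h_new = model.dec(model.tgt_emb(torch.tensor([toks[-1]])), h)
            logp = F.log_softmax(model.out(h_new), dim=-1).squeeze(0)
            top = torch.topk(logp, k)   # expand each beam by its k best tokens
            for lp, tok in zip(top.values, top.indices):
                candidates.append((toks + [tok.item()], score + lp.item(), h_new))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for toks, score, h in candidates[:k]:      # finished beams leave the pool
            (finished if toks[-1] == eos else beams).append((toks, score, h))
        if not beams:
            break
    best = max(finished + beams, key=lambda c: c[1])
    return best[0][1:]                  # drop <bos>
```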
Diagram: encoder-decoder with thought vector
The encoder consumes the source tokens and produces a single fixed-size summary \(h_n\) (the “thought vector”). The decoder is initialized with \(h_n\) and emits target tokens one at a time, conditioning on the previously generated token.
The Bottleneck Problem
The fixed-size encoder state \(h_n\) must encode the entire input. For short inputs this works well; for long inputs output quality degrades sharply. The encoder is forced to compress all relevant information into a fixed-dimensional vector, losing positional detail and long-distance dependencies.
This bottleneck was the central obstacle to seq2seq performance on long sequences.
Attention
Bahdanau et al. (2015) eliminated the bottleneck with an attention mechanism. Instead of compressing the input into \(h_n\), the encoder produces all hidden states \(h_1, \ldots, h_n\). At each decoding step \(t\), the decoder computes a context vector
\[ c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i, \qquad \alpha_{t,i} = \text{softmax}_i\bigl(\text{score}(s_{t-1}, h_i)\bigr), \]
where \(s_{t-1}\) is the decoder’s hidden state. The decoder is conditioned on \(c_t\) in addition to its own state and the previous token.
The attention weights \(\alpha_{t,i}\) act as a soft alignment between input position \(i\) and output position \(t\). The model learns these alignments end-to-end. Attention immediately closed most of the long-sequence performance gap and became the standard architecture for seq2seq tasks.
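A sketch of the context-vector computation, assuming the additive (Bahdanau-style) score \(\text{score}(s, h) = v^\top \tanh(W_s s + W_h h)\); the projection names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """c_t = sum_i alpha_{t,i} h_i with an additive score(s_{t-1}, h_i)."""
    def __init__(self, d):
        super().__init__()
        self.W_s = nn.Linear(d, d, bias=False)  # projects the decoder state
        self.W_h = nn.Linear(d, d, bias=False)  # projects the encoder states
        self.v = nn.Linear(d, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (d,) decoder state; enc_states: (n, d) all encoder states
        scores = self.v(torch.tanh(self.W_s(s_prev) + self.W_h(enc_states)))
        alpha = torch.softmax(scores.squeeze(-1), dim=0)  # (n,) soft alignment
        context = alpha @ enc_states                      # (d,) weighted sum
        return context, alpha
```

Returning `alpha` alongside the context vector is what makes the learned alignments inspectable: plotting it over the source positions recovers the soft alignment described above.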
The Transformer
Vaswani et al. (2017) introduced the Transformer, which replaced the RNN backbone of seq2seq with stacked self-attention layers. The encoder is a stack of self-attention blocks; the decoder is a stack of masked self-attention blocks plus cross-attention to the encoder’s outputs.
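A sketch of scaled dot-product attention, the primitive from which both the self-attention and cross-attention blocks are built; a real Transformer layer adds learned query/key/value projections, multiple heads, residual connections, and layer normalization.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, all positions at once.

    q: (m, d_k) queries; k: (n, d_k) keys; v: (n, d_v) values.
    Self-attention: q, k, v all come from the same sequence.
    Cross-attention: q from the decoder, k and v from the encoder outputs.
    """
    scores = q @ k.T / math.sqrt(q.shape[-1])  # (m, n) similarity matrix
    if causal:  # masked self-attention in the decoder: hide future positions
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v   # (m, d_v)
```

Because the score matrix is a single matrix product over all positions, the whole layer runs in parallel across the sequence, which is the basis of the first advantage below.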
The advantages over RNN seq2seq are mostly computational:
- Parallelism. Self-attention processes all positions of a sequence simultaneously. RNN encoders and decoders are inherently sequential along time; Transformers fully exploit GPU parallelism during training.
- Long-range dependencies. Self-attention gives an \(O(1)\) path length between any two positions; in an RNN the path length is \(O(n)\), and gradients along such long paths vanish or explode.
- Scalability. Transformer training scales smoothly to hundreds of billions of parameters on trillions of tokens.
Modern Seq2seq
Modern seq2seq models — Google Translate, summarization systems, instruction-tuned chat assistants — are encoder-decoder Transformers (T5, mBART, NLLB) or decoder-only Transformers used in a seq2seq style (GPT-style models prompted with the input followed by an output marker).
The encoder-decoder template still defines the abstraction; what has changed since 2014 is the choice of building block (Transformer vs RNN), the scale (billions of parameters vs millions), and the breadth of tasks treated as seq2seq (almost everything: classification as “predict the label token,” QA as “generate the answer text,” code generation, multi-step reasoning).