Recurrent Neural Networks
Motivation
A recurrent neural network (RNN) processes an input sequence by maintaining a hidden state that is updated at each time step (Rumelhart et al. 1986). The same parameters are applied at every step, so the network can — in principle — handle sequences of arbitrary length, and the hidden state can carry information across arbitrary lags.
RNNs were the dominant sequence model for a decade (roughly 2010–2020) before transformers replaced them in most large-scale settings. They remain useful for low-resource and online sequence processing, and the conceptual core — recurrent state, weight sharing across time, backpropagation through time — remains foundational.
Vanilla RNN
Inputs \(x_1, \ldots, x_T \in \mathbb{R}^{d_x}\). Hidden state \(h_t \in \mathbb{R}^{d_h}\), initialized as \(h_0 = 0\) (or learned). The recurrence is
\[ h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \qquad y_t = W_{hy} h_t + b_y, \]
with shared parameters \(\theta = (W_{hh}, W_{xh}, W_{hy}, b_h, b_y)\) and elementwise nonlinearity \(\sigma\) (typically \(\tanh\)). The output \(y_t\) can be a per-step prediction (e.g., next-token logits in a language model) or read only at the final step (e.g., a sentiment label from a sentence).
The same weights are used at every step. This is the analog of weight sharing in CNNs: a single set of parameters defines how the state updates regardless of position in the sequence, giving the model an inductive bias for time-translation symmetry.
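To make this concrete, here is a minimal NumPy sketch of the forward pass; the function name, dimensions, and toy initialization are illustrative, not from the text.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Vanilla RNN forward pass over a sequence.

    xs has shape (T, d_x); returns per-step outputs (T, d_y) and
    hidden states (T, d_h). h_0 = 0, as in the text.
    """
    h = np.zeros(W_hh.shape[0])
    hs, ys = [], []
    for x in xs:                                # same weights at every step
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)  # state update
        ys.append(W_hy @ h + b_y)               # per-step readout
        hs.append(h)
    return np.array(ys), np.array(hs)

# Toy usage: T=5 steps, d_x=3, d_h=4, d_y=2.
rng = np.random.default_rng(0)
T, d_x, d_h, d_y = 5, 3, 4, 2
ys, hs = rnn_forward(
    rng.normal(size=(T, d_x)),
    rng.normal(0, 0.5, (d_h, d_h)), rng.normal(0, 0.5, (d_h, d_x)),
    rng.normal(0, 0.5, (d_y, d_h)), np.zeros(d_h), np.zeros(d_y),
)
print(ys.shape, hs.shape)  # (5, 2) (5, 4)
```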
Diagram: the RNN unrolled across four time steps. The same cell (with shared weights \(W_{hh}, W_{xh}, W_{hy}\)) is applied at every step; the hidden state \(h_{t-1}\) flows into step \(t\).
What RNNs Are Used For
- Sequence labeling. One label per input step (POS tagging, named-entity recognition, frame-level audio classification).
- Sequence classification. One label per sequence — read the final hidden state.
- Sequence-to-sequence. Encoder-decoder pairs of RNNs for machine translation, summarization, speech recognition. See sequence-to-sequence models.
- Language modeling. Autoregressive next-token prediction. RNN language models defined the state of the art before transformers.
- Online prediction. Streaming inputs where transformers’ quadratic context cost is prohibitive.
Training
Training an RNN minimizes a per-step or sequence-level loss summed over time,
\[ L = \sum_{t=1}^T \ell(y_t, y_t^\star), \]
where \(y_t\) is the network output defined above and \(y_t^\star\) is the target at step \(t\).
Gradients with respect to the shared parameters require backpropagation through time: unfold the recurrence into a feedforward network with \(T\) layers (one per time step) and run backpropagation on the unfolded graph. The gradient at parameter \(\theta\) is the sum of contributions from every time step, since the same \(\theta\) is reused.
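Here is a minimal NumPy sketch of BPTT for this recurrence, assuming a per-step squared loss (an illustrative choice). The point to notice is that \(\partial L / \partial W_{hh}\) accumulates one term per unfolded step.

```python
import numpy as np

def bptt_dW_hh(xs, targets, W_hh, W_xh, W_hy, b_h, b_y):
    """Forward pass, then backprop through the unfolded graph.

    Assumes a per-step squared loss. Returns dL/dW_hh, which is a
    sum of one contribution per time step: the same W_hh is reused
    at every step, so every step's gradient flows into it.
    """
    T = len(xs)
    hs = [np.zeros(W_hh.shape[0])]            # hs[t] = h_t, with hs[0] = h_0
    for x in xs:                              # unfold: one "layer" per step
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x + b_h))
    dW_hh = np.zeros_like(W_hh)
    da_next = np.zeros(W_hh.shape[0])         # gradient flowing back in time
    for t in range(T, 0, -1):
        dy = (W_hy @ hs[t] + b_y) - targets[t - 1]  # dL/dy_t, squared loss
        dh = W_hy.T @ dy + W_hh.T @ da_next         # dL/dh_t
        da = dh * (1.0 - hs[t] ** 2)                # through the tanh
        dW_hh += np.outer(da, hs[t - 1])            # one term per step
        da_next = da
    return dW_hh
```

Truncated BPTT, discussed below, amounts to stopping this backward loop after a fixed window rather than running it over all \(T\) steps.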
The principal training pathology is the vanishing and exploding gradient problem. A gradient propagated \(T\) steps through the recurrence picks up a product of \(T\) Jacobians; if those Jacobians have spectral radius greater than \(1\), the product grows exponentially in \(T\) and the gradient explodes, and if less than \(1\), it shrinks exponentially and the gradient vanishes. In practice this makes vanilla RNNs very difficult to train on dependencies longer than a few dozen steps.
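A quick numerical illustration, using the linearized recurrence (the same Jacobian at every step, ignoring the \(\tanh\) derivative, which only shrinks gradients further); the dimensions and spectral radii are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 32, 100
for rho in (0.9, 1.1):                        # target spectral radius of W_hh
    W = rng.normal(size=(d, d))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
    g = np.ones(d)                            # stand-in for dL/dh_T
    for _ in range(T):                        # product of T identical Jacobians
        g = W.T @ g
    print(rho, np.linalg.norm(g))             # near zero for 0.9, huge for 1.1
```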
The standard responses:
- Architecture changes. LSTM and GRU introduce gating that gives the gradient a near-identity path through time, side-stepping the vanishing case.
- Optimization changes. Gradient clipping addresses exploding gradients (a sketch follows this list).
- Truncated BPTT. Only backpropagate through a fixed window (e.g., \(50\) or \(100\) steps) rather than the full sequence. Computationally necessary for long sequences.
- Better initialization. Initializing \(W_{hh}\) at or near the identity keeps early-training Jacobians close to \(I\).
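A sketch of gradient clipping in its usual global-norm form; the threshold value is an illustrative choice.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so their joint L2 norm is at most max_norm.

    grads is a list of arrays (one per parameter). Rescaling all of
    them by the same factor preserves the gradient's direction.
    """
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```

Truncated BPTT is complementary: clipping bounds the size of the gradient, while truncation bounds the depth of the unfolded graph it must pass through.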
Bidirectionality
A bidirectional RNN runs two RNNs in opposite directions and concatenates the hidden states: \(h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]\). The forward RNN sees \(x_{1:t}\); the backward RNN sees \(x_{t:T}\). So the combined state at \(t\) depends on the full sequence, which is useful for tasks where future context helps (e.g., POS tagging). Inference is offline — you cannot run a bidirectional RNN on a streaming input.
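A minimal sketch of the two-direction pass and the concatenation; parameter names are illustrative.

```python
import numpy as np

def birnn_states(xs, fwd, bwd):
    """Bidirectional hidden states: h_t = [fwd h_t ; bwd h_t].

    fwd and bwd are (W_hh, W_xh, b_h) tuples, one per direction.
    Needs the whole sequence up front, hence offline only.
    """
    def run(seq, W_hh, W_xh, b_h):
        h, out = np.zeros(W_hh.shape[0]), []
        for x in seq:
            h = np.tanh(W_hh @ h + W_xh @ x + b_h)
            out.append(h)
        return out
    f = run(xs, *fwd)                # state at t depends on x_1..x_t
    b = run(xs[::-1], *bwd)[::-1]    # state at t depends on x_t..x_T
    return [np.concatenate(pair) for pair in zip(f, b)]
```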
Variants and Successors
- Stacked / deep RNNs. Multiple recurrent layers; the hidden state of layer \(\ell\) is the input to layer \(\ell + 1\) (see the sketch after this list).
- LSTM and GRU. Standard gated variants. Use these instead of vanilla RNNs in any practical setting.
- State-space models (S4, Mamba, RWKV). Modern alternatives that revisit the recurrent paradigm with structured-state-space dynamics. Linear time complexity in sequence length and improved long-range modeling.
- Transformers. The dominant architecture for sequence modeling at scale. Quadratic in sequence length but parallelizable across positions, which RNNs are not.
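A minimal sketch of the stacked case; parameter names and shapes are illustrative.

```python
import numpy as np

def stacked_rnn(xs, layers):
    """layers: one (W_hh, W_xh, b_h) tuple per recurrent layer.

    Layer l's hidden-state sequence is layer l+1's input sequence,
    so each W_xh after the first must accept the previous layer's
    hidden dimension.
    """
    seq = list(xs)
    for W_hh, W_xh, b_h in layers:
        h, out = np.zeros(W_hh.shape[0]), []
        for x in seq:
            h = np.tanh(W_hh @ h + W_xh @ x + b_h)
            out.append(h)
        seq = out                    # feed this layer's states upward
    return seq                       # top-layer states, one per step
```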
Strengths and Weaknesses
Pros:
- Linear complexity in sequence length.
- Constant memory at inference (one hidden state).
- Naturally handles streaming, online tasks.

Cons:
- Sequential: cannot parallelize across time within a single sequence, which is the dominant cost on modern hardware.
- Limited effective context length even with LSTM/GRU; long-range dependencies are weakly modeled.
- Harder to scale than transformers because of the sequential bottleneck.
These trade-offs are why transformers won at scale and why RNNs survive in low-resource and streaming settings.