Long Short-Term Memory

Motivation

A vanilla recurrent neural network cannot learn dependencies longer than a few dozen time steps because gradients propagated through the recurrence vanish or explode geometrically with sequence length. The Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997) replaces the simple state update with a gated mechanism that gives gradients an additive, near-identity path through time. This largely solves the vanishing-gradient problem and was the breakthrough that made long-sequence RNNs practical.

For roughly two decades — until transformers arrived — LSTMs were the default architecture for sequence modeling: machine translation, speech recognition, language modeling, and time-series forecasting all relied on them.

Architecture

The LSTM maintains two state vectors at each time step:

  • Cell state \(c_t \in \mathbb{R}^{d_h}\) — the long-term memory.
  • Hidden state \(h_t \in \mathbb{R}^{d_h}\) — what the network exposes externally and uses for output and gating computations.

Three sigmoid gates control how information flows through the cell:

  • Forget gate \(f_t\) — decides what to discard from \(c_{t-1}\).
  • Input gate \(i_t\) — decides what new candidate values to write.
  • Output gate \(o_t\) — decides what part of \(c_t\) to expose as \(h_t\).

Together with the candidate update \(\tilde c_t\), the recurrence is

\[ \begin{aligned} f_t &= \sigma(W_f [h_{t-1}; x_t] + b_f), \\ i_t &= \sigma(W_i [h_{t-1}; x_t] + b_i), \\ o_t &= \sigma(W_o [h_{t-1}; x_t] + b_o), \\ \tilde c_t &= \tanh(W_c [h_{t-1}; x_t] + b_c), \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde c_t, \\ h_t &= o_t \odot \tanh(c_t). \end{aligned} \]

Here \(\sigma\) is the sigmoid (so each gate is in \((0, 1)\)), \(\odot\) is elementwise product, and \([h_{t-1}; x_t]\) denotes concatenation. The parameter count is \(4 (d_h^2 + d_h d_x + d_h)\) — four times that of a vanilla RNN.
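As a concrete reference, here is a minimal single-step implementation of this recurrence in NumPy. The names (lstm_step, W_f, and so on) are illustrative and chosen to mirror the equations; they do not come from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the equations above.

    params holds W_f, W_i, W_o, W_c of shape (d_h, d_h + d_x)
    and b_f, b_i, b_o, b_c of shape (d_h,).
    """
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}; x_t]
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate update
    c = f * c_prev + i * c_tilde                           # additive cell update
    h = o * np.tanh(c)                                     # gated exposure of the cell
    return h, c

# Toy dimensions: d_x = 3 input features, d_h = 4 hidden units.
rng = np.random.default_rng(0)
d_x, d_h = 3, 4
params = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_x))
          for k in ("W_f", "W_i", "W_o", "W_c")}
params.update({k: np.zeros(d_h) for k in ("b_f", "b_i", "b_o", "b_c")})

h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), params)

# Parameter count matches 4 (d_h^2 + d_h d_x + d_h).
n_params = sum(p.size for p in params.values())
assert n_params == 4 * (d_h**2 + d_h * d_x + d_h)
```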

Diagram: one LSTM cell

The horizontal line at the top is the cell state \(c\) — the additive memory path. The forget gate \(f\) scales it elementwise, then the input gate \(i\) adds new candidate content \(\tilde c\). The output gate \(o\) produces \(h_t\) as a gated view of \(\tanh(c_t)\).

[Figure: one LSTM cell. The top line carries the cell state c, scaled by the forget gate f and then added to i ⊙ c̃ from the input gate; the output gate gives h_t = o ⊙ tanh(c_t).]

Why It Solves Vanishing Gradients

The cell-state update has the form

\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t. \]

Holding the gate activations fixed, the Jacobian of \(c_t\) with respect to \(c_{t-1}\) is \(\operatorname{diag}(f_t)\). When \(f_t \approx 1\) (the forget gate is “open”), this Jacobian is approximately the identity, and the gradient \(\partial c_T / \partial c_t\) propagated back through time is approximately \(\prod_{r=t+1}^T \operatorname{diag}(f_r) \approx I\). No geometric decay, no explosion.

Compare with a vanilla RNN, where the corresponding Jacobian is \(\operatorname{diag}(\tanh'(z_t)) W_{hh}\). The \(\tanh'\) factor is at most \(1\) and typically well below it, so unless \(W_{hh}\) has singular values large enough to compensate, the product over \(T\) steps decays geometrically; if \(W_{hh}\) is too large, the product explodes instead. Either way, the gradient signal degrades over long spans.
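The contrast is easy to check numerically. The sketch below, with illustrative scales and toy dimensions, propagates both factors over \(T\) steps: the LSTM path uses forget gates held mostly open (as after a positive bias initialization), while the vanilla-RNN path uses a weight matrix rescaled to spectral norm \(1\) (the most favorable case) and still collapses because of the \(\tanh'\) factors.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, T = 50, 100

# LSTM path: elementwise product of forget gates over T steps.
# Pre-activations around +4 keep f_t near sigmoid(4) ~ 0.98 ("open").
f_prod = np.ones(d_h)
for _ in range(T):
    f_t = 1.0 / (1.0 + np.exp(-rng.normal(loc=4.0, scale=0.5, size=d_h)))
    f_prod *= f_t

# Vanilla-RNN path: product of diag(tanh'(z_t)) @ W_hh over T steps,
# with W_hh rescaled to spectral norm 1 so any decay comes from tanh' alone.
W = rng.normal(size=(d_h, d_h))
W_hh = W / np.linalg.norm(W, 2)
J = np.eye(d_h)
for _ in range(T):
    z_t = rng.normal(size=d_h)
    J = np.diag(1.0 - np.tanh(z_t) ** 2) @ W_hh @ J

print("LSTM factor, mean over units:", f_prod.mean())        # stays on the order of 0.1 after 100 steps
print("RNN factor, spectral norm:   ", np.linalg.norm(J, 2))  # collapses by many orders of magnitude
```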

The forget gate is therefore the critical piece. Initializing \(b_f\) to a positive value (say, \(1\)) so that \(f_t\) starts close to \(1\) is standard practice and substantially improves early training.

Cost and Comparisons

Each LSTM cell costs \(4\times\) the FLOPs of a vanilla RNN cell. In return:

  • Trains stably on sequences of hundreds to a few thousand steps.
  • Effective context length is much greater than a vanilla RNN’s \(\sim 10\)–\(50\) steps.
  • Still sequential — no parallelization across time, which is the limit transformers overcome.

Compared to the GRU, the LSTM has more parameters and one extra gate (the GRU merges input and forget gates and has no separate cell state). LSTMs and GRUs are comparable in performance on most tasks; LSTM is the default when in doubt.

Variants

  • Peephole connections. Gates also depend on \(c_{t-1}\), not just \(h_{t-1}\). Modest gains; rarely used today.
  • Bidirectional LSTM. Two LSTMs running in opposite directions, hidden states concatenated. Standard for sequence labeling tasks where future context helps.
  • Stacked LSTM. Multiple LSTM layers; output of layer \(\ell\) feeds layer \(\ell + 1\). Standard for deeper models. Skip connections between layers further help training depth. A sketch of a stacked bidirectional model follows this list.
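As a sketch of the latter two variants, assuming PyTorch, nn.LSTM supports stacking and bidirectionality directly; the sizes and dropout rate below are illustrative.

```python
import torch
import torch.nn as nn

# A 2-layer bidirectional LSTM; hyperparameters are illustrative.
lstm = nn.LSTM(
    input_size=128,      # d_x
    hidden_size=256,     # d_h per direction
    num_layers=2,        # stacked: layer 2 consumes layer 1's outputs
    bidirectional=True,  # forward and backward passes, concatenated
    dropout=0.3,         # applied between layers, not on the recurrence
    batch_first=True,
)

x = torch.randn(8, 50, 128)          # (batch, time, features)
out, (h_n, c_n) = lstm(x)
print(out.shape)   # (8, 50, 512): 2 directions x 256 hidden units per step
print(h_n.shape)   # (4, 8, 256): num_layers x num_directions final hidden states
```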

Practical Notes

  • Forget-gate bias. Initialize \(b_f = 1\) (or higher); see the sketch after this list. Without this, \(f_t\) starts near \(0.5\) and the cell state decays by half per step — a bad starting point for learning long dependencies.
  • Use layer norm if training stability is an issue. “LayerNorm-LSTM” is a standard variant for large models.
  • Dropout between layers, not on the recurrent connection itself (which would inject noise into the cell state every step).
  • Gradient clipping is still useful — exploding gradients can still occur even though vanishing is largely solved.
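A minimal sketch of the bias-initialization and clipping points, assuming PyTorch's nn.LSTM and its documented gate ordering (input, forget, cell, output) within each bias vector:

```python
import torch
import torch.nn as nn

hidden_size = 256
lstm = nn.LSTM(input_size=128, hidden_size=hidden_size, num_layers=2, batch_first=True)

# Forget-gate bias: each bias vector is laid out as [input | forget | cell | output]
# chunks of hidden_size, so the forget slice is [hidden_size : 2 * hidden_size].
# nn.LSTM keeps two bias vectors per layer (bias_ih, bias_hh); setting both
# gives an effective forget bias of 2, consistent with "1 or higher".
with torch.no_grad():
    for name, param in lstm.named_parameters():
        if "bias" in name:
            param[hidden_size:2 * hidden_size].fill_(1.0)

# Schematic training step with gradient clipping; the loss is a placeholder.
optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3)
x = torch.randn(8, 50, 128)           # (batch, time, features)
out, _ = lstm(x)
loss = out.pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=1.0)
optimizer.step()
```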

Why It Matters Historically

The LSTM is the architecture that made deep learning work for sequences. Before it, recurrent networks were largely a theoretical curiosity. After it, they were the dominant tool for any task involving language or time. Transformers have replaced LSTMs in most large-scale settings since 2018, but the conceptual contribution — gating, an additive state-update path, the identity Jacobian — survives in transformer residual streams, gated state-space models, and most modern sequence architectures.

References

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.