Gated Recurrent Unit

Motivation

The Gated Recurrent Unit (GRU) (Cho et al. 2014) is a simpler alternative to the LSTM. It merges the LSTM’s forget and input gates into a single “update” gate and eliminates the separate cell state, leaving two gates where the LSTM has three. The result is a recurrence with \(\sim 75\%\) of an LSTM’s parameter count and comparable performance on most sequence-modeling tasks.

GRUs and LSTMs are usually interchangeable: empirical comparisons show small, task-dependent differences. The GRU is preferred when parameter count or compute matters; the LSTM is the conservative default.

Architecture

The GRU keeps a single state vector \(h_t \in \mathbb{R}^{d_h}\) — no separate cell state — and uses two sigmoid gates:

  • Update gate \(z_t\) — interpolates between keeping the old state and writing a new one.
  • Reset gate \(r_t\) — controls how much of the old state is used in computing the candidate update.

The recurrence:

\[ \begin{aligned} z_t &= \sigma(W_z [h_{t-1}; x_t] + b_z), \\ r_t &= \sigma(W_r [h_{t-1}; x_t] + b_r), \\ \tilde h_t &= \tanh(W_h [r_t \odot h_{t-1}; x_t] + b_h), \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t. \end{aligned} \]
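Read literally, the recurrence is a few lines of code. The sketch below is a minimal NumPy rendering of the equations above, not a tuned implementation; the names (gru_step, init_params, d_h, d_x) and the Gaussian initialization are illustrative assumptions.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def init_params(d_h, d_x, rng, scale=0.1):
        """One weight matrix over [h; x] and one bias per transform (z, r, candidate)."""
        p = {}
        for name in ("z", "r", "h"):
            p["W_" + name] = rng.normal(scale=scale, size=(d_h, d_h + d_x))
            p["b_" + name] = np.zeros(d_h)
        return p

    def gru_step(p, h_prev, x):
        """One step of the GRU recurrence; returns h_t."""
        hx = np.concatenate([h_prev, x])               # [h_{t-1}; x_t]
        z = sigmoid(p["W_z"] @ hx + p["b_z"])          # update gate
        r = sigmoid(p["W_r"] @ hx + p["b_r"])          # reset gate
        rhx = np.concatenate([r * h_prev, x])          # [r_t * h_{t-1}; x_t]
        h_cand = np.tanh(p["W_h"] @ rhx + p["b_h"])    # candidate state
        return (1.0 - z) * h_prev + z * h_cand         # interpolate old vs. new

    # Run a few steps on random inputs (sizes are illustrative).
    rng = np.random.default_rng(0)
    d_h, d_x = 4, 3
    p = init_params(d_h, d_x, rng)
    h = np.zeros(d_h)
    for x in rng.normal(size=(5, d_x)):
        h = gru_step(p, h, x)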

Three weight matrices instead of the LSTM’s four. Total parameters: \(3(d_h^2 + d_h d_x + d_h)\).
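The count follows directly from the shapes: each of the three transforms has a \(d_h \times (d_h + d_x)\) weight matrix and a length-\(d_h\) bias. A quick check with illustrative sizes (the 4-matrix LSTM count is included only to show the ratio):

    # GRU vs. LSTM parameter count for one layer, concatenated-input formulation.
    d_h, d_x = 256, 128
    per_transform = d_h * (d_h + d_x) + d_h   # one weight matrix + one bias
    gru_total = 3 * per_transform
    lstm_total = 4 * per_transform
    print(gru_total, lstm_total, gru_total / lstm_total)   # ratio is 0.75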

Diagram: one GRU cell

The state line at the top runs through a single update gate \(z\) that interpolates between the old state and a candidate \(\tilde h\). The reset gate \(r\) damps \(h_{t-1}\) before it enters the candidate computation.

[Figure: a single GRU cell, as described above. Caption: \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t\); the reset gate \(r_t\) damps the recurrent input to the candidate.]

How It Differs from LSTM

The three key differences (the state updates are compared side by side after this list):

  • No separate cell state. The LSTM has both \(c_t\) (memory) and \(h_t\) (exposed output, also used for gates). The GRU has only \(h_t\). The cost of this simplification is that the GRU cannot independently control what is kept in memory and what is exposed externally.

  • Coupled forget and input gates. The LSTM has independent \(f_t\) (how much old to keep) and \(i_t\) (how much new to write). The GRU has a single \(z_t\) that handles both: \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t\). The convex combination structure means new information necessarily displaces old information.

  • No output gate. The LSTM has \(o_t\) controlling what part of the cell state is exposed. The GRU has no equivalent — the entire state is exposed at every step.
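To see the coupling concretely, compare the standard LSTM state update (with candidate \(g_t\)) against the GRU update repeated from above:

\[ \begin{aligned} \text{LSTM:}\quad c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t), \\ \text{GRU:}\quad h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t. \end{aligned} \]

The LSTM can set \(f_t\) and \(i_t\) independently — keep everything and still write, or write nothing and still decay — and can hide part of \(c_t\) behind \(o_t\); the GRU’s single \(z_t\) and fully exposed state allow neither.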

Why It Avoids Vanishing Gradients

Like the LSTM, the GRU has an additive state-update path. When \(z_t \approx 0\), the recurrence collapses to \(h_t \approx h_{t-1}\), giving \(\partial h_t / \partial h_{t-1} \approx I\). The product of Jacobians along such a chain is approximately the identity, so gradients through it neither vanish nor explode. The mechanism is the same one the LSTM’s forget gate provides on the cell-state path; only the parameterization differs.
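Differentiating the update makes the argument explicit (a sketch; \(\operatorname{diag}(\cdot)\) denotes a diagonal matrix built from a vector):

\[ \frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(1 - z_t) + \operatorname{diag}(\tilde h_t - h_{t-1})\, \frac{\partial z_t}{\partial h_{t-1}} + \operatorname{diag}(z_t)\, \frac{\partial \tilde h_t}{\partial h_{t-1}}. \]

When \(z_t \approx 0\) because the sigmoid is saturated, the factor \(z_t \odot (1 - z_t)\) inside \(\partial z_t / \partial h_{t-1}\) is also near zero, so the last two terms drop out and the Jacobian is close to \(\operatorname{diag}(1 - z_t) \approx I\).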

When to Pick GRU vs. LSTM

  • GRU: smaller models, lower compute budget, faster training, comparable accuracy on most tasks. Often a slight win for tasks with shorter sequences or smaller datasets.
  • LSTM: longer sequences, larger models, when separate memory and output channels are useful. Standard baseline in most published RNN work.

The empirical evidence is mixed and noisy. Greff et al. (2017) found no consistent winner in a large hyperparameter search across speech, handwriting, and music tasks. Use the GRU if compute matters or as a parameter-efficient default; otherwise the LSTM remains a safe choice.

Practical Notes

  • Initializing \(b_z\) to a small negative value biases the update gate toward keeping the old state — analogous to the LSTM’s positive forget-gate bias initialization (a sketch follows this list).
  • Layer normalization helps for the same reasons it helps in LSTMs.
  • Dropout between layers; not on the recurrent connection.
  • Gradient clipping remains useful: gating addresses vanishing gradients, not exploding ones.
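A minimal sketch of the first note, assuming the concatenated-weight parameterization used in the recurrence above. The value \(-1.0\) is illustrative, and frameworks differ in gate ordering and in which direction of \(z_t\) keeps the old state, so check the convention before porting this.

    import numpy as np

    def init_gru_biases(d_h, update_bias=-1.0):
        """Biases for the three GRU transforms. A negative b_z makes
        sigmoid(b_z) < 0.5 at the start of training, so
        h_t = (1 - z_t) * h_{t-1} + z_t * h_cand initially favors
        keeping the old state."""
        b_z = np.full(d_h, update_bias)   # update gate: bias toward carrying h_{t-1}
        b_r = np.zeros(d_h)               # reset gate: no initial preference
        b_h = np.zeros(d_h)               # candidate transform
        return b_z, b_r, b_h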

Why Both Are Less Used Now

Transformers replaced LSTMs and GRUs in most large-scale settings starting around 2018. The reasons are about scale and parallelism, not gating: a transformer processes all positions in a sequence in parallel during training, while an RNN must step through them sequentially because each state depends on the previous one. For long sequences on GPU hardware this is the dominant cost factor, and no improvement in gating fixes it.

For low-resource and online settings — streaming inference, on-device models, sequences too long for transformers’ quadratic memory — gated RNNs remain useful. State-space models (S4, Mamba) and linear-attention variants are recent alternatives that revisit the recurrent paradigm with structural improvements over both LSTM and GRU.

References

Cho, Kyunghyun, Bart van Merrienboer, Çaglar Gülçehre, et al. 2014. “Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation.” Empirical Methods in Natural Language Processing (EMNLP), 1724–34. https://aclanthology.org/D14-1179/.