LSTM Cell-State Path Mitigates Vanishing Gradients
Claim
The LSTM cell-state recurrence
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t \]
has Jacobian \(\partial c_t / \partial c_{t-1}\) approximately equal to the identity when the forget gate satisfies \(f_t \approx 1\). The gradient propagated through \(T\) time steps,
\[ \frac{\partial c_T}{\partial c_0} = \prod_{t=1}^T \frac{\partial c_t}{\partial c_{t-1}}, \]
is therefore approximately \(I\): neither vanishing nor exploding (Hochreiter and Schmidhuber 1997). This is a qualitative improvement over the vanilla recurrent network, whose analogous Jacobian is \(\operatorname{diag}(\tanh'(z_t))\, W_{hh}\); since \(\tanh' \le 1\), its norm falls below \(1\) whenever the spectral norm of \(W_{hh}\) is modest, which is the generic vanishing regime.
Setup
Note: this argument is about the cell-state path \(c_0 \to c_1 \to \cdots \to c_T\), not about the full LSTM Jacobian \(\partial h_t / \partial h_{t-1}\), which involves additional terms through the gates and the output. The cell-state path is what matters for long-range gradient flow because the gates modulate the path but do not block it.
The Cell-State Jacobian
The LSTM cell-state update is
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t. \]
The gates \(f_t\), \(i_t\) and the candidate \(\tilde c_t\) depend on \(h_{t-1}\) and \(x_t\). Crucially they do not depend directly on \(c_{t-1}\) in the standard LSTM formulation (they depend only on \(h_{t-1}\), which is computed from \(c_{t-1}\) via \(h_{t-1} = o_{t-1} \odot \tanh(c_{t-1})\)).
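To make the dependency structure concrete, here is a minimal NumPy sketch of a single step (the weight names W_*, U_*, b_* and the params layout are illustrative, not taken from any library). The gates and the candidate are computed from \(x_t\) and \(h_{t-1}\) only; \(c_{t-1}\) enters solely through the additive update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. The gates and candidate depend on (x_t, h_prev) only;
    c_prev appears nowhere except in the additive cell-state update."""
    W_f, U_f, b_f = params["f"]   # forget gate
    W_i, U_i, b_i = params["i"]   # input gate
    W_g, U_g, b_g = params["g"]   # candidate
    W_o, U_o, b_o = params["o"]   # output gate

    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)
    g_t = np.tanh(W_g @ x_t + U_g @ h_prev + b_g)   # candidate c~_t
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)

    c_t = f_t * c_prev + i_t * g_t                  # additive cell-state update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```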
Decompose \(\partial c_t / \partial c_{t-1}\) into a direct path and an indirect path through \(h_{t-1}\):
\[ \frac{\partial c_t}{\partial c_{t-1}} = \underbrace{\operatorname{diag}(f_t)}_{\text{direct}} + \underbrace{\frac{\partial c_t}{\partial h_{t-1}} \cdot \frac{\partial h_{t-1}}{\partial c_{t-1}}}_{\text{through gates}}. \]
The direct term is \(\operatorname{diag}(f_t)\), the elementwise factor multiplying \(c_{t-1}\). The indirect term picks up the derivatives of the sigmoid gates \(f_t\) and \(i_t\), of the candidate's \(\tanh\), and of the \(\tanh\) inside \(h_{t-1}\) (scaled by \(o_{t-1}\)); these are all derivatives of saturating activations, so the contribution is small whenever the gates sit near their saturation values.
When forget gates are near \(1\):
- The direct term is \(\operatorname{diag}(f_t) \approx I\).
- The indirect term involves products of sigmoid derivatives (bounded by \(1/4\)) and \(\tanh\) derivatives (bounded by \(1\)), multiplied by the recurrent weight matrices; saturated gates drive the sigmoid derivatives toward \(0\), which keeps this term small.
So \(\partial c_t / \partial c_{t-1} \approx I + \text{small}\).
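A quick autograd check of this decomposition (a self-contained sketch with random illustrative parameters; the 0.3 weight scale and the forget bias of 3 are arbitrary choices that put the forget gates near 1): compute the full Jacobian \(\partial c_t / \partial c_{t-1}\), including the indirect path through \(h_{t-1} = o_{t-1} \odot \tanh(c_{t-1})\), and measure its distance from \(\operatorname{diag}(f_t)\) and from \(I\).

```python
import torch

torch.manual_seed(0)
n, d = 6, 4                          # hidden size, input size

# Illustrative random parameters; the forget bias of 3 pushes f_t toward 1.
U = {k: 0.3 * torch.randn(n, n) for k in "fig"}
W = {k: 0.3 * torch.randn(n, d) for k in "fig"}
b = {"f": torch.full((n,), 3.0), "i": torch.zeros(n), "g": torch.zeros(n)}

x_t = torch.randn(d)
o_prev = torch.rand(n)               # previous output gate, held fixed

def cell_update(c_prev):
    # h_{t-1} is itself a function of c_{t-1}: h_{t-1} = o_{t-1} * tanh(c_{t-1})
    h_prev = o_prev * torch.tanh(c_prev)
    f_t = torch.sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    i_t = torch.sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    g_t = torch.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])
    return f_t * c_prev + i_t * g_t  # additive cell-state update

c_prev = torch.randn(n)
J = torch.autograd.functional.jacobian(cell_update, c_prev)  # full dc_t/dc_{t-1}

h_prev = o_prev * torch.tanh(c_prev)
f_t = torch.sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])

print("||J - diag(f_t)||_F:", torch.norm(J - torch.diag(f_t)).item())  # size of the indirect path
print("||J - I||_F        :", torch.norm(J - torch.eye(n)).item())     # distance from identity
```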
The Identity-Like Product
Iterating \(T\) steps,
\[ \frac{\partial c_T}{\partial c_0} = \prod_{t=1}^T \frac{\partial c_t}{\partial c_{t-1}} \approx \prod_{t=1}^T \operatorname{diag}(f_t) = \operatorname{diag}\!\left(\prod_t f_t\right). \]
This is a diagonal matrix with entries \(\prod_t f_{t,k}\) for each coordinate \(k\). If the \(k\)-th forget gate is near \(1\) at every step (\(f_{t,k} \approx 1\)), then \(\prod_t f_{t,k} \approx 1\) — the gradient through coordinate \(k\) is preserved over arbitrary time spans.
By contrast the vanilla RNN Jacobian product satisfies \(\|\prod_t J_t\| \leq \prod_t \|J_t\|\) with each \(\|J_t\| < 1\) generically, so the product norm decays as \(\beta^T\) for some \(\beta < 1\). (proof)
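A toy numerical contrast under illustrative constants: a product of near-one forget gates over a few hundred steps stays order one, while a vanilla-RNN-style Jacobian product whose per-step spectral norm is bounded below \(1\) collapses geometrically.

```python
import numpy as np

T = 200
rng = np.random.default_rng(0)

# LSTM cell-state path: the per-step factor for coordinate k is f_{t,k}.
f = rng.uniform(0.99, 1.0, size=T)          # forget gates near 1
print("prod f_t       :", np.prod(f))       # stays order one

# Vanilla RNN path: per-step factor diag(tanh'(z_t)) @ W_hh.
n = 8
W_hh = rng.standard_normal((n, n))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)       # rescale spectral norm to 0.9
J_prod = np.eye(n)
for _ in range(T):
    z_t = rng.standard_normal(n)
    J_prod = np.diag(1.0 - np.tanh(z_t) ** 2) @ W_hh @ J_prod
print("||prod J_t||_2 :", np.linalg.norm(J_prod, 2))   # decays like beta^T
```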
Why This Is the Critical Property
The vanishing-gradient pathology of vanilla RNNs comes from the chain-rule product picking up one fresh shrinking factor per step. The LSTM’s cell-state path has no shrinking factor at all when forget gates are near \(1\) — each step’s cell-state Jacobian is a near-diagonal matrix with diagonal \(\approx 1\), and a product of such matrices stays near the identity.
The catch is that the network has to learn to set \(f_t \approx 1\) when long-range memory is needed and \(f_t < 1\) when forgetting is appropriate. This is what initialization tricks like \(b_f = 1\) are for: they bias the gates toward the identity-like regime so that gradients flow at the start of training, and the network can adjust them later.
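A minimal PyTorch sketch of that initialization trick. It assumes PyTorch's documented gate ordering (input, forget, cell, output) within the packed 4·hidden-size bias vectors of nn.LSTM; init_forget_bias is an illustrative helper name, not a library function.

```python
import torch
import torch.nn as nn

def init_forget_bias(lstm: nn.LSTM, value: float = 1.0) -> None:
    """Fill the forget-gate slice of every LSTM bias with `value`.

    Assumes PyTorch's gate ordering (input, forget, cell, output) in the
    packed 4*hidden_size bias vectors. Both bias_ih_l* and bias_hh_l*
    contain a forget slice, so filling both makes the combined forget
    pre-activation bias 2*value, which still serves the purpose of
    pushing f_t toward 1 at the start of training.
    """
    hidden = lstm.hidden_size
    for name, param in lstm.named_parameters():
        if name.startswith("bias"):
            with torch.no_grad():
                param[hidden:2 * hidden].fill_(value)  # forget-gate slice

lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2)
init_forget_bias(lstm, 1.0)  # bias gates toward the identity-like regime at init
```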
What Gating Does Not Do
Gating does not:
- Eliminate exploding gradients. The full per-step Jacobian also contains the indirect paths through the gates and the candidate \(\tilde c_t\); these involve the recurrent weight matrices and can push the per-step factor above \(1\) in some directions. Gradient clipping is still useful for LSTMs (a minimal sketch follows this list).
- Make gradients well-conditioned. The gradient may not vanish, but it can still be very anisotropic across coordinates. Adaptive optimizers like Adam help.
- Solve the parallelism problem. LSTMs are still sequential — the recurrence must be unrolled in time order for both forward and backward passes. This is why transformers replaced LSTMs at scale.
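For the first two points, a minimal sketch of the standard remedy, Adam plus gradient-norm clipping (the model, data, loss, and max_norm value here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 50, 16)                 # (batch, time, features)
out, _ = model(x)
loss = out.pow(2).mean()                   # placeholder loss

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the step
opt.step()
opt.zero_grad()
```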
The contribution of gating is precisely the additive, identity-like path through the cell state. The same idea, an additive path with a near-identity Jacobian, reappears in residual connections, where it solves the depth-direction analog of the same problem. \(\square\)
Connection to Residual Networks
The cell-state recurrence \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t\) has the same additive structure as a residual block \(h_{\ell} = h_{\ell-1} + F(h_{\ell-1})\), with the layer index \(\ell\) playing the role of time. In both cases the Jacobian has the form \(I + \text{something}\), ensuring a direct path for gradients. (proof of the residual case) The conceptual unity of these mechanisms, additive paths that preserve gradient flow, is one of the lasting design lessons of the deep-learning era.
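The analogy can be checked with the same kind of autograd computation as before (an illustrative sketch; the two-layer branch F is arbitrary): the Jacobian of a residual block is exactly \(I\) plus the branch Jacobian.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 8
F = nn.Sequential(nn.Linear(n, n), nn.Tanh(), nn.Linear(n, n))  # residual branch

def residual_block(h):
    return h + F(h)          # same additive structure as the cell update

h = torch.randn(n)
J = torch.autograd.functional.jacobian(residual_block, h)
print("||J - I||_F =", torch.norm(J - torch.eye(n)).item())  # this is exactly ||dF/dh||_F
```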