Forward-Backward Algorithm
Motivation
Given a hidden Markov model with parameters \(\theta\) and an observation sequence \(x_{1:T}\), two quantities are central:
- The likelihood \(p_\theta(x_{1:T})\), used to evaluate models and as the objective in parameter learning.
- The smoothed posteriors \(p_\theta(z_t \mid x_{1:T})\) over hidden states at each time step, used for downstream prediction and as the E-step of Baum-Welch.
Naively summing over all \(K^T\) state paths is intractable. The forward-backward algorithm (Baum and Petrie 1966) exploits the chain structure of the HMM to compute both quantities in \(O(K^2 T)\) time using two passes: a forward pass that handles filtering and the likelihood, and a backward pass that, combined with the forward, gives smoothing.
Problem
Given an HMM with \(K\) hidden states, initial distribution \(\pi\), transition matrix \(A\), and emission probabilities \(B_k(x)\), plus an observation sequence \(x_{1:T}\):
- Compute the marginal likelihood \(p_\theta(x_{1:T})\).
- Compute the smoothed posteriors \(\gamma_t(k) = p_\theta(z_t = k \mid x_{1:T})\) for every \(t\) and \(k\).
- Compute the pairwise smoothed posteriors \(\xi_t(j, k) = p_\theta(z_t = j, z_{t+1} = k \mid x_{1:T})\), the sufficient statistics for parameter learning.
Key Ideas
Two messages on the chain
The HMM is a chain factor graph, and inference on it is the sum-product algorithm. Forward-backward names the two messages that flow along the chain in opposite directions.
The forward variable collects evidence from the past observations: \[ \alpha_t(k) = p_\theta(x_{1:t}, z_t = k). \]
The backward variable collects evidence from the future observations: \[ \beta_t(k) = p_\theta(x_{t+1:T} \mid z_t = k). \]
Each variable depends only on its immediate neighbor through a local matrix-vector product — the chain structure means no message has to consider more than two adjacent time steps at once.
Filtering vs. smoothing
The forward pass alone gives the filter: \(p_\theta(z_t \mid x_{1:t})\), the best estimate of the hidden state given only observations up to \(t\). This is what one runs in a real-time setting where future data isn’t available yet.
The forward and backward variables combined give the smoother: \(p_\theta(z_t \mid x_{1:T})\), the best estimate of the hidden state given the entire sequence. Smoothing is more accurate than filtering because it uses information from both directions; it requires the full sequence in memory and is therefore an offline computation.
Sum-product on a chain is the general pattern
Forward-backward is one instance of belief propagation (sum-product) on a chain. The same structure on a tree gives the tree sum-product algorithm; on a general graph it generalizes to junction-tree inference (exact, exponential in treewidth) or loopy belief propagation (approximate). The forward and backward passes are the two messages flowing between adjacent variables in the chain.
Deriving the Solution
The forward variable satisfies a recursion derived by marginalizing over the previous state: \[ \alpha_1(k) = \pi_k B_k(x_1), \qquad \alpha_t(k) = B_k(x_t) \sum_{j=1}^K \alpha_{t-1}(j) A_{jk}. \]
The backward variable satisfies a symmetric backward recursion: \[ \beta_T(k) = 1, \qquad \beta_t(k) = \sum_{j=1}^K A_{kj} B_j(x_{t+1}) \beta_{t+1}(j). \]
Two outputs from the forward pass alone:
- Likelihood: \(p_\theta(x_{1:T}) = \sum_k \alpha_T(k)\).
- Filtering posterior: \(p_\theta(z_t = k \mid x_{1:t}) \propto \alpha_t(k)\), i.e. the forward variable normalized over \(k\).
The smoothed single-state posterior combines the two passes:
\[ \gamma_t(k) = p_\theta(z_t = k \mid x_{1:T}) = \frac{\alpha_t(k) \beta_t(k)}{\sum_j \alpha_t(j) \beta_t(j)}. \]
The numerator is \(\alpha_t(k) \beta_t(k) = p_\theta(x_{1:t}, z_t = k) \cdot p_\theta(x_{t+1:T} \mid z_t = k) = p_\theta(x_{1:T}, z_t = k)\); the denominator is \(p_\theta(x_{1:T})\), which equals \(\sum_k \alpha_T(k)\) (any time-slice gives the same total).
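The parenthetical claim — \(\sum_k \alpha_t(k)\beta_t(k) = p_\theta(x_{1:T})\) at every \(t\) — is easy to verify numerically. A quick sketch with randomly generated HMM parameters (all names and dimensions here are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, T = 3, 4, 6
pi = rng.dirichlet(np.ones(K))            # initial distribution
A = rng.dirichlet(np.ones(K), size=K)     # transition matrix, rows sum to 1
B = rng.dirichlet(np.ones(V), size=K)     # K x V discrete emission matrix
x = rng.integers(0, V, size=T)            # random observation sequence

alpha = np.zeros((T, K))
beta = np.zeros((T, K))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

totals = (alpha * beta).sum(axis=1)       # should equal p(x_{1:T}) at every t
print(np.allclose(totals, totals[0]))     # prints True
```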
The smoothed pairwise posterior, needed for the EM transition update, is
\[ \xi_t(j, k) = p_\theta(z_t = j, z_{t+1} = k \mid x_{1:T}) \propto \alpha_t(j) A_{jk} B_k(x_{t+1}) \beta_{t+1}(k), \]
normalized over \((j, k)\) to sum to \(1\). These are the sufficient statistics for the M-step in Baum-Welch.
Algorithm
# Forward pass
for k = 1..K:
    α[1][k] = π[k] * B[k](x[1])
for t = 2..T, k = 1..K:
    α[t][k] = B[k](x[t]) * Σ_j α[t-1][j] * A[j][k]
likelihood = Σ_k α[T][k]

# Backward pass
for k = 1..K:
    β[T][k] = 1
for t = T-1..1, k = 1..K:
    β[t][k] = Σ_j A[k][j] * B[j](x[t+1]) * β[t+1][j]

# Smoothed marginals
for t = 1..T, k = 1..K:
    γ[t][k] = α[t][k] * β[t][k] / likelihood

# Pairwise marginals (if needed)
for t = 1..T-1, j = 1..K, k = 1..K:
    ξ[t][j][k] = α[t][j] * A[j][k] * B[k](x[t+1]) * β[t+1][k] / likelihood
The key step is the local recursion: each \(\alpha_t\) is one matrix-vector product through \(A\) followed by elementwise multiplication by \(B(x_t)\); \(\beta_t\) is symmetric.
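The pseudocode maps almost line-for-line onto NumPy. A minimal unscaled sketch (the function name and the layout of `B` as a \(K \times V\) matrix of discrete emission probabilities are assumptions of this sketch):

```python
import numpy as np

def forward_backward(pi, A, B, x):
    """Unscaled forward-backward for a discrete-emission HMM.

    pi: (K,) initial distribution; A: (K, K) transition matrix;
    B: (K, V) emission probabilities; x: length-T list of observation indices.
    Returns (likelihood, gamma, xi).
    """
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))

    # Forward pass: alpha[t, k] = B[k, x[t]] * sum_j alpha[t-1, j] * A[j, k]
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
    likelihood = alpha[-1].sum()

    # Backward pass: beta[t, k] = sum_j A[k, j] * B[j, x[t+1]] * beta[t+1, j]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

    # Smoothed single and pairwise marginals
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, x[1:]].T * beta[1:])[:, None, :]) / likelihood
    return likelihood, gamma, xi
```

The `@` products are exactly the local matrix-vector recursions; for long sequences this unscaled version underflows, as discussed under Numerical stability.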
Walkthrough
A two-state HMM with three observations
Two hidden states (Hot, Cold), three observation values \(\{1, 2, 3\}\). Parameters:
- \(\pi = (0.6, 0.4)\).
- \(A = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}\) (rows: from H, from C).
- Emissions: \(B_H = (0.2, 0.4, 0.4)\), \(B_C = (0.5, 0.4, 0.1)\).
Observation sequence: \(x = (3, 1, 3)\).
Forward pass.
| \(t\) | \(\alpha_t(H)\) | \(\alpha_t(C)\) |
|---|---|---|
| 1 | \(\pi_H B_H(3) = 0.6 \cdot 0.4 = 0.240\) | \(\pi_C B_C(3) = 0.4 \cdot 0.1 = 0.040\) |
| 2 | \(B_H(1) [\alpha_1(H) A_{HH} + \alpha_1(C) A_{CH}] = 0.2 \cdot 0.184 = 0.0368\) | \(B_C(1) [\alpha_1(H) A_{HC} + \alpha_1(C) A_{CC}] = 0.5 \cdot 0.096 = 0.0480\) |
| 3 | \(B_H(3) [\alpha_2(H) A_{HH} + \alpha_2(C) A_{CH}] = 0.4 \cdot 0.04496 = 0.017984\) | \(B_C(3) [\alpha_2(H) A_{HC} + \alpha_2(C) A_{CC}] = 0.1 \cdot 0.03984 = 0.003984\) |
Likelihood: \(p(x) = \alpha_3(H) + \alpha_3(C) \approx 0.02197\).
Backward pass.
| \(t\) | \(\beta_t(H)\) | \(\beta_t(C)\) |
|---|---|---|
| 3 | \(1\) | \(1\) |
| 2 | \(A_{HH} B_H(3) + A_{HC} B_C(3) = 0.28 + 0.03 = 0.31\) | \(A_{CH} B_H(3) + A_{CC} B_C(3) = 0.16 + 0.06 = 0.22\) |
| 1 | \(A_{HH} B_H(1) \beta_2(H) + A_{HC} B_C(1) \beta_2(C) = 0.0434 + 0.033 = 0.0764\) | \(A_{CH} B_H(1) \beta_2(H) + A_{CC} B_C(1) \beta_2(C) = 0.0248 + 0.066 = 0.0908\) |
Smoothed posteriors.
| \(t\) | \(\gamma_t(H)\) | \(\gamma_t(C)\) |
|---|---|---|
| 1 | \(0.240 \cdot 0.0764 / 0.02197 \approx 0.835\) | \(0.040 \cdot 0.0908 / 0.02197 \approx 0.165\) |
| 2 | \(0.0368 \cdot 0.31 / 0.02197 \approx 0.519\) | \(0.0480 \cdot 0.22 / 0.02197 \approx 0.481\) |
| 3 | \(0.017984 \cdot 1 / 0.02197 \approx 0.819\) | \(0.003984 \cdot 1 / 0.02197 \approx 0.181\) |
Each row of \(\gamma\) sums to \(1\), as a probability distribution should. The observation \(3\) is well explained by the Hot state (emission probability \(0.4\) vs. \(0.1\) from Cold), so \(\gamma_1\) and \(\gamma_3\) both put most of their mass on H. At \(t = 2\) the observed \(1\) favors C, but the surrounding \(3\)s pull the estimate back through smoothing, leaving the posterior near \(50/50\).
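The tables above can be reproduced in a few lines of NumPy. A standalone sanity-check sketch (indexing H as state 0, C as state 1, and observations \(\{1,2,3\}\) as 0-based indices are implementation choices of this sketch):

```python
import numpy as np

# Parameters of the two-state walkthrough HMM (H = index 0, C = index 1).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],   # emissions from H over {1, 2, 3}
              [0.5, 0.4, 0.1]])  # emissions from C
x = [2, 0, 2]                    # observation sequence (3, 1, 3), 0-based

# Forward pass
alpha = [pi * B[:, x[0]]]
for t in range(1, 3):
    alpha.append(B[:, x[t]] * (alpha[-1] @ A))
likelihood = alpha[-1].sum()     # ~ 0.021968

# Backward pass: beta[t] = A @ (B[:, x[t+1]] * beta[t+1])
beta = [np.ones(2)]              # beta_3 = 1
for t_next in (2, 1):            # computes beta_2, then beta_1
    beta.insert(0, A @ (B[:, x[t_next]] * beta[0]))

gamma = np.array(alpha) * np.array(beta) / likelihood
print(np.round(gamma, 3))        # rows: [0.835 0.165], [0.519 0.481], [0.819 0.181]
```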
Correctness
The forward recursion follows from marginalizing \(p(x_{1:t}, z_t = k)\) over \(z_{t-1}\): \[ \alpha_t(k) = \sum_j p(x_{1:t-1}, z_{t-1} = j) \, p(z_t = k \mid z_{t-1} = j) \, p(x_t \mid z_t = k) = B_k(x_t) \sum_j \alpha_{t-1}(j) A_{jk}. \]
The backward recursion follows symmetrically by marginalizing \(p(x_{t+1:T} \mid z_t = k)\) over \(z_{t+1}\). The smoothing identity \(\gamma_t(k) \propto \alpha_t(k) \beta_t(k)\) follows from the HMM conditional independence \(x_{1:t} \perp x_{t+1:T} \mid z_t\): the past and future observations decouple given the current state, so their joint density factors into \(\alpha\) and \(\beta\).
Complexity and Tradeoffs
Time \(O(K^2 T)\), space \(O(KT)\) to store \(\alpha\) and \(\beta\). Each of the \(T\) steps is a \(K \times K\) matrix-vector product. The space cost matters in long-sequence applications: gene-length DNA sequences, for example, may require streaming variants that trade time for space.
Numerical stability
For long sequences, \(\alpha_t(k)\) underflows to zero in floating point: each step multiplies in transition and emission probabilities \(\le 1\). Two standard fixes:
- Rescaling. At each \(t\), divide \(\alpha_t(\cdot)\) by \(c_t = \sum_k \alpha_t(k)\), so \(\alpha_t\) becomes the filtering posterior. Track the scale factors; the log-likelihood is \(\log p_\theta(x_{1:T}) = \sum_t \log c_t\). Apply the same scale factors to \(\beta_t\) to keep \(\gamma_t\) correct.
- Log-space. Compute \(\log \alpha_t\) directly using the log-sum-exp trick, \(\log \sum_j e^{a_j} = a^* + \log \sum_j e^{a_j - a^*}\) with \(a^* = \max_j a_j\). Slower than rescaling because of the elementwise exponentials, but trivially correct and what most reference implementations use.
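The rescaling fix can be sketched as follows (assuming the same discrete-emission layout as above: `B` a \(K \times V\) matrix, `x` a sequence of 0-based symbol indices; the function name is illustrative):

```python
import numpy as np

def forward_log_likelihood(pi, A, B, x):
    """Scaled forward pass: returns log p(x) without underflow.

    After each normalization, alpha_hat is the filtering posterior
    p(z_t | x_{1:t}); the scale factors c_t accumulate the log-likelihood.
    """
    alpha_hat = pi * B[:, x[0]]
    c = alpha_hat.sum()              # c_1 = p(x_1)
    alpha_hat = alpha_hat / c
    log_lik = np.log(c)
    for t in range(1, len(x)):
        alpha_hat = B[:, x[t]] * (alpha_hat @ A)
        c = alpha_hat.sum()          # c_t = p(x_t | x_{1:t-1})
        alpha_hat = alpha_hat / c
        log_lik += np.log(c)
    return log_lik
```

Exponentiating the result recovers the likelihood computed by the unscaled recursion, but the computation itself never leaves a numerically safe range.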
When to Use It
| Situation | Approach |
|---|---|
| Likelihood evaluation or smoothed posteriors on an HMM | Forward-backward. |
| Online filtering only | Run the forward pass alone; \(\beta\) is never needed. |
| Most likely state sequence (rather than per-state marginals) | Viterbi — max-product instead of sum-product. |
| HMM parameter learning | Forward-backward inside Baum-Welch. |
| Continuous-state linear-Gaussian model | Kalman filter (forward) + RTS smoother (backward). |
| Tree-structured model (not a chain) | Generalize to belief propagation on the tree. |