Viterbi Algorithm
Motivation
Given a hidden Markov model and an observation sequence \(x_{1:T}\), the Viterbi algorithm (Viterbi 1967) finds the single most likely state path
\[ z_{1:T}^* = \operatorname*{arg\,max}_{z_{1:T}} p_\theta(z_{1:T}, x_{1:T}). \]
Equivalently, since \(p_\theta(z_{1:T} \mid x_{1:T}) \propto p_\theta(z_{1:T}, x_{1:T})\), this is the MAP state path. The naive search is over all \(K^T\) paths; Viterbi solves it exactly in \(O(K^2 T)\) time using dynamic programming.
The Viterbi path is not the same as the sequence of pointwise smoothing argmaxes from forward-backward. The smoothing argmax can yield a path with zero joint probability if any transition along it has \(A_{jk} = 0\). Viterbi optimizes the joint, which is the right thing for tasks like decoding a most-likely tag sequence.
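A minimal concrete instance (numbers invented for illustration): take \(K = 3\), \(T = 2\), with uniform emissions so they do not affect the comparison. Let \(\pi = (0.3, 0.4, 0.3)\) and let \(A\) have rows \((0, 1, 0)\), \((\tfrac12, 0, \tfrac12)\), \((0, 1, 0)\). The path probabilities \(\pi_{z_1} A_{z_1 z_2}\) are \(0.3\) for \((1,2)\) and \((3,2)\), and \(0.2\) for \((2,1)\) and \((2,3)\). The smoothing marginals give \(p(z_1 = 2) = 0.4\) and \(p(z_2 = 2) = 0.6\), so the pointwise argmax path is \((2, 2)\), which has joint probability zero since \(A_{22} = 0\). Viterbi instead returns a genuinely best path such as \((1, 2)\).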
Problem
Given HMM parameters \((\pi, A, B)\) and an observation sequence \(x_{1:T}\), return
\[ z_{1:T}^* = \operatorname*{arg\,max}_{z_{1:T} \in \{1, \ldots, K\}^T} p_\theta(z_{1:T}, x_{1:T}). \]
The output is a single state sequence — the joint MAP — and optionally its joint probability.
Key Ideas
Max-product instead of sum-product
The structural insight is that Viterbi is forward-backward with the sum replaced by a max. The forward recursion marginalizes over the previous state by summing; the Viterbi recursion commits to the best previous state by taking the maximum. Both run in \(O(K^2)\) per step because of the chain structure, but they answer different questions: the sum gives marginal evidence, the max gives the MAP path.
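In code the difference is a single reduction. A minimal NumPy sketch of one trellis step (the arrays here are invented placeholders, not from the text):

import numpy as np

K = 3
A = np.full((K, K), 1.0 / K)        # placeholder transition matrix (rows sum to 1)
b_t = np.array([0.2, 0.5, 0.3])     # placeholder emission likelihoods at time t
v_prev = np.array([0.1, 0.6, 0.3])  # alpha_{t-1} for forward, delta_{t-1} for Viterbi

alpha_t = b_t * (v_prev @ A)                        # sum-product: marginal evidence
delta_t = b_t * (v_prev[:, None] * A).max(axis=0)   # max-product: best-path score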
Bellman optimality on the trellis
Any optimal path passing through \(z_t = k\) has an optimal prefix ending at \(z_t = k\) — otherwise the path could be improved by replacing the prefix with the better one. So the maximum joint probability of any path ending at \((t, k)\) depends only on the maxes at \((t-1, j)\) for all \(j\). This is the standard DP-on-DAG argument: the trellis is a DAG, edges have multiplicative weights \(A_{jk} B_k(x_t)\), and Viterbi is shortest path in log-space.
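The multiplicative edge weights come straight from the HMM factorization of the joint:
\[ p_\theta(z_{1:T}, x_{1:T}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1} z_t} B_{z_t}(x_t), \]
so a path's weight is the product of its edge weights, and maximizing that product is exactly minimizing the summed \(-\log\) edge costs along the path.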
Backpointers reconstruct the argmax
Computing only the max value loses the actual path. Recording at each \((t, k)\) which previous state \(j\) achieved the max (the backpointer \(\psi_t(k)\)) lets the algorithm reconstruct the optimal state sequence by walking backward from \(z_T^* = \arg\max_k \delta_T(k)\).
Deriving the Solution
Define the Viterbi variable
\[ \delta_t(k) = \max_{z_{1:t-1}} p_\theta(z_{1:t-1}, z_t = k, x_{1:t}). \]
This is the maximum joint probability of any path that ends in state \(k\) at time \(t\). By the Bellman argument it satisfies
\[ \delta_1(k) = \pi_k B_k(x_1), \qquad \delta_t(k) = B_k(x_t) \max_{j=1}^K \left[\delta_{t-1}(j) A_{jk}\right]. \]
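To see why, peel off the last transition and emission from the joint and split the max over the previous state:
\[ \delta_t(k) = \max_{j} \left[ A_{jk} B_k(x_t) \max_{z_{1:t-2}} p_\theta(z_{1:t-2}, z_{t-1} = j, x_{1:t-1}) \right] = B_k(x_t) \max_{j} \left[ \delta_{t-1}(j) A_{jk} \right]. \]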
Compare with the forward recursion: identical in form, with \(\sum_j\) replaced by \(\max_j\). The backpointer is the argmax:
\[ \psi_t(k) = \operatorname*{arg\,max}_{j=1}^K \left[\delta_{t-1}(j) A_{jk}\right]. \]
After the forward sweep, the optimal path is read off by termination and backtracking:
\[ z_T^* = \arg\max_k \delta_T(k), \qquad z_{t-1}^* = \psi_t(z_t^*). \]
The maximum joint probability is \(\max_k \delta_T(k)\).
Algorithm
In Python/NumPy, with discrete emissions so that \(B\) can be stored as a \(K \times V\) matrix:

import numpy as np

def viterbi(pi, A, B, x):
    """Most likely state path (and its joint probability) for an HMM.

    pi: (K,) initial distribution; A: (K, K) transitions, A[j, k] = p(k | j);
    B: (K, V) emission matrix (discrete emissions); x: observation indices.
    """
    T, K = len(x), len(pi)
    delta = np.empty((T, K))           # delta[t, k]: best joint prob of a path ending in k at t
    psi = np.empty((T, K), dtype=int)  # psi[t, k]: predecessor state achieving that max
    # Initialize
    delta[0] = pi * B[:, x[0]]
    psi[0] = 0                         # unused: no predecessor at t = 1
    # Forward: max-product update with backpointers
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[j, k] = delta[t-1, j] * A[j, k]
        psi[t] = scores.argmax(axis=0)
        delta[t] = B[:, x[t]] * scores.max(axis=0)
    # Termination
    z = np.empty(T, dtype=int)
    z[-1] = delta[-1].argmax()
    # Backtrack
    for t in range(T - 2, -1, -1):
        z[t] = psi[t + 1, z[t + 1]]
    return z, delta[-1].max()
The key step is the max-product update — the recurrence for \(\delta_t(k)\).
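A hypothetical toy run (all numbers invented) to sanity-check the sketch above:

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
path, p_star = viterbi(pi, A, B, [0, 2, 1])   # path = [0, 1, 1], p_star ≈ 0.0097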
Walkthrough
Viterbi on a 3-state, 4-step trellis
Each cell \(\delta_t(k)\) holds the maximum joint probability of any path ending at state \(k\) at time \(t\). The argmax over \(\delta_T\) identifies the endpoint, and the backpointers trace the optimal path back through the trellis.
The final argmax over \(\delta_4\) is state \(3\) (the bottom red node), with joint probability \(0.04\). Following the backpointers from \(z_4^* = 3\) gives \(z_3^* = 1\), \(z_2^* = 1\), \(z_1^* = 2\). The optimal path is \(z^* = (2, 1, 1, 3)\) — highlighted in red. Note that at \(t = 1\) state \(2\) has the highest \(\delta\) value, but the path actually leaves state \(2\) at the next step because the transition \(2 \to 1\) paired with the emission at \(t = 2\) scores better than staying in state \(2\).
Correctness
The Bellman optimality of \(\delta_t\) is the standard dynamic-programming argument: any optimal path through \(z_t = k\) has an optimal prefix ending at \(z_t = k\), so \(\delta_t(k)\) is indeed the maximum over the claimed set of paths. The recursion respects this decomposition, and the backpointers reconstruct the argmax.
Complexity and Tradeoffs
\(O(K^2 T)\) time and \(O(KT)\) space, identical to forward-backward, since the inner work is the same matrix-vector style update. The space cost is mainly the backpointer table; when only the path probability is needed, the backpointers can be omitted and the space drops to \(O(K)\), as in the sketch below.
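A minimal sketch of that probability-only sweep, reusing the array conventions from the Algorithm section:

# Max joint probability only: keep just the current delta vector, no backpointers.
d = pi * B[:, x[0]]
for t in range(1, len(x)):
    d = B[:, x[t]] * (d[:, None] * A).max(axis=0)
p_star = d.max()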
Numerical stability
As with forward-backward, \(\delta_t\) underflows for long sequences. The clean fix is to work in log space:
\[ \log \delta_t(k) = \log B_k(x_t) + \max_{j} \left[\log \delta_{t-1}(j) + \log A_{jk}\right]. \]
This is just shortest-path on a trellis with edge costs \(-\log A_{jk} - \log B_k(x_t)\). No log-sum-exp is required because Viterbi uses \(\max\), not \(\sum\) — log space is exact here, not approximate.
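A sketch of the log-space forward sweep, again under the matrix conventions used above (NumPy warns on \(\log 0\), but the resulting \(-\infty\) is exactly what a forbidden transition should score):

logA, logB = np.log(A), np.log(B)          # -inf encodes zero-probability entries
log_delta = np.log(pi) + logB[:, x[0]]
for t in range(1, len(x)):
    scores = log_delta[:, None] + logA     # scores[j, k] = log delta[j] + log A[j, k]
    psi_t = scores.argmax(axis=0)          # backpointers, exactly as before
    log_delta = logB[:, x[t]] + scores.max(axis=0)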
When to Use It
| Situation | Approach |
|---|---|
| Single most likely state path in an HMM | Viterbi. |
| Per-state marginals or model evidence | Forward-backward instead. |
| Path through a CRF | Viterbi extends directly; replace transition × emission with the CRF potentials. |
| Path through a very large label space (e.g., word vocabularies) | Beam search — see Variants. Loses optimality but scales. |
| Sequence with continuous latent state | Kalman filter + RTS smoother, or particle methods. |
| Single observation (no temporal structure) | Not applicable — Viterbi is only useful on sequences. |
Variants
- Sequence labeling with CRFs. The conditional random field replaces \(A_{jk} B_k(x_t)\) with arbitrary potentials \(\phi(z_{t-1}, z_t, x_{1:T})\); the inference algorithm is structurally identical (see the sketch after this list).
- Beam search. Viterbi without storing all \(K\) candidates per step — keep only the top-\(B\). Loses optimality but scales to enormous label spaces (e.g., decoding a language model).
- Convolutional codes. Viterbi is the standard decoder for convolutional error-correcting codes; this is the application that gave it the name.
- A* on the trellis. With an admissible heuristic on remaining cost, A* can sometimes find the same MAP path while exploring fewer cells, useful when the trellis is huge but pruned heuristically.
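To make the CRF bullet concrete, a sketch of Viterbi over arbitrary log-potentials (the function and array names are invented for illustration): score[t, j, k] plays the role of \(\log A_{jk} + \log B_k(x_{t+1})\), and the recursion is otherwise unchanged.

import numpy as np

def viterbi_potentials(init, score):
    # init: (K,) log-potentials for the first position.
    # score[t, j, k]: log-potential of moving from state j at position t
    # to state k at position t + 1 (0-indexed), with emissions folded in.
    T = score.shape[0] + 1
    K = init.shape[0]
    psi = np.empty((T, K), dtype=int)
    delta = init.copy()
    for t in range(1, T):
        s = delta[:, None] + score[t - 1]   # s[j, k]
        psi[t] = s.argmax(axis=0)
        delta = s.max(axis=0)
    z = np.empty(T, dtype=int)
    z[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        z[t] = psi[t + 1, z[t + 1]]
    return z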