Viterbi Returns the Most Likely State Path

Claim

Let \(\delta_t(k) = \max_{z_{1:t-1}} p_\theta(z_{1:t-1}, z_t = k, x_{1:t})\), computed by the recursion

\[ \delta_1(k) = \pi_k B_k(x_1), \qquad \delta_t(k) = B_k(x_t) \max_{j} \left[\delta_{t-1}(j) A_{jk}\right], \]

with backpointers \(\psi_t(k) = \arg\max_j [\delta_{t-1}(j) A_{jk}]\). Then the path reconstructed by

\[ z_T^* = \arg\max_k \delta_T(k), \qquad z_{t-1}^* = \psi_t(z_t^*), \]

satisfies

\[ z_{1:T}^* = \operatorname*{arg\,max}_{z_{1:T}} p_\theta(z_{1:T}, x_{1:T}) \]

(Viterbi 1967).
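A minimal NumPy sketch of the recursion and backtracking exactly as stated above. The function name `viterbi`, the discrete-emission matrix `B[k, v]`, and working in raw probabilities rather than log-space are illustrative assumptions, not part of the claim:

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most likely state path for a discrete-emission HMM (illustrative sketch).

    pi : (K,)   initial distribution, pi[k] = p(z_1 = k)
    A  : (K, K) transitions, A[j, k] = p(z_t = k | z_{t-1} = j)
    B  : (K, V) emissions,   B[k, v] = p(x_t = v | z_t = k)
    x  : (T,)   observed symbols in {0, ..., V-1}
    Returns the path z* (length T) and the final vector delta_T.
    """
    T, K = len(x), len(pi)
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)          # backpointers psi_t(k)

    delta[0] = pi * B[:, x[0]]                 # delta_1(k) = pi_k B_k(x_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A     # scores[j, k] = delta_{t-1}(j) A_{jk}
        psi[t] = np.argmax(scores, axis=0)     # psi_t(k) = argmax_j delta_{t-1}(j) A_{jk}
        delta[t] = B[:, x[t]] * np.max(scores, axis=0)

    # backtrack: z_T* = argmax_k delta_T(k), then z_{t-1}* = psi_t(z_t*)
    z = np.zeros(T, dtype=int)
    z[-1] = np.argmax(delta[-1])
    for t in range(T - 1, 0, -1):
        z[t - 1] = psi[t, z[t]]
    return z, delta[-1]
```

In practice the same recursion is usually run on log-probabilities (products become sums, the max is unchanged) to avoid underflow on long sequences.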

Proof

The argument has two parts: the recursion correctly computes \(\delta_t\), and backtracking the argmaxes recovers a maximizing path.

Part 1: \(\delta_t\) satisfies its definition.

By induction on \(t\).

Base case (\(t = 1\)). \(\delta_1(k) = p_\theta(z_1 = k, x_1) = \pi_k B_k(x_1)\). ✓

Inductive step. Assume \(\delta_{t-1}(j) = \max_{z_{1:t-2}} p_\theta(z_{1:t-2}, z_{t-1} = j, x_{1:t-1})\) for every \(j\). The HMM joint factorizes, so the joint probability along any path ending at \(z_t = k\) can be written

\[ p_\theta(z_{1:t-1}, z_t = k, x_{1:t}) = p_\theta(z_{1:t-2}, z_{t-1}, x_{1:t-1}) \cdot A_{z_{t-1}, k} \cdot B_k(x_t). \]

Maximizing over \(z_{1:t-1} = (z_{1:t-2}, z_{t-1})\) separates:

\[ \max_{z_{1:t-1}} p_\theta(z_{1:t-1}, z_t = k, x_{1:t}) = B_k(x_t) \max_{j} \left[A_{jk} \cdot \max_{z_{1:t-2}} p_\theta(z_{1:t-2}, z_{t-1} = j, x_{1:t-1})\right]. \]

By the inductive hypothesis the inner max is \(\delta_{t-1}(j)\), giving exactly the recursion claimed. ✓

By induction \(\delta_T(k)\) is the maximum joint probability of any path ending in state \(k\), and so \(\max_k \delta_T(k) = \max_{z_{1:T}} p_\theta(z_{1:T}, x_{1:T})\).
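A quick sanity check of this conclusion under hypothetical dimensions (\(K = 3\) states, \(V = 2\) symbols, \(T = 5\), random parameters): the final \(\max_k \delta_T(k)\) from the recursion matches a brute-force maximization over all \(K^T\) paths.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K, T, V = 3, 5, 2                              # states, sequence length, symbols (hypothetical sizes)
pi = rng.dirichlet(np.ones(K))                 # initial distribution
A = rng.dirichlet(np.ones(K), size=K)          # A[j, k] = p(z_t = k | z_{t-1} = j)
B = rng.dirichlet(np.ones(V), size=K)          # B[k, v] = p(x_t = v | z_t = k)
x = rng.integers(V, size=T)                    # a random observation sequence

# delta recursion from the claim
delta = pi * B[:, x[0]]
for t in range(1, T):
    delta = B[:, x[t]] * np.max(delta[:, None] * A, axis=0)

# brute force: enumerate all K**T state paths
def joint(z):
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, T):
        p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
    return p

brute = max(joint(z) for z in itertools.product(range(K), repeat=T))
assert np.isclose(delta.max(), brute)
```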

Part 2: backtracking recovers a maximizing path.

For \(z_T^* = \arg\max_k \delta_T(k)\), a path of length \(T\) achieving \(\delta_T(z_T^*)\) has the form \((z_{1:T-1}^*, z_T^*)\). By the separation in Part 1, its last prefix state must maximize \(\delta_{T-1}(j) A_{j, z_T^*}\) over \(j\), so we may take \(z_{T-1}^* = \psi_T(z_T^*)\), and the remaining prefix \(z_{1:T-2}^*\) must in turn achieve \(\delta_{T-1}(z_{T-1}^*)\). Recursing,

\[ z_{t-1}^* = \psi_t(z_t^*) = \arg\max_j [\delta_{t-1}(j) A_{j, z_t^*}]. \]

At every step the reconstructed prefix \(z_{1:t}^*\) therefore achieves \(\delta_t(z_t^*)\), the maximum joint probability over paths ending in \(z_t^*\). In particular \(z_{1:T}^*\) achieves \(\delta_T(z_T^*) = \max_{z_{1:T}} p_\theta(z_{1:T}, x_{1:T})\). \(\square\)
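Continuing the sanity check above (this snippet is not standalone: it reuses `viterbi` from the sketch after the Claim and `pi`, `A`, `B`, `x`, `joint`, `brute` from the check after Part 1), the backtracked path attains that same maximum:

```python
# reuses viterbi() from the sketch after the Claim, and pi, A, B, x, joint, brute from the Part 1 check
z_star, _ = viterbi(pi, A, B, x)
assert np.isclose(joint(z_star), brute)   # the reconstructed path achieves the brute-force maximum
```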

Where the Bellman Principle Enters

The induction in Part 1 is the principle of optimality: an optimal trajectory has optimal sub-trajectories. The chain factorization of the HMM joint is what makes this principle apply — each step’s contribution depends only on adjacent states, so optimizing globally decomposes into stage-by-stage maxes. Without the Markov structure (e.g., if \(p(z_t \mid z_{1:t-1})\) depended on the full history), the recursion would have to track an exponentially growing state.

Ties

If multiple \(j\) achieve the max in some \(\psi_t(k)\), the argmax is not unique — there are multiple optimal paths. The algorithm returns one of them (whichever the implementation’s argmax breaks toward); all have the same joint probability.
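A small illustration of a fully tied problem, reusing the hypothetical `viterbi` sketch from the Claim: with uniform initial, transition, and emission probabilities every path has the same joint probability \(2^{-2T}\), so every path is optimal. NumPy's `argmax` returns the first maximizer, so that sketch returns the all-zeros path.

```python
import numpy as np

pi = np.array([0.5, 0.5])
A = np.full((2, 2), 0.5)        # uniform transitions
B = np.full((2, 2), 0.5)        # uniform emissions over 2 symbols
x = np.array([0, 1, 0])

# every path z has p(z, x) = (0.5 * 0.5)**3 = 2**-6, so all 8 paths tie;
# np.argmax breaks each tie toward the smallest index
path, _ = viterbi(pi, A, B, x)  # viterbi() from the sketch after the Claim
print(path)                     # [0 0 0] under that tie-breaking rule
```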

References

Viterbi, A. 1967. “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm.” IEEE Transactions on Information Theory 13 (2): 260–69. https://doi.org/10.1109/tit.1967.1054010.