Hidden Markov Models

Motivation

A hidden Markov model (HMM) (Baum and Petrie 1966) is the simplest interesting latent-variable model for sequences. The setting: at each time step \(t\), an unobserved discrete state \(z_t\) evolves according to a Markov chain, and an observation \(x_t\) is generated from a state-dependent emission distribution. The problem is to reason about the hidden states from the observations alone — for filtering (“what state am I in now?”), smoothing (“what state was I in at time \(t\), given everything I have seen?”), decoding (“what is the most likely state path?”), and parameter learning (“what transition and emission probabilities best explain this data?”).

HMMs were the dominant model for speech recognition through the 2000s and remain the standard probabilistic baseline for any discrete-state sequence task: part-of-speech tagging, gene-finding, ion-channel analysis, behavior segmentation. Conceptually they are the discrete-state cousin of Kalman filters and the foundational example for forward-backward inference and Baum-Welch parameter learning.

Model

Latent states \(z_1, \ldots, z_T \in \{1, \ldots, K\}\) and observations \(x_1, \ldots, x_T\). The model has three pieces:

  • Initial distribution \(\pi_k = P(z_1 = k)\).
  • Transition matrix \(A \in \mathbb{R}^{K \times K}\) with \(A_{jk} = P(z_t = k \mid z_{t-1} = j)\).
  • Emission distribution \(B\) with \(B_k(x) = P(x_t = x \mid z_t = k)\). For discrete observations over \(M\) symbols, \(B\) is a \(K \times M\) matrix; for continuous observations, each state gets a parametric density (commonly Gaussian).

The joint factorizes as

\[ p(x_{1:T}, z_{1:T}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1}, z_t} B_{z_t}(x_t). \]

The model parameters are \(\theta = (\pi, A, B)\).
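The factorized joint above translates directly into code. A minimal sketch in plain Python (states and symbols 0-indexed, discrete emissions as a \(K \times M\) matrix, all probabilities assumed nonzero so logs are safe; the function name and toy parameters are illustrative, not from the source):

```python
import math

def hmm_joint_log_prob(pi, A, B, z, x):
    """Log of p(x_{1:T}, z_{1:T}) = pi_{z1} B_{z1}(x1) * prod_t A_{z_{t-1},z_t} B_{z_t}(x_t).

    pi: length-K initial distribution
    A:  K x K transition matrix, A[j][k] = P(z_t = k | z_{t-1} = j)
    B:  K x M emission matrix, B[k][m] = P(x_t = m | z_t = k)
    z:  state path, x: observation sequence (equal lengths)
    """
    logp = math.log(pi[z[0]]) + math.log(B[z[0]][x[0]])
    for t in range(1, len(x)):
        logp += math.log(A[z[t - 1]][z[t]]) + math.log(B[z[t]][x[t]])
    return logp

# Toy 2-state example (hypothetical numbers):
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
logp = hmm_joint_log_prob(pi, A, B, z=[0, 0, 1], x=[0, 1, 1])
```

Working in log space from the start is the standard trick: the raw product underflows to zero for even moderately long sequences.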

Diagram: a 3-state, 4-step trellis

Each column is a time step; each row is a possible state. Solid arrows show one possible state transition path; the dashed arrows beneath each state are the emissions \(x_t\).

[Figure: 3-state, 4-step trellis. Light gray edges: every transition has some probability \(A_{jk}\). Highlighted: one possible path \(z = (2, 2, 3, 1)\). Each emission \(x_t\) is drawn from \(B_{z_t}\), depending only on the current hidden state.]

Conditional Independence

The defining structure is two Markov properties:

  1. State Markov: \(z_t \perp z_{1:t-2} \mid z_{t-1}\). The next state depends only on the current state, not history.
  2. Observation independence: \(x_t \perp x_{\neq t}, z_{\neq t} \mid z_t\). The current observation depends only on the current state.

Together these give the chain-and-observation graphical structure that makes inference tractable: messages only need to flow forward and backward in time once.

Three Canonical Problems

Following Rabiner’s tutorial, three inference problems define what an HMM is good for:

  1. Likelihood — compute \(p_\theta(x_{1:T})\). Solved by the forward pass in \(O(K^2 T)\). Naively summing over all \(K^T\) state paths is intractable.
  2. Decoding — find \(\arg\max_{z_{1:T}} p_\theta(z_{1:T} \mid x_{1:T})\), the single most likely state path. Solved by the Viterbi algorithm.
  3. Parameter learning — find \(\arg\max_\theta p_\theta(x_{1:T})\) from a set of observation sequences. Solved by Baum-Welch, the EM specialization for HMMs.

The smoothing problem — compute \(p_\theta(z_t \mid x_{1:T})\) for every \(t\) — is solved by the forward-backward algorithm, which is a building block for both the likelihood and Baum-Welch.
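The forward pass and the smoothing computation can be sketched together. This is the unscaled version for clarity (toy parameters are hypothetical; a practical implementation rescales \(\alpha\) and \(\beta\) per step, or works in log space, to avoid underflow):

```python
def forward_backward(pi, A, B, x):
    """Returns (p(x_{1:T}), gamma) where gamma[t][k] = p(z_t = k | x_{1:T}).

    Forward:  alpha[t][k] = p(x_{1:t}, z_t = k)       -- O(K^2 T)
    Backward: beta[t][k]  = p(x_{t+1:T} | z_t = k)
    """
    K, T = len(pi), len(x)
    alpha = [[pi[k] * B[k][x[0]] for k in range(K)]]
    for t in range(1, T):
        alpha.append([B[k][x[t]] * sum(alpha[t - 1][j] * A[j][k] for j in range(K))
                      for k in range(K)])
    likelihood = sum(alpha[T - 1])          # p(x_{1:T}): sum out the final state

    beta = [[1.0] * K for _ in range(T)]    # beta[T-1][k] = 1 by convention
    for t in range(T - 2, -1, -1):
        for j in range(K):
            beta[t][j] = sum(A[j][k] * B[k][x[t + 1]] * beta[t + 1][k]
                             for k in range(K))

    # Smoothed marginals: alpha * beta, normalized by the likelihood.
    gamma = [[alpha[t][k] * beta[t][k] / likelihood for k in range(K)]
             for t in range(T)]
    return likelihood, gamma

# Toy 2-state example (same hypothetical numbers as above):
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
likelihood, gamma = forward_backward(pi, A, B, x=[0, 1, 1])
```

Note that the forward pass alone already yields the likelihood — problem 1 above — in \(O(K^2 T)\) rather than \(O(K^T)\).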

Filtering vs. Smoothing vs. Decoding

These often get confused.

  • Filtering computes \(p(z_t \mid x_{1:t})\) — the current state given everything seen so far. It is causal and online, and can run in real time. The forward pass alone gives this.
  • Smoothing computes \(p(z_t \mid x_{1:T})\) — the state at time \(t\) given the entire sequence, including future observations. It is offline. Forward + backward gives this.
  • Decoding (Viterbi) computes the single most-likely joint state path \(\arg\max_{z_{1:T}} p(z_{1:T} \mid x_{1:T})\), which is generally not the sequence of pointwise smoothing argmaxes. Choosing the most likely state at each time independently can produce a path with zero joint probability if a transition along it has \(A_{jk} = 0\).
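Decoding is the max-product counterpart of the forward pass: replace the sum over predecessors with a max, and keep backpointers. A minimal log-space sketch (function name and toy parameters are illustrative; probabilities are assumed nonzero so `math.log` is safe — a zero transition would need `-inf` handling):

```python
import math

def viterbi(pi, A, B, x):
    """Most likely state path arg max p(z_{1:T} | x_{1:T}), via dynamic programming.

    delta[t][k] = max over paths ending in state k at time t of log p(x_{1:t}, z_{1:t}).
    """
    K, T = len(pi), len(x)
    delta = [[math.log(pi[k]) + math.log(B[k][x[0]]) for k in range(K)]]
    back = []                                # back[t-1][k]: best predecessor of k at t
    for t in range(1, T):
        row, ptr = [], []
        for k in range(K):
            best_j = max(range(K), key=lambda j: delta[t - 1][j] + math.log(A[j][k]))
            row.append(delta[t - 1][best_j] + math.log(A[best_j][k])
                       + math.log(B[k][x[t]]))
            ptr.append(best_j)
        delta.append(row)
        back.append(ptr)
    # Backtrack from the best final state.
    path = [max(range(K), key=lambda k: delta[T - 1][k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy 2-state example (same hypothetical numbers as before):
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
best_path = viterbi(pi, A, B, x=[0, 1, 1])
```

Because the max is taken jointly over whole paths, the returned sequence is guaranteed to have positive joint probability, unlike the sequence of pointwise smoothing argmaxes.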

Limitations

The Markov assumption is the obvious one: the model has no memory beyond the current state. To capture long-range dependence one must either grow the state space (which scales exponentially) or move to a different sequence model. The discrete-state assumption is the other: the continuous-state analogue is the linear-Gaussian state-space model (Kalman filter), and in the deep-learning era the sequence-modeling role has largely passed to recurrent neural networks and transformers. The conceptual core — latent states, emissions, forward-backward inference — survives all of these generalizations.

References

Baum, Leonard E., and Ted Petrie. 1966. “Statistical Inference for Probabilistic Functions of Finite State Markov Chains.” The Annals of Mathematical Statistics 37 (6): 1554–63. https://doi.org/10.1214/aoms/1177699147.