Baum-Welch Algorithm
Motivation
Baum-Welch is the algorithm that attempts to maximize the likelihood of the parameters \(\theta = (\pi, A, B)\) of a hidden Markov model given one or more observation sequences but no labels for the hidden states (Baum et al. 1970). It is the specialization of expectation-maximization to HMMs: the E-step is a forward-backward pass, and the M-step is a set of closed-form weighted maximum-likelihood updates.
Historically Baum-Welch predates the general EM framework — Baum and his collaborators worked it out for HMMs in the late 1960s and early 1970s. Dempster, Laird, and Rubin’s 1977 EM paper (Dempster et al. 1977) subsumed it. The HMM-specific name has stuck because the closed-form M-step is unusually clean.
Problem
Given one or more observation sequences \(x_{1:T}\) from an HMM with \(K\) hidden states but no labels for the hidden states, find parameters \(\theta = (\pi, A, B)\) that locally maximize \(\log p_\theta(x_{1:T})\).
The likelihood marginalizes over all \(K^T\) hidden state sequences. The forward recursion makes this marginal computable in \(O(K^2 T)\), but maximizing it over \(\theta\) has no closed form. Baum-Welch uses the EM structure to climb the likelihood surface monotonically, one closed-form update at a time.
Key Ideas
EM as coordinate ascent
The hidden state sequence \(z_{1:T}\) is unknown, but conditioning on \(\theta^{(t)}\) produces a tractable posterior \(p_{\theta^{(t)}}(z_{1:T} \mid x_{1:T})\). The trick is to alternate:
- E-step. Holding \(\theta^{(t)}\) fixed, compute the posterior over hidden states. The HMM chain structure makes this a single forward-backward pass.
- M-step. Holding the posterior fixed, maximize the expected complete-data log-likelihood over \(\theta\). With HMM parameters, this is closed-form: every update is a weighted MLE.
This is coordinate ascent on the ELBO: the E-step closes the gap between the ELBO and the log-likelihood by setting the variational posterior to the exact posterior; the M-step then raises the ELBO. Each iteration cannot decrease the log-likelihood.
Soft counts on the trellis
The E-step turns each hidden-state node and each transition in the HMM trellis into a soft count — \(\gamma_t(k)\) is the expected number of times we were in state \(k\) at time \(t\), and \(\xi_t(j, k)\) is the expected number of transitions from \(j\) to \(k\) at time \(t\). The M-step then renormalizes these counts into a new probability vector for \(\pi\), transition matrix \(A\), and emissions \(B\).
Deriving the Solution
E-step
Run forward-backward at \(\theta^{(t)}\) to compute the smoothed single-state and pairwise posteriors:
\[ \gamma_t(k) = p_{\theta^{(t)}}(z_t = k \mid x_{1:T}) = \frac{\alpha_t(k) \beta_t(k)}{\sum_j \alpha_t(j) \beta_t(j)}, \]
\[ \xi_t(j, k) = p_{\theta^{(t)}}(z_t = j, z_{t+1} = k \mid x_{1:T}) = \frac{\alpha_t(j) A^{(t)}_{jk} B^{(t)}_k(x_{t+1}) \beta_{t+1}(k)}{\sum_{j', k'} \alpha_t(j') A^{(t)}_{j'k'} B^{(t)}_{k'}(x_{t+1}) \beta_{t+1}(k')}. \]
These are the sufficient statistics for the M-step.
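The two formulas can be computed in a few vectorized lines. The sketch below (NumPy; all variable names are local choices, not a library API) runs a plain forward-backward pass on the 2-state weather HMM used in the walkthrough later on this page and forms \(\gamma\) and \(\xi\):

```python
import numpy as np

# 2-state weather HMM (states H, C; symbols 1..3 stored 0-indexed)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],    # emissions for H
              [0.5, 0.4, 0.1]])   # emissions for C
x = np.array([2, 0, 2])           # observation sequence (3, 1, 3), 0-indexed
T, K = len(x), len(pi)

# Forward pass: alpha[t, k] = p(x_{1:t}, z_t = k)
alpha = np.zeros((T, K))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]

# Backward pass: beta[t, k] = p(x_{t+1:T} | z_t = k)
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

evidence = alpha[-1].sum()        # p(x_{1:T})

# Smoothed posteriors: the sufficient statistics for the M-step
gamma = alpha * beta / evidence                              # gamma[t, k]
xi = (alpha[:-1, :, None] * A[None, :, :]
      * (B[:, x[1:]].T * beta[1:])[:, None, :]) / evidence   # xi[t, j, k]

print(gamma.round(3))   # row t is (gamma_t(H), gamma_t(C))
```

A useful sanity check on any implementation: `xi.sum(axis=2)` must reproduce `gamma[:-1]` row for row, which is exactly the identity \(\sum_k \xi_t(j,k) = \gamma_t(j)\) used in the M-step.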
M-step
Maximize \(Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z_{1:T} \mid x_{1:T}, \theta^{(t)}}[\log p_\theta(x_{1:T}, z_{1:T})]\). Substituting the HMM joint and using Lagrange multipliers for the normalization constraints yields
\[ \pi^{(t+1)}_k = \gamma_1(k), \]
\[ A^{(t+1)}_{jk} = \frac{\sum_{t=1}^{T-1} \xi_t(j, k)}{\sum_{t=1}^{T-1} \gamma_t(j)}. \]
(The denominator \(\sum_{t=1}^{T-1} \gamma_t(j)\) equals \(\sum_{t=1}^{T-1} \sum_k \xi_t(j,k)\) because \(\sum_k \xi_t(j,k) = \gamma_t(j)\) for \(t < T\).)
For discrete emissions over an alphabet of size \(M\),
\[ B^{(t+1)}_k(m) = \frac{\sum_{t=1}^T \gamma_t(k) \, \mathbb{1}[x_t = m]}{\sum_{t=1}^T \gamma_t(k)}. \]
For Gaussian emissions \(x_t \mid z_t = k \sim \mathcal{N}(\mu_k, \Sigma_k)\),
\[ \mu^{(t+1)}_k = \frac{\sum_{t=1}^T \gamma_t(k) x_t}{\sum_{t=1}^T \gamma_t(k)}, \qquad \Sigma^{(t+1)}_k = \frac{\sum_{t=1}^T \gamma_t(k) (x_t - \mu^{(t+1)}_k)(x_t - \mu^{(t+1)}_k)^\top}{\sum_{t=1}^T \gamma_t(k)}. \]
The pattern is consistent: every update is a weighted MLE under weights given by the responsibilities \(\gamma\) (or \(\xi\) for transitions). This is what EM looks like in any exponential-family mixture; see the GMM example in the EM page.
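The weighted-MLE pattern for discrete emissions fits in a few lines. Below is a minimal sketch assuming `gamma` and `xi` have already been produced by an E-step; the function name and signature are illustrative, not a library API:

```python
import numpy as np

def m_step(gamma, xi, x, M):
    """Closed-form Baum-Welch M-step for discrete emissions.

    gamma : (T, K) smoothed state posteriors
    xi    : (T-1, K, K) smoothed pairwise posteriors
    x     : (T,) observations as 0-indexed symbols
    M     : alphabet size
    (Illustrative helper, not a library function.)
    """
    pi_new = gamma[0]                                   # pi_k = gamma_1(k)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # B_k(m): responsibility-weighted frequency of symbol m in state k
    onehot = np.eye(M)[x]                               # (T, M) indicator of x_t = m
    B_new = gamma.T @ onehot / gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

# Tiny consistent example: build xi first, then derive gamma from its margins
rng = np.random.default_rng(0)
T, K, M = 6, 2, 3
xi = rng.random((T - 1, K, K))
xi /= xi.sum(axis=(1, 2), keepdims=True)                  # each slice is a joint
gamma = np.vstack([xi.sum(axis=2), xi[-1].sum(axis=0)])   # margins agree
x = rng.integers(0, M, size=T)

pi_new, A_new, B_new = m_step(gamma, xi, x, M)
```

Because every update is a normalized weighted count, the outputs are automatically valid distributions: each row of `A_new` and `B_new` sums to one by construction.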
Algorithm
    initialize θ = (π, A, B)
    repeat until log-likelihood converges:
        # E-step
        run forward-backward on x[1:T] with current θ
        compute γ[t][k] and ξ[t][j][k] for all t, j, k
        # M-step
        π[k] = γ[1][k]
        A[j][k] = sum_{t=1..T-1} ξ[t][j][k] / sum_{t=1..T-1} γ[t][j]
        B[k](m) = sum_{t: x[t]=m} γ[t][k] / sum_{t=1..T} γ[t][k]    # discrete
        # or Gaussian / exponential-family MLE under weights γ[t][k]
Each iteration is one E-step plus one M-step, both \(O(K^2 T)\); see Complexity and Tradeoffs below.
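The loop above can be sketched end to end. The version below (a sketch under assumed details: random Dirichlet initialization, a fixed iteration count instead of a convergence threshold, a single training sequence) uses per-step rescaling so long sequences do not underflow, and recovers the log-likelihood from the scaling constants:

```python
import numpy as np

def baum_welch(x, K, M, iters=50, seed=0):
    """Single-sequence Baum-Welch sketch with per-step rescaling.

    Returns parameters and the log-likelihood trace, which must be
    non-decreasing. (Hypothetical helper, not a library function.)
    """
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(K))
    A = rng.dirichlet(np.ones(K), size=K)
    B = rng.dirichlet(np.ones(M), size=K)
    T = len(x)
    lls = []
    for _ in range(iters):
        # E-step: scaled forward-backward (c[t] normalizes alpha[t])
        alpha = np.zeros((T, K)); c = np.zeros(T)
        alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta = np.ones((T, K))
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1]) / c[t + 1]
        lls.append(np.log(c).sum())            # log p(x) from the scalers
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, x[1:]].T * beta[1:])[:, None, :])
        xi /= xi.sum(axis=(1, 2), keepdims=True)
        # M-step: weighted MLE from the soft counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = gamma.T @ np.eye(M)[x] / gamma.sum(axis=0)[:, None]
    return pi, A, B, lls

x = np.array([2, 0, 2, 2, 1, 0, 0, 2, 1, 2])   # arbitrary symbols from {0,1,2}
pi, A, B, lls = baum_welch(x, K=2, M=3)
assert all(b >= a - 1e-8 for a, b in zip(lls, lls[1:]))   # monotone ascent
```

The final assertion is the EM guarantee from the Correctness section: each iteration's log-likelihood is at least the previous one's, up to floating-point tolerance.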
Walkthrough
One Baum-Welch iteration on a 2-state HMM
Use the HMM parameters and observation sequence from the forward-backward walkthrough:
- States: H (Hot), C (Cold). \(\pi = (0.6, 0.4)\), \(A_{HH}=0.7,\ A_{HC}=0.3,\ A_{CH}=0.4,\ A_{CC}=0.6\). Emissions \(B_H = (0.2, 0.4, 0.4)\), \(B_C = (0.5, 0.4, 0.1)\) over symbols \(\{1, 2, 3\}\).
- Observation: \(x = (3, 1, 3)\).
E-step. Forward-backward gives (from the worked example):
| \(t\) | \(\gamma_t(H)\) | \(\gamma_t(C)\) |
|---|---|---|
| 1 | 0.835 | 0.165 |
| 2 | 0.519 | 0.481 |
| 3 | 0.819 | 0.181 |
Pairwise posteriors \(\xi_t(j, k)\) for \(t = 1, 2\) (each row sums to \(\gamma_t(j)\)):
| \(t\) | \(\xi_t(H,H)\) | \(\xi_t(H,C)\) | \(\xi_t(C,H)\) | \(\xi_t(C,C)\) |
|---|---|---|---|---|
| 1 | 0.474 | 0.361 | 0.045 | 0.120 |
| 2 | 0.469 | 0.050 | 0.350 | 0.131 |
M-step.
- New initial: \(\pi^{\text{new}}_H = \gamma_1(H) = 0.835\), \(\pi^{\text{new}}_C = 0.165\).
- New transitions: \(A^{\text{new}}_{HH} = (0.474 + 0.469) / (0.835 + 0.519) = 0.943 / 1.354 \approx 0.697\). Similarly \(A^{\text{new}}_{HC} \approx 0.303\), \(A^{\text{new}}_{CH} \approx 0.611\), \(A^{\text{new}}_{CC} \approx 0.389\).
- New emissions: \(B^{\text{new}}_H(3) = (\gamma_1(H) + \gamma_3(H)) / (\gamma_1(H) + \gamma_2(H) + \gamma_3(H)) = (0.835 + 0.819) / 2.173 \approx 0.761\). \(B^{\text{new}}_H(1) = \gamma_2(H) / 2.173 \approx 0.239\). \(B^{\text{new}}_C(3) = (\gamma_1(C) + \gamma_3(C)) / (\gamma_1(C) + \gamma_2(C) + \gamma_3(C)) = (0.165 + 0.181) / 0.827 \approx 0.418\). \(B^{\text{new}}_C(1) \approx 0.582\).
After one iteration, the H state has become more strongly associated with emission \(3\) (was \(0.4\), now \(0.76\)), and \(A_{CH}\) has grown sharply (was \(0.4\), now \(0.61\)) because whenever the chain is in C at \(t = 2\) the trellis evidence strongly favors returning to H to emit the \(3\) at \(t = 3\). With more iterations on this single short sequence the parameters would continue to drift toward whatever local optimum best explains \(x = (3, 1, 3)\).
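The iteration above can be checked numerically. The sketch below recomputes \(\gamma\), \(\xi\), and the M-step updates directly from the stated parameters rather than from the rounded table entries:

```python
import numpy as np

# Walkthrough parameters: states (H, C), symbols {1, 2, 3} stored 0-indexed
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
x = np.array([2, 0, 2])                          # observations (3, 1, 3)
T, K = len(x), 2

# E-step: unscaled forward-backward (fine for T = 3)
alpha = np.zeros((T, K)); alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
p = alpha[-1].sum()
gamma = alpha * beta / p
xi = (alpha[:-1, :, None] * A * (B[:, x[1:]].T * beta[1:])[:, None, :]) / p

# M-step: renormalize the soft counts
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = gamma.T @ np.eye(3)[x] / gamma.sum(axis=0)[:, None]
print(pi_new.round(3), A_new.round(3), B_new.round(3), sep="\n")
```

Note that `B_new[:, 1]` comes out exactly zero: symbol \(2\) never occurs in this sequence, which is precisely the situation the pseudocount smoothing in the practical notes guards against.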
Correctness
Each iteration cannot decrease \(\log p_\theta(x_{1:T})\), assuming exact E- and M-steps and exact arithmetic. This follows from the general EM convergence argument: the E-step sets the variational posterior equal to the exact posterior, making the ELBO a lower bound on the log-likelihood that is tight at \(\theta^{(t)}\); the M-step then maximizes that bound over \(\theta\), so the log-likelihood at \(\theta^{(t+1)}\) is at least its value at \(\theta^{(t)}\).
The likelihood surface for HMMs is non-convex with many local optima, so initialization matters and random restarts are standard. Common starting points: random parameters, \(k\)-means on the observations to seed emissions, or a smaller HMM bootstrapped to a larger one.
Complexity and Tradeoffs
Each iteration: \(O(K^2 T)\) per sequence for both E-step and M-step. Memory: \(O(KT)\) for the forward-backward tables. The number of iterations is governed by the convergence threshold and the difficulty of the likelihood surface — anywhere from tens to hundreds of iterations is typical.
Practical notes
- Multiple sequences. Sum the sufficient statistics \(\sum_t \xi_t\) and \(\sum_t \gamma_t\) across all training sequences; the M-step formulas for \(A\) and \(B\) are unchanged. For \(\pi\), average rather than sum: \(\pi^{\text{new}}_k = \frac{1}{S}\sum_{s=1}^{S} \gamma_1^{(s)}(k)\), so the result is a valid probability vector.
- Numerical stability. Use the rescaling or log-space variant of forward-backward — long sequences underflow \(\alpha\) and \(\beta\) otherwise.
- Stopping criterion. Halt when the relative change in log-likelihood per iteration drops below a threshold. The likelihood is also useful to watch for bugs: it must not decrease, and decreasing values in floating point indicate numerical issues.
- Identifiability. HMM parameters are only identifiable up to a permutation of the state labels. This is a non-issue for likelihood evaluation but means trained parameters from different runs cannot be compared componentwise without alignment.
- Smoothing. Adding pseudocounts — equivalent to a Dirichlet prior with positive hyperparameters — to \(\pi\), \(A\), and discrete \(B\) avoids zero probabilities for unseen transitions or symbols, which would otherwise propagate \(-\infty\) log-likelihoods on test data.
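The smoothing note reduces to one line over the soft counts. A minimal sketch (hypothetical helper; the pseudocount value is a tuning choice):

```python
import numpy as np

def smooth(counts, pseudocount=1e-2):
    """Add a Dirichlet-style pseudocount to each row of soft counts and
    renormalize, so no transition or symbol gets exactly zero probability.
    (Illustrative helper; equivalent to a MAP update under a Dirichlet prior.)
    """
    counts = np.asarray(counts, dtype=float) + pseudocount
    return counts / counts.sum(axis=-1, keepdims=True)

# Example: a state whose soft counts never touched symbol 2 during training
raw_emission_counts = np.array([[4.2, 0.0, 1.8]])
B_row = smooth(raw_emission_counts)
print(B_row)   # every entry strictly positive; row still sums to 1
```

Applied to the \(\pi\), \(A\), and discrete \(B\) count arrays before normalization, this keeps test-time log-likelihoods finite even for transitions or symbols unseen in training.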
When to Use It
| Situation | Approach |
|---|---|
| Unsupervised HMM training (no state labels) | Baum-Welch. |
| Supervised HMM training (state labels available) | Maximum-likelihood from counts directly — no EM needed. |
| Partial labels available | Initialize from labeled counts, refine with Baum-Welch. |
| Multiple training sequences | Baum-Welch with summed sufficient statistics across sequences. |
| Stuck in a poor local optimum | Random restarts, \(k\)-means initialization for emissions, or replace HMM with a discriminative model. |
| Modern sequence modeling (speech, NLP) | Recurrent neural networks and transformers often replace HMMs end-to-end; the EM coordinate-ascent perspective survives in variational methods including the VAE. |