Baum-Welch Algorithm
Motivation
Baum-Welch is the algorithm that attempts to maximize the likelihood of the parameters \(\theta = (\pi, A, B)\) of a hidden Markov model given one or more observation sequences but no labels for the hidden states (Baum et al. 1970). It is the specialization of expectation-maximization to HMMs: the E-step is a forward-backward pass, and the M-step is a set of closed-form weighted maximum-likelihood updates.
Historically Baum-Welch predates the general EM framework — Baum and his collaborators worked it out for HMMs in the late 1960s and early 1970s. Dempster, Laird, and Rubin’s 1977 EM paper (Dempster et al. 1977) subsumed it. The HMM-specific name has stuck because the closed-form M-step is unusually clean.
Problem
Given one or more observation sequences \(x_{1:T}\) from an HMM with \(K\) hidden states but no labels for the hidden states, find parameters \(\theta = (\pi, A, B)\) that locally maximize \(\log p_\theta(x_{1:T})\).
The likelihood marginalizes over all \(K^T\) hidden state sequences. The forward recursion makes this marginal computable in \(O(K^2 T)\), but maximizing it over \(\theta\) has no closed form. Baum-Welch uses the EM structure to climb the likelihood surface monotonically, one closed-form update at a time.
Key Ideas
EM as coordinate ascent
The hidden state sequence \(z_{1:T}\) is unknown, but conditioning on \(\theta^{(t)}\) produces a tractable posterior \(p_{\theta^{(t)}}(z_{1:T} \mid x_{1:T})\). The trick is to alternate:
- E-step. Holding \(\theta^{(t)}\) fixed, compute the posterior over hidden states. The HMM chain structure makes this a single forward-backward pass.
- M-step. Holding the posterior fixed, maximize the expected complete-data log-likelihood over \(\theta\). With HMM parameters, this is closed-form: every update is a weighted MLE.
This is coordinate ascent on the ELBO: the E-step closes the gap between the ELBO and the log-likelihood by setting the variational posterior to the exact posterior; the M-step then raises the ELBO. Each iteration cannot decrease the log-likelihood.
Soft counts on the trellis
The E-step turns each hidden-state node and each transition in the HMM trellis into a soft count — \(\gamma_t(k)\) is the expected number of times we were in state \(k\) at time \(t\), and \(\xi_t(j, k)\) is the expected number of transitions from \(j\) to \(k\) at time \(t\). The M-step then renormalizes these counts into a new probability vector for \(\pi\), transition matrix \(A\), and emissions \(B\).
Deriving the Solution
E-step
Run forward-backward at \(\theta^{(t)}\) to compute the smoothed single-state and pairwise posteriors:
\[ \gamma_t(k) = p_{\theta^{(t)}}(z_t = k \mid x_{1:T}) = \frac{\alpha_t(k) \beta_t(k)}{\sum_j \alpha_t(j) \beta_t(j)}, \]
\[ \xi_t(j, k) = p_{\theta^{(t)}}(z_t = j, z_{t+1} = k \mid x_{1:T}) = \frac{\alpha_t(j) A^{(t)}_{jk} B^{(t)}_k(x_{t+1}) \beta_{t+1}(k)}{\sum_{j', k'} \alpha_t(j') A^{(t)}_{j'k'} B^{(t)}_{k'}(x_{t+1}) \beta_{t+1}(k')}. \]
These are the sufficient statistics for the M-step.
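The two formulas can be computed in a few vectorized lines. The sketch below (NumPy; all variable names are local choices, not a library API) runs a plain forward-backward pass on the 2-state weather HMM used in the walkthrough later on this page and forms \(\gamma\) and \(\xi\):

```python
import numpy as np

# 2-state weather HMM (states H, C; symbols 1..3 stored 0-indexed)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],    # emissions for H
              [0.5, 0.4, 0.1]])   # emissions for C
x = np.array([2, 0, 2])           # observation sequence (3, 1, 3), 0-indexed
T, K = len(x), len(pi)

# Forward pass: alpha[t, k] = p(x_{1:t}, z_t = k)
alpha = np.zeros((T, K))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]

# Backward pass: beta[t, k] = p(x_{t+1:T} | z_t = k)
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

evidence = alpha[-1].sum()        # p(x_{1:T})

# Smoothed posteriors: the sufficient statistics for the M-step
gamma = alpha * beta / evidence                              # gamma[t, k]
xi = (alpha[:-1, :, None] * A[None, :, :]
      * (B[:, x[1:]].T * beta[1:])[:, None, :]) / evidence   # xi[t, j, k]

print(gamma.round(3))   # row t is (gamma_t(H), gamma_t(C))
```

A useful sanity check on any implementation: `xi.sum(axis=2)` must reproduce `gamma[:-1]` row for row, which is exactly the identity \(\sum_k \xi_t(j,k) = \gamma_t(j)\) used in the M-step.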
M-step
Maximize \(Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z_{1:T} \mid x_{1:T}, \theta^{(t)}}[\log p_\theta(x_{1:T}, z_{1:T})]\). Substituting the HMM joint and using Lagrange multipliers for the normalization constraints yields
\[ \pi^{(t+1)}_k = \gamma_1(k), \]
\[ A^{(t+1)}_{jk} = \frac{\sum_{t=1}^{T-1} \xi_t(j, k)}{\sum_{t=1}^{T-1} \gamma_t(j)}. \]
(The denominator \(\sum_{t=1}^{T-1} \gamma_t(j)\) equals \(\sum_{t=1}^{T-1} \sum_k \xi_t(j,k)\) because \(\sum_k \xi_t(j,k) = \gamma_t(j)\) for \(t < T\).)
For discrete emissions over an alphabet of size \(M\),
\[ B^{(t+1)}_k(m) = \frac{\sum_{t=1}^T \gamma_t(k) \, \mathbb{1}[x_t = m]}{\sum_{t=1}^T \gamma_t(k)}. \]
For Gaussian emissions \(x_t \mid z_t = k \sim \mathcal{N}(\mu_k, \Sigma_k)\),
\[ \mu^{(t+1)}_k = \frac{\sum_{t=1}^T \gamma_t(k) x_t}{\sum_{t=1}^T \gamma_t(k)}, \qquad \Sigma^{(t+1)}_k = \frac{\sum_{t=1}^T \gamma_t(k) (x_t - \mu^{(t+1)}_k)(x_t - \mu^{(t+1)}_k)^\top}{\sum_{t=1}^T \gamma_t(k)}. \]
The pattern is consistent: every update is a weighted MLE under weights given by the responsibilities \(\gamma\) (or \(\xi\) for transitions). This is what EM looks like in any exponential-family mixture; see the GMM example in the EM page.
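The weighted-MLE pattern for discrete emissions fits in a few lines. Below is a minimal sketch assuming `gamma` and `xi` have already been produced by an E-step; the function name and signature are illustrative, not a library API:

```python
import numpy as np

def m_step(gamma, xi, x, M):
    """Closed-form Baum-Welch M-step for discrete emissions.

    gamma : (T, K) smoothed state posteriors
    xi    : (T-1, K, K) smoothed pairwise posteriors
    x     : (T,) observations as 0-indexed symbols
    M     : alphabet size
    (Illustrative helper, not a library function.)
    """
    pi_new = gamma[0]                                   # pi_k = gamma_1(k)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # B_k(m): responsibility-weighted frequency of symbol m in state k
    onehot = np.eye(M)[x]                               # (T, M) indicator of x_t = m
    B_new = gamma.T @ onehot / gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

# Tiny consistent example: build xi first, then derive gamma from its margins
rng = np.random.default_rng(0)
T, K, M = 6, 2, 3
xi = rng.random((T - 1, K, K))
xi /= xi.sum(axis=(1, 2), keepdims=True)                  # each slice is a joint
gamma = np.vstack([xi.sum(axis=2), xi[-1].sum(axis=0)])   # margins agree
x = rng.integers(0, M, size=T)

pi_new, A_new, B_new = m_step(gamma, xi, x, M)
```

Because every update is a normalized weighted count, the outputs are automatically valid distributions: each row of `A_new` and `B_new` sums to one by construction.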
Algorithm
    initialize θ = (π, A, B)
    repeat until log-likelihood converges:
        # E-step
        run forward-backward on x[1:T] with current θ
        compute γ[t][k] and ξ[t][j][k] for all t, j, k
        # M-step
        π[k] = γ[1][k]
        A[j][k] = sum_{t=1..T-1} ξ[t][j][k] / sum_{t=1..T-1} γ[t][j]
        B[k](m) = sum_{t: x[t]=m} γ[t][k] / sum_{t=1..T} γ[t][k]    # discrete
        # or Gaussian / exponential-family MLE under weights γ[t][k]
Each iteration is one E-step plus one M-step, both \(O(K^2 T)\); see Complexity and Tradeoffs below.
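The loop above can be sketched end to end. The version below (a sketch under assumed details: random Dirichlet initialization, a fixed iteration count instead of a convergence threshold, a single training sequence) uses per-step rescaling so long sequences do not underflow, and recovers the log-likelihood from the scaling constants:

```python
import numpy as np

def baum_welch(x, K, M, iters=50, seed=0):
    """Single-sequence Baum-Welch sketch with per-step rescaling.

    Returns parameters and the log-likelihood trace, which must be
    non-decreasing. (Hypothetical helper, not a library function.)
    """
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(K))
    A = rng.dirichlet(np.ones(K), size=K)
    B = rng.dirichlet(np.ones(M), size=K)
    T = len(x)
    lls = []
    for _ in range(iters):
        # E-step: scaled forward-backward (c[t] normalizes alpha[t])
        alpha = np.zeros((T, K)); c = np.zeros(T)
        alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta = np.ones((T, K))
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1]) / c[t + 1]
        lls.append(np.log(c).sum())            # log p(x) from the scalers
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, x[1:]].T * beta[1:])[:, None, :])
        xi /= xi.sum(axis=(1, 2), keepdims=True)
        # M-step: weighted MLE from the soft counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = gamma.T @ np.eye(M)[x] / gamma.sum(axis=0)[:, None]
    return pi, A, B, lls

x = np.array([2, 0, 2, 2, 1, 0, 0, 2, 1, 2])   # arbitrary symbols from {0,1,2}
pi, A, B, lls = baum_welch(x, K=2, M=3)
assert all(b >= a - 1e-8 for a, b in zip(lls, lls[1:]))   # monotone ascent
```

The final assertion is the EM guarantee from the Correctness section: each iteration's log-likelihood is at least the previous one's, up to floating-point tolerance.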
Walkthrough
One Baum-Welch iteration on a 2-state HMM
Use the HMM parameters and observation sequence from the forward-backward walkthrough:
- States: H (Hot), C (Cold). \(\pi = (0.6, 0.4)\), \(A_{HH}=0.7,\ A_{HC}=0.3,\ A_{CH}=0.4,\ A_{CC}=0.6\). Emissions \(B_H = (0.2, 0.4, 0.4)\), \(B_C = (0.5, 0.4, 0.1)\) over symbols \(\{1, 2, 3\}\).
- Observation: \(x = (3, 1, 3)\).
E-step. Forward-backward gives (from the worked example):
| \(t\) | \(\gamma_t(H)\) | \(\gamma_t(C)\) |
|---|---|---|
| 1 | 0.835 | 0.165 |
| 2 | 0.519 | 0.481 |
| 3 | 0.819 | 0.181 |
Pairwise posteriors \(\xi_t(j, k)\) for \(t = 1, 2\) (each row sums to \(\gamma_t(j)\)):
| \(t\) | \(\xi_t(H,H)\) | \(\xi_t(H,C)\) | \(\xi_t(C,H)\) | \(\xi_t(C,C)\) |
|---|---|---|---|---|
| 1 | 0.474 | 0.361 | 0.045 | 0.120 |
| 2 | 0.469 | 0.050 | 0.350 | 0.131 |
M-step.
- New initial: \(\pi^{\text{new}}_H = \gamma_1(H) = 0.835\), \(\pi^{\text{new}}_C = 0.165\).
- New transitions: \(A^{\text{new}}_{HH} = (0.474 + 0.469) / (0.835 + 0.519) = 0.943 / 1.354 \approx 0.697\). Similarly \(A^{\text{new}}_{HC} \approx 0.303\), \(A^{\text{new}}_{CH} \approx 0.611\), \(A^{\text{new}}_{CC} \approx 0.389\).
- New emissions: \(B^{\text{new}}_H(3) = (\gamma_1(H) + \gamma_3(H)) / (\gamma_1(H) + \gamma_2(H) + \gamma_3(H)) = (0.835 + 0.819) / 2.173 \approx 0.761\). \(B^{\text{new}}_H(1) = \gamma_2(H) / 2.173 \approx 0.239\). \(B^{\text{new}}_C(3) = (\gamma_1(C) + \gamma_3(C)) / (\gamma_1(C) + \gamma_2(C) + \gamma_3(C)) = (0.165 + 0.181) / 0.827 \approx 0.418\). \(B^{\text{new}}_C(1) \approx 0.582\).
After one iteration, the H state has become more strongly associated with emission \(3\) (was \(0.4\), now \(0.76\)), and \(A_{CH}\) has grown sharply (was \(0.4\), now \(0.61\)) because whenever the chain is in C at \(t = 2\) the trellis evidence strongly favors returning to H to emit the \(3\) at \(t = 3\). With more iterations on this single short sequence the parameters would continue to drift toward whatever local optimum best explains \(x = (3, 1, 3)\).
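The iteration above can be checked numerically. The sketch below recomputes \(\gamma\), \(\xi\), and the M-step updates directly from the stated parameters rather than from the rounded table entries:

```python
import numpy as np

# Walkthrough parameters: states (H, C), symbols {1, 2, 3} stored 0-indexed
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
x = np.array([2, 0, 2])                          # observations (3, 1, 3)
T, K = len(x), 2

# E-step: unscaled forward-backward (fine for T = 3)
alpha = np.zeros((T, K)); alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
p = alpha[-1].sum()
gamma = alpha * beta / p
xi = (alpha[:-1, :, None] * A * (B[:, x[1:]].T * beta[1:])[:, None, :]) / p

# M-step: renormalize the soft counts
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = gamma.T @ np.eye(3)[x] / gamma.sum(axis=0)[:, None]
print(pi_new.round(3), A_new.round(3), B_new.round(3), sep="\n")
```

Note that `B_new[:, 1]` comes out exactly zero: symbol \(2\) never occurs in this sequence, which is precisely the situation the pseudocount smoothing in the practical notes guards against.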
Correctness
Each iteration cannot decrease \(\log p_\theta(x_{1:T})\), assuming exact E- and M-steps and exact arithmetic. This follows from the general EM convergence argument: the E-step sets the variational posterior equal to the exact posterior, making the ELBO a lower bound on the log-likelihood that is tight at \(\theta^{(t)}\); the M-step then maximizes that bound over \(\theta\), so the log-likelihood at \(\theta^{(t+1)}\) is at least its value at \(\theta^{(t)}\).
The likelihood surface for HMMs is non-convex with many local optima, so initialization matters and random restarts are standard. Common starting points: random parameters, \(k\)-means on the observations to seed emissions, or a smaller HMM bootstrapped to a larger one.
Complexity and Tradeoffs
Each iteration: \(O(K^2 T)\) per sequence for both E-step and M-step. Memory: \(O(KT)\) for the forward-backward tables. The number of iterations is governed by the convergence threshold and the difficulty of the likelihood surface — anywhere from tens to hundreds of iterations is typical.
Practical notes
- Multiple sequences. Sum the sufficient statistics \(\sum_t \xi_t\) and \(\sum_t \gamma_t\) across all training sequences; the M-step formulas for \(A\) and \(B\) are unchanged. For \(\pi\), average rather than sum: \(\pi^{\text{new}}_k = \frac{1}{S}\sum_{s=1}^{S} \gamma_1^{(s)}(k)\), so the result is a valid probability vector.
- Numerical stability. Use the rescaling or log-space variant of forward-backward — long sequences underflow \(\alpha\) and \(\beta\) otherwise.
- Stopping criterion. Halt when the relative change in log-likelihood per iteration drops below a threshold. The likelihood is also useful to watch for bugs: it must not decrease, and decreasing values in floating point indicate numerical issues.
- Identifiability. HMM parameters are only identifiable up to a permutation of the state labels. This is a non-issue for likelihood evaluation but means trained parameters from different runs cannot be compared componentwise without alignment.
- Smoothing. Adding pseudocounts — equivalent to a Dirichlet prior with positive hyperparameters — to \(\pi\), \(A\), and discrete \(B\) avoids zero probabilities for unseen transitions or symbols, which would otherwise propagate \(-\infty\) log-likelihoods on test data.
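The smoothing note reduces to one line over the soft counts. A minimal sketch (hypothetical helper; the pseudocount value is a tuning choice):

```python
import numpy as np

def smooth(counts, pseudocount=1e-2):
    """Add a Dirichlet-style pseudocount to each row of soft counts and
    renormalize, so no transition or symbol gets exactly zero probability.
    (Illustrative helper; equivalent to a MAP update under a Dirichlet prior.)
    """
    counts = np.asarray(counts, dtype=float) + pseudocount
    return counts / counts.sum(axis=-1, keepdims=True)

# Example: a state whose soft counts never touched symbol 2 during training
raw_emission_counts = np.array([[4.2, 0.0, 1.8]])
B_row = smooth(raw_emission_counts)
print(B_row)   # every entry strictly positive; row still sums to 1
```

Applied to the \(\pi\), \(A\), and discrete \(B\) count arrays before normalization, this keeps test-time log-likelihoods finite even for transitions or symbols unseen in training.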
When to Use It
| Situation | Approach |
|---|---|
| Unsupervised HMM training (no state labels) | Baum-Welch. |
| Supervised HMM training (state labels available) | Maximum-likelihood from counts directly — no EM needed. |
| Partial labels available | Initialize from labeled counts, refine with Baum-Welch. |
| Multiple training sequences | Baum-Welch with summed sufficient statistics across sequences. |
| Stuck in a poor local optimum | Random restarts, \(k\)-means initialization for emissions, or replace HMM with a discriminative model. |
| Modern sequence modeling (speech, NLP) | Recurrent neural networks and transformers often replace HMMs end-to-end; the EM coordinate-ascent perspective survives in variational methods including the VAE. |