Expectation-Maximization
Motivation
Expectation-maximization (EM) (Dempster et al. 1977) is an iterative algorithm for maximum-likelihood estimation in latent-variable models. When the marginal likelihood
\[ \log p_\theta(x) = \log \int p_\theta(x, z) \, dz \]
is intractable but the complete-data likelihood \(\log p_\theta(x, z)\) is easy to optimize given known \(z\), EM alternates between filling in expected latents and maximizing the resulting expected log-joint. Each iteration is guaranteed to increase \(\log p_\theta(x)\) (or leave it unchanged).
EM is the standard fitting procedure for Gaussian mixture models, hidden Markov models (where it is called Baum-Welch), and probabilistic PCA. It is also the conceptual ancestor of variational EM and the VAE.
Problem
Given a latent-variable model \(p_\theta(x, z)\) with observed data \(x\), parameters \(\theta\), and unobserved latents \(z\), find \(\theta^* = \arg\max_\theta \log p_\theta(x)\) where \(\log p_\theta(x) = \log \int p_\theta(x, z)\, dz\).
The integral over \(z\) is typically intractable, so direct gradient ascent on \(\log p_\theta(x)\) requires expensive Monte Carlo estimates. EM avoids the integral by working with the complete-data likelihood and the posterior over \(z\) instead.
Key Ideas
The ELBO decomposes the log-likelihood
For any distribution \(q(z)\),
\[ \log p_\theta(x) = \mathrm{ELBO}(q, \theta; x) + \mathrm{KL}\!\left(q(z) \,\|\, p_\theta(z \mid x)\right). \]
The KL term is non-negative (by Gibbs' inequality), so the evidence lower bound (ELBO) is a lower bound on the log-likelihood that becomes tight exactly when \(q\) matches the posterior. EM exploits this: instead of optimizing the intractable log-likelihood, optimize the ELBO — a tractable expectation — and trust the bound to drag the likelihood up with it.
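Written out, the bound in question is
\[ \mathrm{ELBO}(q, \theta; x) = \mathbb{E}_{q(z)}\!\left[\log p_\theta(x, z) - \log q(z)\right], \]
and the decomposition follows by writing \(\log p_\theta(x) = \log p_\theta(x, z) - \log p_\theta(z \mid x)\), taking the expectation of both sides under \(q(z)\) (the left side does not depend on \(z\)), and adding and subtracting \(\mathbb{E}_{q}[\log q(z)]\) to split the result into the ELBO and the KL term.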
Coordinate ascent on the ELBO
The ELBO has two arguments: \(q\) and \(\theta\). EM alternately optimizes each holding the other fixed.
- E-step maximizes over \(q\) with \(\theta\) fixed. The optimal \(q\) is the exact posterior \(p_\theta(z \mid x)\), which makes the KL zero and the bound tight.
- M-step maximizes over \(\theta\) with \(q\) fixed. Because the \(\log q\) term doesn’t depend on \(\theta\), this reduces to maximizing \(\mathbb{E}_q[\log p_\theta(x, z)]\) — a weighted complete-data likelihood, which is typically a closed-form weighted MLE.
The pattern is soft assignment in the E-step, weighted MLE in the M-step. This is what EM looks like in any exponential-family mixture model.
Soft assignments via responsibilities
For a mixture model, the E-step assigns each point fractional membership (a responsibility \(\gamma_{ik}\)) in every component. The M-step moves each component’s parameters toward the weighted average of the points it owns.
Deriving the Solution
Initialize \(\theta^{(0)}\). For \(t = 0, 1, 2, \ldots\) until convergence:
E-step. Set \(q^{(t+1)}(z) = p_{\theta^{(t)}}(z \mid x)\). With this choice the KL term is zero, so \(\mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x) = \log p_{\theta^{(t)}}(x)\) — the bound is tight at the current \(\theta\).
M-step. Update \(\theta\) by maximizing the expected complete-data log-likelihood under the fixed \(q^{(t+1)}\):
\[ \theta^{(t+1)} = \operatorname*{arg\,max}_\theta \mathbb{E}_{z \sim q^{(t+1)}}\!\left[\log p_\theta(x, z)\right]. \]
(The \(\log q\) term in the ELBO does not depend on \(\theta\), so it can be dropped.) This expected log-joint is often called the Q-function:
\[ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p_{\theta^{(t)}}(z \mid x)}\!\left[\log p_\theta(x, z)\right]. \]
The M-step is just \(\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})\).
Algorithm
initialize θ⁽⁰⁾
for t = 0, 1, 2, ... until log-likelihood converges:
    # E-step
    q⁽ᵗ⁺¹⁾(z) = p_{θ⁽ᵗ⁾}(z | x)                  # exact posterior
    Q(θ | θ⁽ᵗ⁾) = E_{z ~ q⁽ᵗ⁺¹⁾}[log p_θ(x, z)]
    # M-step
    θ⁽ᵗ⁺¹⁾ = argmax_θ Q(θ | θ⁽ᵗ⁾)
The work per iteration depends entirely on the model: a closed-form posterior makes the E-step cheap, and an exponential-family complete-data likelihood makes the M-step a single closed-form weighted MLE.
Walkthrough
Gaussian Mixture Model
Observations \(x_1, \ldots, x_n\) are modeled as draws from a mixture of \(K\) Gaussians: latent \(z_i \in \{1, \ldots, K\}\) chooses a component, and \(x_i \mid z_i = k \sim \mathcal{N}(\mu_k, \Sigma_k)\). Parameters are mixing weights \(\pi_k\), means \(\mu_k\), covariances \(\Sigma_k\).
E-step. Compute responsibilities
\[ \gamma_{ik} = p(z_i = k \mid x_i; \theta^{(t)}) = \frac{\pi_k^{(t)} \mathcal{N}(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_j \pi_j^{(t)} \mathcal{N}(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}. \]
Each \(\gamma_{ik}\) is the posterior probability that point \(i\) was generated by component \(k\), given the current parameters. Points clearly inside one cluster get \(\gamma\) close to \(1\) for that component; points in overlap regions split their weight.
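As a concrete reference point, here is a minimal NumPy/SciPy sketch of this E-step. The function and variable names are illustrative, and the densities are evaluated in log space to avoid underflow:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, pi, mu, Sigma):
    """Responsibilities gamma[i, k] = p(z_i = k | x_i; theta) for a Gaussian mixture."""
    n, K = X.shape[0], len(pi)
    log_w = np.empty((n, K))
    for k in range(K):
        # log pi_k + log N(x_i | mu_k, Sigma_k), kept in log space for stability
        log_w[:, k] = np.log(pi[k]) + multivariate_normal(mu[k], Sigma[k]).logpdf(X)
    log_norm = logsumexp(log_w, axis=1, keepdims=True)   # log of the denominator in the formula above
    gamma = np.exp(log_w - log_norm)                      # each row sums to 1
    log_lik = log_norm.sum()                              # log p_theta(x), useful for monitoring convergence
    return gamma, log_lik
```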
M-step. Closed-form weighted MLE updates:
\[ \pi_k^{(t+1)} = \frac{1}{n} \sum_i \gamma_{ik}, \qquad \mu_k^{(t+1)} = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}, \qquad \Sigma_k^{(t+1)} = \frac{\sum_i \gamma_{ik} (x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^\top}{\sum_i \gamma_{ik}}. \]
These are weighted versions of the standard Gaussian MLE formulas, with weights given by the responsibilities. After the update, the points that the algorithm currently believes belong to component \(k\) pull \(\mu_k\) toward them; points the algorithm believes belong elsewhere have little influence.
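The matching M-step, again as a sketch; the small ridge added to each covariance is an assumption, guarding against the singular-covariance failure discussed under Complexity and Tradeoffs:

```python
import numpy as np

def m_step(X, gamma, reg=1e-6):
    """Weighted-MLE updates for (pi, mu, Sigma) given responsibilities gamma of shape (n, K)."""
    n, d = X.shape
    Nk = gamma.sum(axis=0)                         # effective number of points owned by each component
    pi = Nk / n
    mu = (gamma.T @ X) / Nk[:, None]               # responsibility-weighted means
    Sigma = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        Sigma[k] += reg * np.eye(d)                # small ridge keeps the covariance well-conditioned
    return pi, mu, Sigma
```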
Iterating E and M alternately drives the mixture parameters toward a local optimum of the log-likelihood — typically converging in a few tens of iterations on well-separated data.
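Putting the two sketches together, a minimal fitting loop might look like the following; the toy data and initialization choices are assumptions for illustration, and `e_step` / `m_step` are the functions sketched above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated 2-D clusters
X = np.vstack([rng.normal(-2.0, 1.0, size=(200, 2)),
               rng.normal(3.0, 1.0, size=(200, 2))])

K = 2
pi = np.full(K, 1.0 / K)
mu = X[rng.choice(len(X), size=K, replace=False)]     # initialize means at random data points
Sigma = np.stack([np.cov(X.T) for _ in range(K)])     # start every component at the global covariance

prev_log_lik = -np.inf
for t in range(200):
    gamma, log_lik = e_step(X, pi, mu, Sigma)         # E-step: responsibilities + current log-likelihood
    pi, mu, Sigma = m_step(X, gamma)                  # M-step: weighted MLE
    if log_lik - prev_log_lik < 1e-6:                 # monotone, so this gap is non-negative
        break
    prev_log_lik = log_lik
```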
Correctness
The crucial property is that EM monotonically increases the log-likelihood:
\[ \log p_{\theta^{(t+1)}}(x) \geq \log p_{\theta^{(t)}}(x). \]
The argument is short. After the E-step, the bound is tight: \(\mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x) = \log p_{\theta^{(t)}}(x)\). The M-step increases the ELBO in \(\theta\), so \(\mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x) \geq \log p_{\theta^{(t)}}(x)\). And the ELBO is a lower bound, so \(\log p_{\theta^{(t+1)}}(x) \geq \mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x) \geq \log p_{\theta^{(t)}}(x)\).
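In one line:
\[ \log p_{\theta^{(t)}}(x) \;=\; \mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x) \;\leq\; \mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x) \;\leq\; \log p_{\theta^{(t+1)}}(x), \]
where the equality is the E-step, the first inequality is the M-step, and the last inequality is the bound itself.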
Monotonicity guarantees that the sequence of log-likelihood values converges; under mild regularity conditions the iterates converge to a stationary point of the likelihood, but not necessarily to the global maximum — the likelihood surface for mixture-style models is generally non-convex with many local optima.
Complexity and Tradeoffs
The per-iteration cost is dominated by the E-step's posterior computation. For a \(K\)-component Gaussian mixture over \(n\) points in \(d\) dimensions: \(O(nK)\) density evaluations per E-step (each \(O(d^2)\) with full covariances, after an \(O(d^3)\) factorization per component), plus \(O(nKd^2)\) to accumulate the weighted statistics in the M-step.
EM’s main practical weaknesses are:
- Local optima. The likelihood surface is non-convex. Random restarts and \(k\)-means initialization are standard.
- Slow convergence near optima. EM converges only linearly, and the rate degrades as components overlap (the fraction of missing information grows); second-order methods (gradient/Hessian methods on the log-likelihood) can be faster once the right basin is found.
- Identifiability. Permutations of component labels give equivalent models. Trained parameters from different runs cannot be compared componentwise without alignment.
- Singular covariance. A component that owns very few points (or one point) can collapse to zero covariance and infinite likelihood. Regularize \(\Sigma_k\) with a small ridge, or merge underweighted components.
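In practice these mitigations are usually combined. As a sketch (assuming scikit-learn is available), scikit-learn's GaussianMixture exposes restarts, \(k\)-means initialization, and covariance regularization as parameters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))    # placeholder data

gmm = GaussianMixture(
    n_components=3,
    n_init=10,              # random restarts: keep the best of 10 EM runs
    init_params="kmeans",   # k-means initialization of the means
    reg_covar=1e-6,         # ridge added to every covariance diagonal, preventing collapse
).fit(X)

print(gmm.weights_, gmm.means_)        # learned mixing weights and component means
```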
When to Use It
| Situation | Approach |
|---|---|
| Latent-variable model, posterior closed-form | EM. |
| HMM specifically | Baum-Welch (EM specialized to chains). |
| Posterior intractable but a tractable family is close | Variational EM — see Variants. |
| Posterior intractable, deep generative model | Amortized VI (VAE) — see Variants. |
| Latent variables observed | Don't use EM — fit the complete-data likelihood by maximum likelihood directly. |
| Highly non-convex landscape, robust to local optima needed | Multiple restarts; or replace with a Bayesian method (MCMC) that explores rather than locally optimizes. |
Variants
- Variational EM. When the posterior \(p_\theta(z \mid x)\) has no closed form, restrict \(q\) to a tractable family \(\mathcal{Q}\) and optimize over \(\mathcal{Q}\) in the E-step. The bound is no longer tight, so the monotonicity argument applies to the ELBO rather than the log-likelihood. The fixed point is a stationary point of the ELBO, not of the log-likelihood.
- Amortized variational inference. Replace the per-datapoint \(q\) with a learned encoder \(q_\phi(z \mid x)\) and jointly optimize \(\theta\) and \(\phi\) by SGD on the ELBO. This is the VAE. The encoder amortizes the per-datapoint inference cost into a one-time training cost.
- Hard EM. Replace the soft posterior \(q^{(t+1)}\) with a point mass at the MAP latent \(\hat{z}_i\). With equal mixing weights and a shared isotropic covariance, hard EM for the GMM reduces to \(k\)-means (see the sketch after this list).
- Generalized EM. The M-step only needs to improve \(Q\), not maximize it. Any update that increases \(Q\) (for example, a single gradient step with a small enough step size) still gives monotone likelihood improvement.
- Stochastic EM. Run the E-step on minibatches and fold the minibatch statistics into the M-step incrementally, so large datasets do not need a full pass per update.
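For hard EM, the only change to the GMM sketch from the walkthrough is collapsing each row of responsibilities to a point mass before the M-step; the helper name below is illustrative:

```python
import numpy as np

def harden(gamma):
    """Replace soft responsibilities with a one-hot assignment to the most probable component."""
    hard = np.zeros_like(gamma)
    hard[np.arange(len(gamma)), gamma.argmax(axis=1)] = 1.0
    return hard

# In the fitting loop: pi, mu, Sigma = m_step(X, harden(gamma))
# With equal mixing weights and a shared isotropic covariance, the argmax is
# nearest-mean assignment and the loop reduces to k-means.
```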