EM Monotonically Increases the Likelihood
Claim
Let \(\theta^{(t)}\) be the parameters at iteration \(t\) of expectation-maximization, and let \(\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})\), where
\[ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p_{\theta^{(t)}}(z \mid x)}\!\left[\log p_\theta(x, z)\right]. \]
Then
\[ \log p_{\theta^{(t+1)}}(x) \geq \log p_{\theta^{(t)}}(x), \]
with equality only when \(\theta^{(t)}\) is a stationary point of the log-likelihood (Dempster et al., 1977).
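As a quick sanity check of the claim (not part of the argument), here is a minimal NumPy sketch: EM on a synthetic one-dimensional two-component Gaussian mixture, asserting at every iteration that the observed-data log-likelihood has not decreased. The data, initialization, and variable names are illustrative assumptions, not anything prescribed by the proof.

```python
# EM for a 1-D two-component Gaussian mixture; checks monotone log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

# theta = (mixing weights pi, means mu, standard deviations sigma)
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def log_lik(pi, mu, sigma):
    # log p_theta(x) = sum_i log sum_k pi_k N(x_i | mu_k, sigma_k^2)
    comp = -0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.logaddexp.reduce(np.log(pi) + comp, axis=1).sum()

prev = log_lik(pi, mu, sigma)
for t in range(50):
    # E-step: responsibilities r_ik = p_theta^(t)(z_i = k | x_i).
    comp = -0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    log_r = np.log(pi) + comp
    log_r -= np.logaddexp.reduce(log_r, axis=1, keepdims=True)
    r = np.exp(log_r)

    # M-step: closed-form maximizer of Q(theta | theta^(t)).
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    cur = log_lik(pi, mu, sigma)
    assert cur >= prev - 1e-9, "monotonicity violated"
    prev = cur
print("final log-likelihood:", prev)
```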
Proof via the ELBO
For any \(q\) and any \(\theta\),
\[ \log p_\theta(x) = \mathrm{ELBO}(q, \theta; x) + \mathrm{KL}(q(z) \,\|\, p_\theta(z \mid x)). \tag{$\ast$} \]
To see this, write \(p_\theta(x, z) = p_\theta(z \mid x)\, p_\theta(x)\), so for any \(q\) over \(z\),
\[ \mathrm{ELBO}(q, \theta; x) := \mathbb{E}_{q}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right] = \log p_\theta(x) - \mathbb{E}_{q}\!\left[\log \frac{q(z)}{p_\theta(z \mid x)}\right] = \log p_\theta(x) - \mathrm{KL}(q(z) \,\|\, p_\theta(z \mid x)), \]
which rearranges to \((\ast)\).
E-step makes the bound tight at \(\theta^{(t)}\). Set \(q^{(t+1)}(z) = p_{\theta^{(t)}}(z \mid x)\). Then \(\mathrm{KL}(q^{(t+1)} \,\|\, p_{\theta^{(t)}}(\cdot \mid x)) = 0\), so \((\ast)\) at \(\theta = \theta^{(t)}\) gives
\[ \mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x) = \log p_{\theta^{(t)}}(x). \tag{1} \]
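The decomposition \((\ast)\) and the tightness in (1) can be checked numerically on a toy discrete latent-variable model; the three-state model below is an illustrative assumption, not part of the derivation.

```python
# Numerical check of the decomposition (*) and of E-step tightness.
import numpy as np

prior = np.array([0.2, 0.5, 0.3])                 # p_theta(z)
lik = np.array([[0.1, 0.9],                       # p_theta(x | z), x in {0, 1}
                [0.7, 0.3],
                [0.5, 0.5]])
x = 1                                             # observed value

joint = prior * lik[:, x]                         # p_theta(x, z)
log_px = np.log(joint.sum())                      # log p_theta(x)
posterior = joint / joint.sum()                   # p_theta(z | x)

def elbo(q):
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

# (*) holds for an arbitrary q over z ...
q = np.array([0.3, 0.3, 0.4])
assert np.isclose(elbo(q) + kl(q, posterior), log_px)

# ... and the E-step choice q = posterior makes the bound tight, as in (1).
assert np.isclose(kl(posterior, posterior), 0.0)
assert np.isclose(elbo(posterior), log_px)
print("log p(x) =", log_px, " ELBO at posterior =", elbo(posterior))
```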
M-step increases the ELBO at fixed \(q\). With \(q^{(t+1)}\) held fixed, the ELBO is
\[ \mathrm{ELBO}(q^{(t+1)}, \theta; x) = \mathbb{E}_{q^{(t+1)}}\!\left[\log p_\theta(x, z)\right] - \mathbb{E}_{q^{(t+1)}}\!\left[\log q^{(t+1)}(z)\right] = Q(\theta \mid \theta^{(t)}) + \text{const}, \]
where the “const” is the entropy of \(q^{(t+1)}\) and does not depend on \(\theta\). So \(\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})\) also maximizes the ELBO in \(\theta\), giving
\[ \mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x) \geq \mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x). \tag{2} \]
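A similar toy check for the M-step: with \(q^{(t+1)}\) held fixed, the ELBO and \(Q\), each computed from its own definition, differ by the entropy of \(q^{(t+1)}\) at every \(\theta\), so the maximizer of \(Q\) also maximizes the ELBO and the update satisfies (2). The one-parameter model below is again an illustrative assumption.

```python
# The gap ELBO - Q is constant in theta, so maximizing Q maximizes the ELBO.
import numpy as np

x_lik = np.array([0.2, 0.8])                     # p(x = 1 | z) for z in {0, 1}, fixed

def joint(theta):
    # p_theta(x = 1, z) with prior p_theta(z = 1) = theta
    return np.array([1.0 - theta, theta]) * x_lik

theta_t = 0.1
post = joint(theta_t) / joint(theta_t).sum()     # q^(t+1) = p_{theta_t}(z | x)
entropy = -np.sum(post * np.log(post))

def Q(theta):
    # Q(theta | theta_t) = E_q[log p_theta(x, z)]
    return np.sum(post * np.log(joint(theta)))

def elbo(theta):
    # ELBO(q^(t+1), theta; x) = E_q[log p_theta(x, z) - log q(z)]
    return np.sum(post * (np.log(joint(theta)) - np.log(post)))

for theta in (0.2, 0.5, 0.8):
    assert np.isclose(elbo(theta) - Q(theta), entropy)   # gap is constant in theta

theta_next = post[1]                             # closed-form argmax of Q in this model
assert elbo(theta_next) >= elbo(theta_t)         # inequality (2)
print("theta_t:", theta_t, "-> theta_(t+1):", theta_next)
```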
ELBO is a lower bound on the log-likelihood. Applying \((\ast)\) at \(\theta = \theta^{(t+1)}\) and using the non-negativity of the KL term,
\[ \log p_{\theta^{(t+1)}}(x) \geq \mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x). \tag{3} \]
Chain. Combining (3), (2), and (1):
\[ \log p_{\theta^{(t+1)}}(x) \;\geq\; \mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x) \;\geq\; \mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x) \;=\; \log p_{\theta^{(t)}}(x). \quad \square \]
When Equality Holds
Equality in the chain requires equality in both (2) and (3). In particular, equality in (2) means \(\theta^{(t)}\) already attains the maximum of \(Q(\theta \mid \theta^{(t)})\), so (at an interior point) \(\nabla_\theta Q(\theta \mid \theta^{(t)})\big|_{\theta = \theta^{(t)}} = 0\). By an identity of Dempster–Laird–Rubin, this gradient equals \(\nabla_\theta \log p_\theta(x)\big|_{\theta = \theta^{(t)}}\), so \(\theta^{(t)}\) is a stationary point of the log-likelihood.
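The gradient identity itself can be checked numerically by finite differences; the one-parameter toy model below is an illustrative assumption.

```python
# Finite-difference check: dQ/dtheta at theta_t equals d(log p_theta(x))/dtheta at theta_t.
import numpy as np

x_lik = np.array([0.2, 0.8])                      # p(x = 1 | z), fixed

def joint(theta):
    return np.array([1.0 - theta, theta]) * x_lik  # p_theta(x = 1, z)

def log_px(theta):
    return np.log(joint(theta).sum())              # log p_theta(x)

theta_t = 0.37
post = joint(theta_t) / joint(theta_t).sum()       # p_{theta_t}(z | x)

def Q(theta):
    return np.sum(post * np.log(joint(theta)))     # Q(theta | theta_t)

eps = 1e-6
grad_Q = (Q(theta_t + eps) - Q(theta_t - eps)) / (2 * eps)
grad_ll = (log_px(theta_t + eps) - log_px(theta_t - eps)) / (2 * eps)
assert np.isclose(grad_Q, grad_ll, atol=1e-5)
print("dQ/dtheta =", grad_Q, " dlogp/dtheta =", grad_ll)
```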
Generalized EM
The argument only used inequality (2): the M-step must increase \(Q\), not necessarily maximize it. So any update with \(Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})\) is enough for monotone progress. This is called generalized EM, and it is often what is run in practice when the M-step has no closed form (e.g., one gradient step in \(\theta\) instead of full maximization); a sketch follows.
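Here is a hedged sketch of a generalized EM step on the same kind of Gaussian mixture: weights and variances are held fixed and the full M-step over the means is replaced by a single damped gradient step on \(Q\). With the per-component step size used below, the update moves each mean halfway toward the full M-step value; since \(Q\) is a concave quadratic in the means, this partial step still satisfies \(Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})\), so the likelihood remains non-decreasing. All modeling choices here are illustrative assumptions.

```python
# Generalized EM: one damped gradient step on Q instead of a full M-step.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
pi, sigma = np.array([0.5, 0.5]), np.array([1.0, 1.0])   # held fixed
mu = np.array([-0.5, 0.5])

def log_lik(mu):
    comp = -0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.logaddexp.reduce(np.log(pi) + comp, axis=1).sum()

prev = log_lik(mu)
for t in range(100):
    # E-step: responsibilities under the current parameters.
    comp = -0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    log_r = np.log(pi) + comp
    log_r -= np.logaddexp.reduce(log_r, axis=1, keepdims=True)
    r = np.exp(log_r)

    # Partial M-step: one gradient step on Q with respect to the means.
    nk = r.sum(axis=0)
    grad = (r * (x[:, None] - mu)).sum(axis=0) / sigma ** 2   # dQ/dmu
    mu = mu + (0.5 * sigma ** 2 / nk) * grad                  # half step toward the full M-step

    cur = log_lik(mu)
    assert cur >= prev - 1e-9, "monotonicity violated"
    prev = cur
print("means after generalized EM:", mu)
```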
What Monotonicity Does Not Give
Monotonicity guarantees only that the likelihood sequence is non-decreasing; under standard regularity conditions the iterates converge to a stationary point (a local maximum, saddle point, or plateau), not necessarily to the global maximum. EM is well known to get stuck in poor local optima for mixture models with many components or with poorly chosen initialization. Random restarts and good initialization (e.g., k-means for GMMs) are standard mitigations, as in the sketch below.
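For completeness, a short sketch of those mitigations, assuming scikit-learn's GaussianMixture: its n_init and init_params arguments correspond to random restarts and k-means initialization.

```python
# Random restarts (n_init) and k-means initialization (init_params) for a GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, n_init=10, init_params="kmeans", random_state=0)
gmm.fit(x)
print(gmm.means_.ravel(), gmm.weights_)
```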