EM Monotonically Increases the Likelihood
Claim
Let \(\theta^{(t)}\) be the parameters at iteration \(t\) of expectation-maximization, and let \(\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})\), where
\[ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{z \sim p_{\theta^{(t)}}(z \mid x)}\!\left[\log p_\theta(x, z)\right]. \]
Then
\[ \log p_{\theta^{(t+1)}}(x) \geq \log p_{\theta^{(t)}}(x), \]
with equality only when \(\theta^{(t)}\) is a stationary point of the log-likelihood (Dempster et al., 1977).
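As a quick sanity check of the claim (not part of the argument), here is a minimal NumPy sketch: EM on a synthetic one-dimensional two-component Gaussian mixture, asserting at every iteration that the observed-data log-likelihood has not decreased. The data, initialization, and variable names are illustrative assumptions, not anything prescribed by the proof.

```python
# EM for a 1-D two-component Gaussian mixture; checks monotone log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

# theta = (mixing weights pi, means mu, standard deviations sigma)
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def log_lik(pi, mu, sigma):
    # log p_theta(x) = sum_i log sum_k pi_k N(x_i | mu_k, sigma_k^2)
    comp = -0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.logaddexp.reduce(np.log(pi) + comp, axis=1).sum()

prev = log_lik(pi, mu, sigma)
for t in range(50):
    # E-step: responsibilities r_ik = p_theta^(t)(z_i = k | x_i).
    comp = -0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    log_r = np.log(pi) + comp
    log_r -= np.logaddexp.reduce(log_r, axis=1, keepdims=True)
    r = np.exp(log_r)

    # M-step: closed-form maximizer of Q(theta | theta^(t)).
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

    cur = log_lik(pi, mu, sigma)
    assert cur >= prev - 1e-9, "monotonicity violated"
    prev = cur
print("final log-likelihood:", prev)
```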
Proof via the ELBO
For any \(q\) and any \(\theta\),
\[ \log p_\theta(x) = \mathrm{ELBO}(q, \theta; x) + \mathrm{KL}(q(z) \,\|\, p_\theta(z \mid x)). \tag{$\ast$} \]
To see this, write \(p_\theta(x, z) = p_\theta(z \mid x)\, p_\theta(x)\), so for any \(q\) over \(z\),
\[ \mathrm{ELBO}(q, \theta; x) := \mathbb{E}_{q}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right] = \log p_\theta(x) - \mathbb{E}_{q}\!\left[\log \frac{q(z)}{p_\theta(z \mid x)}\right] = \log p_\theta(x) - \mathrm{KL}(q(z) \,\|\, p_\theta(z \mid x)), \]
which rearranges to \((\ast)\).
E-step makes the bound tight at \(\theta^{(t)}\). Set \(q^{(t+1)}(z) = p_{\theta^{(t)}}(z \mid x)\). Then \(\mathrm{KL}(q^{(t+1)} \,\|\, p_{\theta^{(t)}}(\cdot \mid x)) = 0\), so \((\ast)\) at \(\theta = \theta^{(t)}\) gives
\[ \mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x) = \log p_{\theta^{(t)}}(x). \tag{1} \]
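The decomposition \((\ast)\) and the tightness in (1) can be checked numerically on a toy discrete latent-variable model; the three-state model below is an illustrative assumption, not part of the derivation.

```python
# Numerical check of the decomposition (*) and of E-step tightness.
import numpy as np

prior = np.array([0.2, 0.5, 0.3])                 # p_theta(z)
lik = np.array([[0.1, 0.9],                       # p_theta(x | z), x in {0, 1}
                [0.7, 0.3],
                [0.5, 0.5]])
x = 1                                             # observed value

joint = prior * lik[:, x]                         # p_theta(x, z)
log_px = np.log(joint.sum())                      # log p_theta(x)
posterior = joint / joint.sum()                   # p_theta(z | x)

def elbo(q):
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

# (*) holds for an arbitrary q over z ...
q = np.array([0.3, 0.3, 0.4])
assert np.isclose(elbo(q) + kl(q, posterior), log_px)

# ... and the E-step choice q = posterior makes the bound tight, as in (1).
assert np.isclose(kl(posterior, posterior), 0.0)
assert np.isclose(elbo(posterior), log_px)
print("log p(x) =", log_px, " ELBO at posterior =", elbo(posterior))
```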
M-step increases the ELBO at fixed \(q\). With \(q^{(t+1)}\) held fixed, the ELBO is
\[ \mathrm{ELBO}(q^{(t+1)}, \theta; x) = \mathbb{E}_{q^{(t+1)}}\!\left[\log p_\theta(x, z)\right] - \mathbb{E}_{q^{(t+1)}}\!\left[\log q^{(t+1)}(z)\right] = Q(\theta \mid \theta^{(t)}) + \text{const}, \]
where the “const” is the entropy of \(q^{(t+1)}\) and does not depend on \(\theta\). So \(\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})\) also maximizes the ELBO in \(\theta\), giving
\[ \mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x) \geq \mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x). \tag{2} \]
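A similar toy check for the M-step: with \(q^{(t+1)}\) held fixed, the ELBO and \(Q\), each computed from its own definition, differ by the entropy of \(q^{(t+1)}\) at every \(\theta\), so the maximizer of \(Q\) also maximizes the ELBO and the update satisfies (2). The one-parameter model below is again an illustrative assumption.

```python
# The gap ELBO - Q is constant in theta, so maximizing Q maximizes the ELBO.
import numpy as np

x_lik = np.array([0.2, 0.8])                     # p(x = 1 | z) for z in {0, 1}, fixed

def joint(theta):
    # p_theta(x = 1, z) with prior p_theta(z = 1) = theta
    return np.array([1.0 - theta, theta]) * x_lik

theta_t = 0.1
post = joint(theta_t) / joint(theta_t).sum()     # q^(t+1) = p_{theta_t}(z | x)
entropy = -np.sum(post * np.log(post))

def Q(theta):
    # Q(theta | theta_t) = E_q[log p_theta(x, z)]
    return np.sum(post * np.log(joint(theta)))

def elbo(theta):
    # ELBO(q^(t+1), theta; x) = E_q[log p_theta(x, z) - log q(z)]
    return np.sum(post * (np.log(joint(theta)) - np.log(post)))

for theta in (0.2, 0.5, 0.8):
    assert np.isclose(elbo(theta) - Q(theta), entropy)   # gap is constant in theta

theta_next = post[1]                             # closed-form argmax of Q in this model
assert elbo(theta_next) >= elbo(theta_t)         # inequality (2)
print("theta_t:", theta_t, "-> theta_(t+1):", theta_next)
```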
ELBO is a lower bound on the log-likelihood. Applying \((\ast)\) at \(\theta = \theta^{(t+1)}\) and using the non-negativity of the KL term,
\[ \log p_{\theta^{(t+1)}}(x) \geq \mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x). \tag{3} \]
Chain. Combining (3), (2), and (1):
\[ \log p_{\theta^{(t+1)}}(x) \;\geq\; \mathrm{ELBO}(q^{(t+1)}, \theta^{(t+1)}; x) \;\geq\; \mathrm{ELBO}(q^{(t+1)}, \theta^{(t)}; x) \;=\; \log p_{\theta^{(t)}}(x). \quad \square \]
When Equality Holds
Equality in the chain requires equality in both (2) and (3). In particular, equality in (2) means \(\theta^{(t)}\) already attains the maximum of \(Q(\theta \mid \theta^{(t)})\), so (at an interior point) \(\nabla_\theta Q(\theta \mid \theta^{(t)})\big|_{\theta = \theta^{(t)}} = 0\). By an identity of Dempster–Laird–Rubin, this gradient equals \(\nabla_\theta \log p_\theta(x)\big|_{\theta = \theta^{(t)}}\), so \(\theta^{(t)}\) is a stationary point of the log-likelihood.
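The gradient identity itself can be checked numerically by finite differences; the one-parameter toy model below is an illustrative assumption.

```python
# Finite-difference check: dQ/dtheta at theta_t equals d(log p_theta(x))/dtheta at theta_t.
import numpy as np

x_lik = np.array([0.2, 0.8])                      # p(x = 1 | z), fixed

def joint(theta):
    return np.array([1.0 - theta, theta]) * x_lik  # p_theta(x = 1, z)

def log_px(theta):
    return np.log(joint(theta).sum())              # log p_theta(x)

theta_t = 0.37
post = joint(theta_t) / joint(theta_t).sum()       # p_{theta_t}(z | x)

def Q(theta):
    return np.sum(post * np.log(joint(theta)))     # Q(theta | theta_t)

eps = 1e-6
grad_Q = (Q(theta_t + eps) - Q(theta_t - eps)) / (2 * eps)
grad_ll = (log_px(theta_t + eps) - log_px(theta_t - eps)) / (2 * eps)
assert np.isclose(grad_Q, grad_ll, atol=1e-5)
print("dQ/dtheta =", grad_Q, " dlogp/dtheta =", grad_ll)
```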
Generalized EM
The argument only used inequality (2): the M-step must increase \(Q\), not necessarily maximize it. So any update with \(Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})\) is enough for monotone progress. This is called generalized EM, and it is often what is run in practice when the M-step has no closed form (e.g., one gradient step in \(\theta\) instead of full maximization); a sketch follows.
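Here is a hedged sketch of a generalized EM step on the same kind of Gaussian mixture: weights and variances are held fixed and the full M-step over the means is replaced by a single damped gradient step on \(Q\). With the per-component step size used below, the update moves each mean halfway toward the full M-step value; since \(Q\) is a concave quadratic in the means, this partial step still satisfies \(Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})\), so the likelihood remains non-decreasing. All modeling choices here are illustrative assumptions.

```python
# Generalized EM: one damped gradient step on Q instead of a full M-step.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
pi, sigma = np.array([0.5, 0.5]), np.array([1.0, 1.0])   # held fixed
mu = np.array([-0.5, 0.5])

def log_lik(mu):
    comp = -0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.logaddexp.reduce(np.log(pi) + comp, axis=1).sum()

prev = log_lik(mu)
for t in range(100):
    # E-step: responsibilities under the current parameters.
    comp = -0.5 * ((x[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    log_r = np.log(pi) + comp
    log_r -= np.logaddexp.reduce(log_r, axis=1, keepdims=True)
    r = np.exp(log_r)

    # Partial M-step: one gradient step on Q with respect to the means.
    nk = r.sum(axis=0)
    grad = (r * (x[:, None] - mu)).sum(axis=0) / sigma ** 2   # dQ/dmu
    mu = mu + (0.5 * sigma ** 2 / nk) * grad                  # half step toward the full M-step

    cur = log_lik(mu)
    assert cur >= prev - 1e-9, "monotonicity violated"
    prev = cur
print("means after generalized EM:", mu)
```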
What Monotonicity Does Not Give
Monotonicity guarantees only that the likelihood sequence is non-decreasing; under standard regularity conditions the iterates converge to a stationary point (a local maximum, saddle point, or plateau), not necessarily to the global maximum. EM is well known to get stuck in poor local optima for mixture models with many components or with poorly chosen initialization. Random restarts and good initialization (e.g., k-means for GMMs) are standard mitigations, as in the sketch below.
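For completeness, a short sketch of those mitigations, assuming scikit-learn's GaussianMixture: its n_init and init_params arguments correspond to random restarts and k-means initialization.

```python
# Random restarts (n_init) and k-means initialization (init_params) for a GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, n_init=10, init_params="kmeans", random_state=0)
gmm.fit(x)
print(gmm.means_.ravel(), gmm.weights_)
```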