Reparameterization Has Lower Variance Than the Score Function

Claim

Let \(f : \mathbb{R}^d \to \mathbb{R}\) be a differentiable function and let \(q_\phi(z)\) be a distribution that admits a reparameterization \(z = h_\phi(\varepsilon)\) with \(\varepsilon \sim p(\varepsilon)\) independent of \(\phi\). Both of the following are unbiased estimators of \(\nabla_\phi \mathbb{E}_{q_\phi}[f(z)]\):

  • Score-function (REINFORCE): \(\hat g_{\text{SF}} = f(z) \, \nabla_\phi \log q_\phi(z), \quad z \sim q_\phi.\)
  • Reparameterization (pathwise): \(\hat g_{\text{RP}} = \nabla_\phi f(h_\phi(\varepsilon)) = (\partial_\phi h_\phi(\varepsilon))^\top \nabla_z f(h_\phi(\varepsilon)), \quad \varepsilon \sim p,\) where \(\partial_\phi h_\phi\) is the Jacobian of the path with respect to \(\phi\).

The variance of \(\hat g_{\text{RP}}\) is governed by the local gradient of \(f\), while the variance of \(\hat g_{\text{SF}}\) is governed by the value of \(f\). For smooth \(f\) this typically gives the reparameterized estimator orders of magnitude smaller variance (Kingma and Welling 2013).

Both Estimators Are Unbiased

Score function. Differentiate under the integral (justified under standard regularity conditions), then apply the log-derivative trick \(\nabla_\phi q_\phi(z) = q_\phi(z) \, \nabla_\phi \log q_\phi(z)\):

\[ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \int f(z) \nabla_\phi q_\phi(z) \, dz = \int f(z) q_\phi(z) \nabla_\phi \log q_\phi(z) \, dz = \mathbb{E}_{q_\phi}[\hat g_{\text{SF}}]. \]

Reparameterization. A change of variables gives

\[ \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\varepsilon)}[f(h_\phi(\varepsilon))], \]

and now \(\phi\) appears only inside the integrand, so

\[ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{p(\varepsilon)}[\nabla_\phi f(h_\phi(\varepsilon))] = \mathbb{E}_p[\hat g_{\text{RP}}]. \]

Both are valid unbiased estimators; they differ in variance.
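
A quick numerical check makes the unbiasedness concrete. The sketch below is illustrative only: it takes \(q_\phi = \mathcal{N}(\mu, \sigma^2)\) with \(\phi = \mu\) (the running example of the next section) and a hypothetical test function \(f(z) = z^2\), chosen because the true gradient \(\partial_\mu \mathbb{E}[z^2] = 2\mu\) is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test function, chosen so the true gradient is known:
# for z ~ N(mu, sigma^2), E[z^2] = mu^2 + sigma^2, so d/dmu E[z^2] = 2*mu.
f = lambda z: z**2
df = lambda z: 2 * z

mu, sigma, n = 1.5, 0.5, 1_000_000
eps = rng.standard_normal(n)
z = mu + sigma * eps

g_sf = f(z) * (z - mu) / sigma**2   # f(z) * d/dmu log q_phi(z)
g_rp = df(z)                        # f'(mu + sigma * eps)

print(f"true gradient       : {2 * mu:.4f}")
print(f"score-function  mean: {g_sf.mean():.4f}   var: {g_sf.var():.2f}")
print(f"reparameterized mean: {g_rp.mean():.4f}   var: {g_rp.var():.2f}")
```

Both sample means converge to \(2\mu = 3\); the sample variances already differ by a large factor at this \(\sigma\).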

Where the Variance Difference Comes From

Consider the Gaussian case \(q_\phi = \mathcal{N}(\mu, \sigma^2)\) with \(\phi = \mu\), and let \(z = \mu + \sigma \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, 1)\). The two gradient estimators with respect to \(\mu\) become:

\[ \hat g_{\text{SF}} = f(z) \cdot \frac{z - \mu}{\sigma^2} = f(\mu + \sigma \varepsilon) \cdot \frac{\varepsilon}{\sigma}, \qquad \hat g_{\text{RP}} = f'(\mu + \sigma \varepsilon). \]

Both have the same mean (the true gradient \(\partial_\mu \mathbb{E}[f(z)]\), which by Stein’s lemma equals \(\mathbb{E}[f'(z)]\)). Their variances are very different.

Reparameterization variance. \(\mathrm{Var}(\hat g_{\text{RP}}) = \mathrm{Var}(f'(z))\). If \(f'\) is bounded on the support of \(q_\phi\) — say \(|f'| \leq L\) — then \(\mathrm{Var}(\hat g_{\text{RP}}) \leq L^2\), independent of \(\sigma\).

Score-function variance. \(\mathrm{Var}(\hat g_{\text{SF}}) = \mathrm{Var}(f(z)\,\varepsilon/\sigma) = \mathbb{E}[f(z)^2 \varepsilon^2]/\sigma^2 - (\partial_\mu \mathbb{E}[f(z)])^2\), and the first term dominates for small \(\sigma\). If \(|f| \leq M\), that term is at most \(M^2/\sigma^2\). Crucially, whenever \(f(\mu) \neq 0\), \(\mathrm{Var}(\hat g_{\text{SF}})\) blows up like \(f(\mu)^2/\sigma^2\) as \(\sigma \to 0\), while \(\mathrm{Var}(\hat g_{\text{RP}})\) remains bounded.

The ratio of these bounds is \((M/L)^2 / \sigma^2\). For \(f\) that is large in magnitude but slowly varying (e.g., a per-image reconstruction log-likelihood summed over pixels, where \(|f|\) is hundreds of nats but \(|f'|\) per latent coordinate is small), this ratio is enormous.
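
The blow-up is easy to see numerically. This sketch reuses the same hypothetical \(f(z) = z^2\) as above and sweeps \(\sigma\): the score-function variance grows like \(1/\sigma^2\), while the reparameterization variance stays bounded (here it even shrinks, since \(f'\) is smooth).

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical test function: f(z) = z^2, f'(z) = 2z.
f = lambda z: z**2
df = lambda z: 2 * z

mu, n = 1.5, 1_000_000
for sigma in (1.0, 0.1, 0.01):
    eps = rng.standard_normal(n)
    z = mu + sigma * eps
    var_sf = (f(z) * eps / sigma).var()   # score-function estimator
    var_rp = df(z).var()                  # reparameterization estimator
    print(f"sigma={sigma:5.2f}  Var(SF)={var_sf:14.2f}  Var(RP)={var_rp:8.4f}")
```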

A Cleaner Statement via Stein’s Lemma

For Gaussian \(q\), Stein’s lemma gives

\[ \mathbb{E}\!\left[\frac{f(z)(z - \mu)}{\sigma^2}\right] = \mathbb{E}[f'(z)], \]

so the score-function and reparameterization estimators have the same expectation. But the score-function form \(f(z)(z - \mu)/\sigma^2\) multiplies \(f\) by an unbounded random variable \((z - \mu)/\sigma^2\), while \(f'(z)\) involves no such factor. The score-function form reintroduces noise that the reparameterization integrates out analytically.

This is the underlying mechanism: reparameterization performs (locally) an analytic integration that the score-function estimator leaves to Monte Carlo.
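
For completeness, the lemma itself is a single integration by parts, using the Gaussian identity \(\partial_z q(z) = -\frac{z - \mu}{\sigma^2}\, q(z)\):

\[ \mathbb{E}\!\left[\frac{f(z)(z - \mu)}{\sigma^2}\right] = -\int f(z)\, \partial_z q(z)\, dz = \Big[-f(z)\, q(z)\Big]_{-\infty}^{\infty} + \int f'(z)\, q(z)\, dz = \mathbb{E}[f'(z)], \]

where the boundary term vanishes whenever \(f\) grows at most polynomially, since the Gaussian tails dominate.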

Why “Two to Three Orders of Magnitude”

Empirical reports for VAE training commonly cite variance reductions of two to three orders of magnitude, and sometimes up to \(10^4\). The argument above explains the rough scale: the factor \(1/\sigma^2\) alone contributes \(10^2\) when \(\sigma \approx 0.1\), and any gap between the scale of \(f\) and the scale of \(f'\) multiplies in as \((M/L)^2\), so a modest \(M/L \approx 10\) already gives \((M/L)^2 / \sigma^2 \approx 10^4\); an uncentered per-image log-likelihood of hundreds of nats against \(f' = O(1)\) per latent coordinate pushes the crude bound higher still. The exact factor depends on the model and the data, but the dependence on \(\sigma\) and on the smoothness of \(f\) is universal.
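
A back-of-envelope check of these magnitudes (the values of \(M\), \(L\), and \(\sigma\) below are illustrative, not measurements):

```python
# Predicted variance ratio (M/L)^2 / sigma^2 for illustrative magnitudes.
for M, L, sigma in [(10.0, 1.0, 0.1), (100.0, 1.0, 0.1)]:
    ratio = (M / L) ** 2 / sigma ** 2
    print(f"M={M:6.1f}  L={L:.1f}  sigma={sigma}:  ratio ~ {ratio:.0e}")
```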

Caveats

  • Bounded \(f\) is a real condition. For unbounded \(f\) (e.g., heavy-tailed log-likelihoods), the score-function variance can be infinite while the reparameterization variance is finite. The argument above is only one direction.
  • Reparameterization requires a differentiable path. When \(z\) is discrete this construction does not exist; the score-function estimator is the only fully general option, possibly with control variates to mitigate its variance (a minimal baseline sketch follows this list).
  • The argument is a smoothness statement about \(f\), not about the distribution. It is the local Lipschitz constant of \(f\) that controls the variance of \(\hat g_{\text{RP}}\): the variance depends on how \(f\) changes, not on its absolute level.
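
As a minimal illustration of the control-variate remark above, here is a sketch of the score-function estimator with a constant baseline \(b\). It reuses the Gaussian case for continuity with the rest of the note, though the same construction applies verbatim to discrete \(z\); the choice \(b \approx \mathbb{E}[f]\) is a common simple heuristic, not the variance-optimal baseline.

```python
import numpy as np

rng = np.random.default_rng(2)

# Same hypothetical test function as in the earlier sketches.
f = lambda z: z**2
mu, sigma, n = 1.5, 0.5, 1_000_000

eps = rng.standard_normal(n)
z = mu + sigma * eps
score = (z - mu) / sigma**2   # d/dmu log q_phi(z) for q = N(mu, sigma^2)

# Any constant baseline b leaves the mean unchanged, because
# E[b * score] = b * E[score] = 0.  Here b is estimated from the same
# batch, which adds a small O(1/n) coupling; in practice a running
# average over past batches keeps the estimator exactly unbiased.
b = f(z).mean()

g_plain    = f(z) * score
g_baseline = (f(z) - b) * score

print(f"plain:    mean {g_plain.mean():.4f}  var {g_plain.var():.2f}")
print(f"baseline: mean {g_baseline.mean():.4f}  var {g_baseline.var():.2f}")
```

The baseline removes the \(f(\mu)^2/\sigma^2\) contribution that dominates the score-function variance, but the remaining variance still scales with \(1/\sigma^2\) times the local variation of \(f\).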

References

Kingma, Diederik P., and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv Preprint arXiv:1312.6114. https://arxiv.org/abs/1312.6114.