The Policy Gradient Theorem

Claim

Let \(\pi_\theta(a \mid s)\) be a stochastic policy, differentiable in \(\theta\), on a Markov decision process with transition kernel \(P(s' \mid s, a)\), reward \(R(s, a)\), discount \(\gamma \in [0, 1)\), and initial-state distribution \(\rho_0\). Let \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\), with \(r_t = R(s_t, a_t)\), denote a trajectory and \(R(\tau) = \sum_{t \ge 0} \gamma^t r_t\) the discounted return. Define \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\).

Then

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right], \]

where \(Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_t = s,\, a_t = a\right]\) is the action-value function under \(\pi_\theta\) (Sutton et al. 2000).

Setup

The probability of a trajectory under \(\pi_\theta\) is

\[ P_\theta(\tau) = \rho_0(s_0) \prod_{t \ge 0} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t). \]

Taking the logarithm:

\[ \log P_\theta(\tau) = \log \rho_0(s_0) + \sum_{t \ge 0} \log \pi_\theta(a_t \mid s_t) + \sum_{t \ge 0} \log P(s_{t+1} \mid s_t, a_t). \]

The initial-state and transition terms do not depend on \(\theta\), so

\[ \nabla_\theta \log P_\theta(\tau) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t). \]

This identity is the reason the model dynamics never appear in the policy gradient.
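
As a sanity check, this identity can be verified numerically on a toy problem. The sketch below is entirely illustrative (a random tabular MDP, a softmax policy, and a hand-picked trajectory): it compares a finite-difference gradient of the full trajectory log-probability, transition terms included, against the analytic sum of policy score vectors.

```python
# Illustrative check: grad of log P_theta(tau) equals the sum of policy scores,
# because the initial-state and transition terms do not depend on theta.
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2
theta = rng.normal(size=(S, A))                     # softmax policy parameters
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a] is a distribution over s'
rho0 = np.array([1.0, 0.0, 0.0])                    # deterministic start in state 0

def log_pi(theta, s, a):
    logits = theta[s]
    return logits[a] - np.log(np.sum(np.exp(logits)))

def score(theta, s, a):
    """grad_theta log pi_theta(a | s) for a tabular softmax policy."""
    g = np.zeros_like(theta)
    probs = np.exp(theta[s]) / np.sum(np.exp(theta[s]))
    g[s] = -probs
    g[s, a] += 1.0
    return g

# One fixed trajectory, written as (s_t, a_t, s_{t+1}) triples.
traj = [(0, 1, 2), (2, 0, 1), (1, 1, 0)]

def log_P_tau(theta):
    """Full trajectory log-probability, including the theta-independent terms."""
    lp = np.log(rho0[traj[0][0]])
    for s, a, s_next in traj:
        lp += log_pi(theta, s, a) + np.log(P[s, a, s_next])
    return lp

analytic = sum(score(theta, s, a) for s, a, _ in traj)

eps = 1e-6
numeric = np.zeros_like(theta)
for idx in np.ndindex(*theta.shape):
    d = np.zeros_like(theta)
    d[idx] = eps
    numeric[idx] = (log_P_tau(theta + d) - log_P_tau(theta - d)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```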

The Score-Function (Log-Derivative) Trick

For any family of densities \(p_\theta(x)\), differentiable in \(\theta\) and strictly positive wherever it is evaluated,

\[ \nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x). \]

Hence for any \(\theta\)-independent, integrable function \(f\),

\[ \nabla_\theta \mathbb{E}_{p_\theta}[f(X)] = \nabla_\theta \int f(x) p_\theta(x)\, dx = \int f(x) \nabla_\theta p_\theta(x)\, dx = \mathbb{E}_{p_\theta}[f(X)\, \nabla_\theta \log p_\theta(X)], \]

assuming the order of differentiation and integration can be exchanged (dominated convergence — true for finite MDPs with bounded rewards and sufficiently regular policies).
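
The identity is easy to check numerically. The sketch below assumes a softmax-parameterized categorical distribution over \(K\) outcomes and an arbitrary \(f\); both sides are computed exactly by enumerating the outcomes, with the left-hand side approximated by finite differences.

```python
# Sanity check of grad E_p[f] = E_p[f * grad log p] for a softmax categorical.
import numpy as np

rng = np.random.default_rng(1)
K = 5
theta = rng.normal(size=K)                 # logits
f = rng.normal(size=K)                     # arbitrary function of the outcome

def probs(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = probs(theta)

# Right-hand side: E_p[f(X) * grad_theta log p_theta(X)],
# using grad_theta log p_theta(x) = e_x - p for a softmax.
rhs = sum(p[x] * f[x] * (np.eye(K)[x] - p) for x in range(K))

# Left-hand side: finite-difference gradient of E_p[f(X)] = sum_x f(x) p_theta(x).
eps = 1e-6
lhs = np.zeros(K)
for i in range(K):
    d = np.zeros(K)
    d[i] = eps
    lhs[i] = (f @ probs(theta + d) - f @ probs(theta - d)) / (2 * eps)

assert np.allclose(lhs, rhs, atol=1e-5)
```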

Proof

Step 1: Apply the score-function trick.

\[ \nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\!\bigl[R(\tau)\, \nabla_\theta \log P_\theta(\tau)\bigr]. \]

Substituting the expression for \(\nabla_\theta \log P_\theta(\tau)\) from the setup:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]. \]

Expand \(R(\tau) = \sum_{k \ge 0} \gamma^k r_k\) and distribute it over the sum on \(t\):

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k \ge 0} \gamma^k r_k\right]. \]

Step 2: Causality — drop terms with \(k < t\).

Within the expectation, consider the cross term \(\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \gamma^k r_k\) for \(k < t\). The reward \(r_k\) depends only on \((s_0, a_0, \ldots, s_k, a_k)\) — quantities determined before \(a_t\) is sampled. Conditional on \(\mathcal{F}_t = \sigma(s_0, a_0, \ldots, s_t)\), the score \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) has zero mean:

\[ \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\bigl[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\bigr] = \sum_a \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)} = \sum_a \nabla_\theta \pi_\theta(a \mid s_t) = \nabla_\theta\!\sum_a \pi_\theta(a \mid s_t) = \nabla_\theta(1) = 0. \]

So for \(k < t\), since \(r_k = R(s_k, a_k)\) is \(\mathcal{F}_t\)-measurable, the tower property gives \(\mathbb{E}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \gamma^k r_k] = \mathbb{E}[\gamma^k r_k \cdot \mathbb{E}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid \mathcal{F}_t]]\), and the inner expectation is zero. All cross terms with \(k < t\) therefore vanish:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k \ge t} \gamma^k r_k\right]. \]

Pull out \(\gamma^t\):

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k \ge t} \gamma^{k - t} r_k\right]. \]

The inner sum is the future return from time \(t\), which we denote \(G_t = \sum_{k \ge t} \gamma^{k-t} r_k\).
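
In code, \(G_t\) is computed from a sampled reward sequence with the backward recursion \(G_t = r_t + \gamma G_{t+1}\); a minimal helper (the array names are illustrative) might look like this.

```python
# Discounted reward-to-go G_t = sum_{k >= t} gamma^(k-t) r_k via backward recursion.
import numpy as np

def rewards_to_go(rewards, gamma):
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Example: rewards [1, 0, 2] with gamma = 0.9 give
# G_2 = 2, G_1 = 0 + 0.9 * 2 = 1.8, G_0 = 1 + 0.9 * 1.8 = 2.62.
print(rewards_to_go(np.array([1.0, 0.0, 2.0]), 0.9))   # approximately [2.62, 1.8, 2.0]
```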

Step 3: Replace \(G_t\) with \(Q^{\pi_\theta}(s_t, a_t)\).

Apply the tower property of conditional expectation, conditioning on \((s_t, a_t)\):

\[ \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right] = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \mathbb{E}[G_t \mid s_t, a_t]\right] = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \]

The score \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) is measurable with respect to \(\sigma(s_t, a_t)\), so it can be pulled outside the inner expectation; and \(\mathbb{E}[G_t \mid s_t, a_t] = Q^{\pi_\theta}(s_t, a_t)\) by definition.

Combining:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \qquad \blacksquare \]

The \(\gamma^t\) factor is sometimes absorbed into a “discounted state distribution” formulation; many practical algorithms drop it implicitly (treating every time step as equally important), which makes the estimator biased for the discounted objective \(J(\theta)\) but is the empirical norm.
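
A sketch of the resulting single-trajectory estimators is given below. It assumes a `score(theta, s, a)` function returning \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) (as in the earlier snippet) and arrays of states, actions, and rewards from one sampled rollout; all names are illustrative. The three returned quantities correspond to Step 1 (full return), the theorem as proved, and the common variant that drops \(\gamma^t\).

```python
# Single-trajectory policy-gradient estimators (illustrative sketch).
import numpy as np

def pg_estimators(theta, states, actions, rewards, gamma, score):
    T = len(rewards)
    G = np.zeros(T)                                   # reward-to-go G_t
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    R_tau = G[0]                                      # full discounted return R(tau)

    scores = [score(theta, states[t], actions[t]) for t in range(T)]
    step1 = sum(sc * R_tau for sc in scores)                                 # unbiased, highest variance
    theorem = sum((gamma ** t) * sc * G[t] for t, sc in enumerate(scores))   # the result proved above
    practical = sum(sc * G[t] for t, sc in enumerate(scores))                # gamma^t dropped (biased for J)
    return step1, theorem, practical
```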

Remarks

Baseline invariance. Subtracting any function \(b(s)\) of the state alone (one that does not depend on the action) from \(Q^{\pi_\theta}(s_t, a_t)\) inside the expectation leaves the gradient unchanged, because \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)] = b(s) \cdot 0 = 0\) by the calculation above. This is the standard variance-reduction tool in practical policy-gradient algorithms. The natural choice \(b(s) = V^{\pi_\theta}(s)\) replaces \(Q^{\pi_\theta}\) with the advantage \(A^{\pi_\theta} = Q^{\pi_\theta} - V^{\pi_\theta}\).
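
An exact check of this invariance, for an assumed single state with a softmax policy over a few actions and arbitrary stand-in values for \(Q\) and the baseline, is sketched below.

```python
# Subtracting an action-independent baseline leaves the exact gradient unchanged,
# because sum_a pi(a|s) * grad log pi(a|s) = 0.
import numpy as np

rng = np.random.default_rng(2)
A = 4
theta = rng.normal(size=A)                 # logits for a single state
Q = rng.normal(size=A)                     # stand-in action values
b = 1.7                                    # any baseline independent of the action

p = np.exp(theta) / np.sum(np.exp(theta))
scores = np.eye(A) - p                     # row a: grad_theta log pi(a|s)

with_baseline = sum(p[a] * scores[a] * (Q[a] - b) for a in range(A))
without_baseline = sum(p[a] * scores[a] * Q[a] for a in range(A))

assert np.allclose(with_baseline, without_baseline)
assert np.allclose(sum(p[a] * scores[a] for a in range(A)), 0.0)
```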

Model-free. The derivation never needed \(\nabla_\theta \log P(s' \mid s, a)\) — the transition dynamics were eliminated in the very first step because \(P\) does not depend on \(\theta\). This is why policy gradient is the natural method for problems where the dynamics are unknown or non-differentiable.

Continuous actions. The derivation made no assumption that the action space is finite. Replacing the sum over actions with an integral and using a continuous density \(\pi_\theta(a \mid s)\) gives the same result. This is one reason policy gradient is preferred over Q-learning for continuous control.
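
For concreteness, the standard continuous-control example is a Gaussian policy with state-dependent mean \(\mu_\theta(s)\) and, for simplicity here, a fixed variance \(\sigma^2\):

\[ \pi_\theta(a \mid s) = \mathcal{N}\bigl(a;\, \mu_\theta(s),\, \sigma^2\bigr), \qquad \nabla_\theta \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^2}\, \nabla_\theta \mu_\theta(s), \]

and this score is substituted directly into the theorem.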

References

Sutton, Richard S., David McAllester, Satinder Singh, and Yishay Mansour. 2000. “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” Advances in Neural Information Processing Systems 12, 1057–1063. https://proceedings.neurips.cc/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html.