The Policy Gradient Theorem
Claim
Let \(\pi_\theta(a \mid s)\) be a stochastic policy, differentiable in \(\theta\), on a Markov decision process with transition kernel \(P(s' \mid s, a)\), reward \(R(s, a)\), discount \(\gamma \in [0, 1)\), and initial-state distribution \(\rho_0\). Let \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) denote a trajectory and \(R(\tau) = \sum_{t \ge 0} \gamma^t r_t\) the discounted return. Define \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\).
Then
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right], \]
where \(Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_t = s,\, a_t = a\right]\) is the action-value function under \(\pi_\theta\) (Sutton et al. 2000).
Setup
The probability of a trajectory under \(\pi_\theta\) is
\[ P_\theta(\tau) = \rho_0(s_0) \prod_{t \ge 0} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t). \]
Taking the logarithm and the gradient with respect to \(\theta\):
\[ \log P_\theta(\tau) = \log \rho_0(s_0) + \sum_{t \ge 0} \log \pi_\theta(a_t \mid s_t) + \sum_{t \ge 0} \log P(s_{t+1} \mid s_t, a_t). \]
The initial-state and transition terms do not depend on \(\theta\), so
\[ \nabla_\theta \log P_\theta(\tau) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t). \]
This identity is the reason the model dynamics never appear in the policy gradient.
The Score-Function (Log-Derivative) Trick
For any density \(p_\theta(x)\) with \(p_\theta > 0\) wherever it appears,
\[ \nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x). \]
Hence, for any function \(f\) that does not depend on \(\theta\),
\[ \nabla_\theta \mathbb{E}_{p_\theta}[f(X)] = \nabla_\theta \int f(x) p_\theta(x)\, dx = \int f(x) \nabla_\theta p_\theta(x)\, dx = \mathbb{E}_{p_\theta}[f(X)\, \nabla_\theta \log p_\theta(X)], \]
assuming the order of differentiation and integration can be exchanged (dominated convergence — true for finite MDPs with bounded rewards and sufficiently regular policies).
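The identity can be verified exactly on a finite support, where the expectation is a plain sum. This sketch uses a hypothetical 3-point categorical distribution with softmax logits and an arbitrary \(f\): the left side is a finite-difference gradient of \(\mathbb{E}[f]\), the right side is the score-form expectation summed in closed form.

```python
import numpy as np

theta = np.array([0.2, -1.0, 0.5])   # softmax logits of a 3-point categorical
f = np.array([1.0, 3.0, -2.0])       # arbitrary function values f(x)

def p(th):
    e = np.exp(th - th.max())
    return e / e.sum()

def expect_f(th):
    return p(th) @ f

probs = p(theta)
scores = np.eye(3) - probs           # row x: grad_theta log p(x) for softmax logits
rhs = (probs * f) @ scores           # E[f(X) grad log p(X)], summed exactly

eps = 1e-6                           # central finite differences of grad E[f(X)]
lhs = np.array([(expect_f(theta + eps * e) - expect_f(theta - eps * e)) / (2 * eps)
                for e in np.eye(3)])

assert np.allclose(lhs, rhs, atol=1e-6)
```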
Proof
Step 1: Apply the score-function trick.
\[ \nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\!\bigl[R(\tau)\, \nabla_\theta \log P_\theta(\tau)\bigr]. \]
Substituting the expression for \(\nabla_\theta \log P_\theta(\tau)\) from the setup:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[R(\tau) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]. \]
Expand \(R(\tau) = \sum_{k \ge 0} \gamma^k r_k\) and exchange the order of summation:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k \ge 0} \gamma^k r_k\right]. \]
Step 2: Causality — drop terms with \(k < t\).
Within the expectation, consider the cross term \(\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \gamma^k r_k\) for \(k < t\). The reward \(r_k\) depends only on \((s_0, a_0, \ldots, s_k, a_k)\) — quantities determined before \(a_t\) is sampled. Conditional on \(\mathcal{F}_t = \sigma(s_0, a_0, \ldots, s_t)\), the score \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) has zero mean:
\[ \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\bigl[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\bigr] = \sum_a \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi_\theta(a \mid s_t)}{\pi_\theta(a \mid s_t)} = \sum_a \nabla_\theta \pi_\theta(a \mid s_t) = \nabla_\theta\!\sum_a \pi_\theta(a \mid s_t) = \nabla_\theta(1) = 0. \]
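The telescoping in that chain of equalities is concrete enough to check in a few lines, here for a softmax policy at a single state (an illustrative choice, not part of the proof):

```python
import numpy as np

# Zero-mean score identity for a softmax policy at one state.
theta = np.array([0.7, -0.3, 1.2])    # logits over 3 actions
p = np.exp(theta - theta.max()); p /= p.sum()
scores = np.eye(3) - p                # row a: grad_theta log pi(a|s)
assert np.allclose(p @ scores, 0.0)   # sum_a pi(a|s) grad log pi(a|s) = 0
```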
For \(k < t\), the reward \(r_k\) is \(\mathcal{F}_t\)-measurable, so the tower property gives \(\mathbb{E}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \gamma^k r_k] = \mathbb{E}[\gamma^k r_k \cdot \mathbb{E}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) \mid \mathcal{F}_t]]\), and the inner expectation is zero by the calculation above. Cross terms with \(k < t\) therefore vanish:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k \ge t} \gamma^k r_k\right]. \]
Pull out \(\gamma^t\):
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k \ge t} \gamma^{k - t} r_k\right]. \]
The inner sum is the future return from time \(t\), which we denote \(G_t = \sum_{k \ge t} \gamma^{k-t} r_k\).
Step 3: Replace \(G_t\) with \(Q^{\pi_\theta}(s_t, a_t)\).
Apply the tower property of conditional expectation, conditioning on \((s_t, a_t)\):
\[ \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right] = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \mathbb{E}[G_t \mid s_t, a_t]\right] = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \]
The score \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) is measurable with respect to \(\sigma(s_t, a_t)\), so it can be pulled outside the inner expectation; and \(\mathbb{E}[G_t \mid s_t, a_t] = Q^{\pi_\theta}(s_t, a_t)\) by definition.
Combining:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \qquad \blacksquare \]
The \(\gamma^t\) factor is sometimes absorbed into a “discounted state distribution” formulation; many practical algorithms drop it implicitly (treating each step as equally important), which biases the estimator slightly but is the empirical norm.
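The full identity, \(\gamma^t\) factor included, can be verified by brute-force enumeration of trajectories on a small MDP. This sketch assumes a random 2-state, 2-action MDP with a tabular softmax policy and a horizon-3 truncation (all illustrative choices, not part of the theorem); it computes both \(J(\theta)\) and the policy-gradient expectation exactly, then compares against finite differences.

```python
import numpy as np
from itertools import product

# Brute-force check of grad J = E[sum_t gamma^t grad log pi(a_t|s_t) G_t].
rng = np.random.default_rng(1)
nS, nA, H, gamma = 2, 2, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # transition kernel P(s'|s,a)
R = rng.normal(size=(nS, nA))                    # deterministic reward R(s,a)
rho0 = np.array([0.6, 0.4])                      # initial-state distribution
theta = rng.normal(size=(nS, nA))                # softmax logits per state

def pi(th):
    e = np.exp(th - th.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def exact(th):
    """J(theta) and E[sum_t gamma^t grad log pi * G_t], by enumeration."""
    p = pi(th)
    J, grad = 0.0, np.zeros_like(th)
    ranges = ([range(nA), range(nS)] * (H - 1)) + [range(nA)]
    for s0 in range(nS):
        for seq in product(*ranges):             # (a_0, s_1, a_1, ..., a_{H-1})
            actions, states = list(seq[0::2]), [s0] + list(seq[1::2])
            prob = rho0[s0]
            for t in range(H):
                prob *= p[states[t], actions[t]]
                if t + 1 < H:
                    prob *= P[states[t], actions[t], states[t + 1]]
            rewards = [R[s, a] for s, a in zip(states, actions)]
            J += prob * sum(gamma**t * r for t, r in enumerate(rewards))
            g = np.zeros_like(th)
            for t in range(H):
                G_t = sum(gamma**(k - t) * rewards[k] for k in range(t, H))
                score = -p[states[t]].copy()     # grad_{theta[s_t]} log pi(a_t|s_t)
                score[actions[t]] += 1.0
                g[states[t]] += gamma**t * G_t * score
            grad += prob * g
    return J, grad

_, pg = exact(theta)
eps, fd = 1e-6, np.zeros_like(theta)             # finite differences of J(theta)
for i in range(nS):
    for j in range(nA):
        e = np.zeros_like(theta); e[i, j] = eps
        fd[i, j] = (exact(theta + e)[0] - exact(theta - e)[0]) / (2 * eps)

assert np.allclose(fd, pg, atol=1e-5)
```

Replacing `G_t` with the exact \(Q^{\pi_\theta}(s_t, a_t)\) (Step 3) would leave the expectation unchanged; using the sampled `G_t` directly is what REINFORCE does in practice.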
Remarks
Baseline invariance. Subtracting any \(\theta\)-independent baseline \(b(s_t)\) from \(Q^{\pi_\theta}(s_t, a_t)\) leaves the gradient unchanged, because \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)] = b(s) \cdot 0 = 0\) by the calculation above. This is the variance-reduction tool used by virtually every practical policy-gradient algorithm. The natural choice \(b(s) = V^{\pi_\theta}(s)\) replaces \(Q^{\pi_\theta}\) with the advantage \(A^{\pi_\theta} = Q^{\pi_\theta} - V^{\pi_\theta}\).
Model-free. The derivation never needed \(\nabla_\theta \log P(s' \mid s, a)\) — the transition dynamics were eliminated in the very first step because \(P\) does not depend on \(\theta\). This is why policy gradient is the natural method for problems where the dynamics are unknown or non-differentiable.
Continuous actions. The derivation made no assumption that \(A\) is finite. Replacing the sum over actions with an integral and using a continuous density \(\pi_\theta(a \mid s)\) gives the identical result. This is one of the reasons policy gradient is preferred over Q-learning for continuous control.
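As an illustration with a continuous density, take a Gaussian policy \(\pi_\theta(a \mid s) = \mathcal{N}(a;\, \mu, \sigma^2)\), for which the score with respect to \(\mu\) is \((a - \mu)/\sigma^2\). The zero-mean property now holds with an integral in place of the sum over actions, checked here by numerical integration on a grid (an illustrative setup with arbitrary \(\mu, \sigma\)):

```python
import numpy as np

# E_{a~pi}[grad_mu log pi(a|s)] = 0 for a Gaussian policy, via a Riemann sum.
mu, sigma = 0.3, 0.8
a = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)   # grid symmetric about mu
pdf = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
score_mu = (a - mu) / sigma ** 2                         # d/dmu log N(a; mu, sigma^2)
da = a[1] - a[0]
assert abs(np.sum(pdf * score_mu) * da) < 1e-8           # integral is ~ 0
```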