Policy Gradient
Motivation
Value-based methods like Q-learning optimize a policy indirectly: learn \(Q^*\), then act greedily. Policy gradient methods skip this intermediary and optimize a parameterized policy \(\pi_\theta(a \mid s)\) directly by gradient ascent on expected return.
Direct policy optimization is the right choice when:
- The action space is continuous (so \(\arg\max_a Q(s, a)\) is intractable).
- The optimal policy is genuinely stochastic (multi-agent settings, partial observability).
- A compact policy parameterization is more natural than a value function.
- The objective involves expectations not easily expressed via Bellman equations (e.g., entropy-regularized, constraint-aware).
Modern algorithms — TRPO, PPO, SAC, DDPG — are all built on the policy gradient framework.
Setup
A parameterized stochastic policy \(\pi_\theta(a \mid s)\), differentiable in \(\theta\), acts in a Markov decision process. A trajectory is \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) with \(a_t \sim \pi_\theta(\cdot \mid s_t)\), \(s_{t+1} \sim P(\cdot \mid s_t, a_t)\). Return: \(R(\tau) = \sum_{t \ge 0} \gamma^t r_t\). Objective:
\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]. \]
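Read concretely, the objective is just an average of discounted reward sums over sampled trajectories; a minimal sketch (NumPy; `reward_sequences` is a hypothetical list of per-trajectory reward arrays):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte Carlo return R(tau) = sum_t gamma^t * r_t for one trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# J(theta) is estimated by averaging over trajectories sampled from pi_theta:
# J_hat = np.mean([discounted_return(r) for r in reward_sequences])
```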
The Policy Gradient Theorem
The gradient of \(J\) (Sutton et al. 2000) takes a remarkably clean form:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \]
Two structural points worth highlighting:
- Model-free. The transition dynamics \(P\) do not appear in the gradient — only the policy and the action-value function. The agent never needs a model.
- Score-function form. The factor \(\nabla_\theta \log \pi_\theta(a \mid s)\) is the score function of the policy. The gradient is an expectation of “how to make this action more likely” weighted by “how good was this action” (see the sketch below).
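For a tabular softmax policy the score function has a closed form, which makes the estimator easy to see in code; a sketch (NumPy; the tabular logits `theta[s]` are an illustrative parameterization, not the only choice):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def score(theta, s, a):
    """grad_{theta[s]} log pi(a|s) for a tabular softmax policy.

    With logits theta[s], log pi(a|s) = theta[s, a] - logsumexp(theta[s]),
    so the score is the one-hot of a minus the action probabilities.
    """
    grad = -softmax(theta[s])
    grad[a] += 1.0
    return grad

# One-sample term of the policy gradient at (s_t, a_t):
#   score(theta, s_t, a_t) * Q_hat(s_t, a_t)
```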
Variance Reduction with Baselines
The gradient estimator above is unbiased but high-variance. Subtract a state-dependent baseline \(b(s_t)\):
\[ \nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \bigl(Q^{\pi_\theta}(s_t, a_t) - b(s_t)\bigr)\right]. \]
Subtracting the baseline introduces no bias because \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0\) (any function of \(s\) alone integrates to zero against the score). A well-chosen baseline can drastically reduce variance without introducing bias.
The standard choice is \(b(s) = V^{\pi_\theta}(s)\), giving the advantage function \(A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)\) — the centered version of \(Q^\pi\), with mean zero under the action distribution.
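The identity behind the baseline is easy to verify numerically for the softmax policy above; a quick check (NumPy):

```python
import numpy as np

# E_{a ~ pi}[ grad log pi(a|s) ] = 0: each row of `scores` is the score of one
# action; averaging rows under pi gives probs - probs = 0, so a baseline b(s)
# contributes b(s) * 0 to the expected gradient.
rng = np.random.default_rng(0)
logits = rng.normal(size=5)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
scores = np.eye(5) - probs          # row a: grad_logits log pi(a|s)
print(probs @ scores)               # ~[0. 0. 0. 0. 0.] up to float error
```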
Estimating the Advantage
Different ways of estimating \(A^{\pi_\theta}\) give different policy-gradient algorithms:
| Estimator | Algorithm |
|---|---|
| \(G_t\) (Monte Carlo return) | REINFORCE |
| \(G_t - V_\phi(s_t)\) | REINFORCE with baseline |
| \(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\) (TD error) | actor-critic / A2C / A3C |
| \(\sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\) (GAE, with \(\delta_t\) the TD error above) | PPO and modern variants |
Higher-variance, lower-bias estimators (full Monte Carlo) sit at one extreme; lower-variance, higher-bias bootstrap estimators (TD residuals) at the other. Generalized Advantage Estimation (GAE) parameterizes the trade-off with \(\lambda \in [0, 1]\): \(\lambda = 0\) recovers the one-step TD error, \(\lambda = 1\) the Monte Carlo return minus the baseline. A sketch of the recursion follows.
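A sketch of GAE for a single finite trajectory (NumPy; assumes the value of the final state is supplied as a bootstrap term):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: r_0..r_{T-1};  values: V(s_0)..V(s_T) (one extra bootstrap value).
    lam = 0 gives the one-step TD error; lam = 1 gives the Monte Carlo
    return minus the baseline V(s_t).
    """
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD errors delta_t
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running      # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv
```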
Algorithm Skeleton
```
initialize policy parameters theta and value parameters phi
for iteration = 1, 2, ...:
    collect trajectories tau by running pi_theta in the environment
    compute advantage estimates A_hat(s_t, a_t) for each step
    update theta: theta += alpha_theta * sum_t grad_theta log pi_theta(a_t|s_t) * A_hat(s_t, a_t)
    update phi: fit V_phi to returns or to bootstrapped targets
```
This template covers REINFORCE, A2C, A3C, TRPO, PPO, and many other modern algorithms; the differences lie in (a) how the advantage is estimated, (b) how large a step in \(\theta\) is taken, and (c) how the value function \(V_\phi\) is regularized or trained.
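In an autodiff framework the \(\theta\) update is usually implemented by minimizing a surrogate loss whose gradient equals the negated estimator; a minimal PyTorch sketch (tensor names `log_probs`, `adv`, and the `optimizer` are assumptions):

```python
import torch

def policy_gradient_loss(log_probs, advantages):
    """Surrogate loss whose gradient matches the (negated) estimator above.

    log_probs:  log pi_theta(a_t|s_t) for each step, built with autograd
    advantages: A_hat(s_t, a_t); detached so gradients flow only through pi_theta
    """
    return -(log_probs * advantages.detach()).sum()

# Typical use inside the skeleton's update step:
#   loss = policy_gradient_loss(log_probs, adv)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```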
Practical Concerns
- High variance. Even with baselines, gradient estimates are noisy. Large batch sizes and careful normalization help (see the sketch after this list).
- Sample inefficiency. Pure policy gradient is on-policy: data collected under one \(\theta\) cannot be reused after \(\theta\) changes. Off-policy variants (DDPG, SAC) reuse a replay buffer by training a Q-function critic off-policy rather than importance-correcting the policy gradient.
- Step size sensitivity. Too large a step can cause catastrophic policy collapse; too small wastes data. Trust-region methods (TRPO) and clipped objectives (PPO) constrain how far \(\theta\) can move per update.
- Exploration. Policy gradient can converge to a poor local optimum if the policy becomes nearly deterministic and stops exploring. Maximum-entropy methods (SAC) add an entropy bonus to keep the policy stochastic.
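A common mitigation for the first bullet's variance problem is per-batch advantage normalization; a small sketch (NumPy):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Per-batch advantage normalization: zero mean, unit scale.

    Keeps gradient magnitudes stable across batches; eps guards against
    a zero standard deviation.
    """
    return (adv - adv.mean()) / (adv.std() + eps)
```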
On-Policy vs. Off-Policy
The policy gradient theorem is fundamentally on-policy: the expectation is over \(\tau \sim \pi_\theta\) for the current \(\theta\). Importance sampling can correct for data collected under an older policy \(\pi_{\theta_0}\); the commonly used per-timestep form below reweights each action by its ratio, while ignoring the mismatch in state visitation distributions:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_0}}\!\left[\sum_t \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_0}(a_t \mid s_t)} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right]. \]
The importance ratios make off-policy gradients high-variance and brittle; PPO keeps successive policies “close” so the ratios stay near 1, while off-policy methods (SAC, DDPG) avoid trajectory-level ratios altogether by differentiating the policy through a learned Q-function.
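PPO's notion of “close” is implemented by clipping the per-step ratio inside the surrogate objective; a sketch of the standard clipped loss (PyTorch; tensor names are assumptions):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """PPO's clipped surrogate: the ratio is capped to [1-eps, 1+eps],
    removing the incentive to move pi_theta far from the data-collecting policy."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    return -torch.min(ratio * advantages,
                      torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
```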