REINFORCE
Motivation
REINFORCE (Williams 1992) is the simplest policy gradient algorithm: a pure Monte Carlo gradient estimator that uses the actually-observed return \(G_t\) as the value signal. It is rarely the best choice in practice — its variance is high and modern methods (PPO, SAC) substantially outperform it — but it is the prototype every other policy-gradient algorithm modifies.
Problem
Given a parameterized stochastic policy \(\pi_\theta(a \mid s)\) and an MDP with unknown dynamics, find parameters \(\theta\) that locally maximize the expected return
\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \gamma^t r_t\right]. \]
The agent must learn from sampled trajectories without access to the model.
Key Ideas
Score-function gradient estimator
The policy gradient theorem (Sutton et al. 2000) gives
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \]
The factor \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) is the score function — it points in the direction of parameter space that makes action \(a_t\) more likely at state \(s_t\). Multiplying by \(Q^{\pi_\theta}\) scales each push by how good the action turned out to be. Good actions get reinforced, bad actions discouraged.
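For intuition, the score has a simple closed form in the common linear-softmax case: it is the taken action's feature vector minus the policy-averaged feature vector, so its expectation under the policy is zero. The snippet below is an illustrative sketch; the feature matrix and helper names are invented for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()

def score_function(theta, phi, a):
    """∇_θ log π_θ(a | s) for a linear-softmax policy π_θ(a | s) ∝ exp(θ · φ(s, a)).

    phi: (num_actions, dim) feature matrix for the current state, one row per action.
    The score is φ(s, a) minus the policy-weighted average feature.
    """
    probs = softmax(phi @ theta)           # π_θ(· | s)
    return phi[a] - probs @ phi            # φ(s, a) - E_{a' ~ π}[φ(s, a')]

# sanity check: the expected score under the policy is zero (cf. the baseline argument below)
theta = np.array([0.3, -0.1])
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # hypothetical 3-action feature matrix
probs = softmax(phi @ theta)
print(sum(p * score_function(theta, phi, a) for a, p in enumerate(probs)))   # ≈ [0, 0]
```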
Replace \(Q\) with the observed return
\(Q^{\pi_\theta}\) is unknown, but Monte Carlo gives an unbiased estimate: \(\mathbb{E}[G_t \mid s_t, a_t] = Q^{\pi_\theta}(s_t, a_t)\). Substituting \(G_t = \sum_{k = t}^{T} \gamma^{k-t} r_k\) for \(Q^{\pi_\theta}\) gives the REINFORCE update:
\[ \theta \leftarrow \theta + \alpha \sum_{t = 0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t. \]
Each gradient is the score of the policy at the visited state-action pair, weighted by how good the trajectory turned out from that step onward.
Causality reduces variance for free
The use of \(G_t\) (the suffix return from \(t\)) rather than the full trajectory return \(G_0\) is justified by causality: the action \(a_t\) cannot affect rewards that were earned before it, \(r_0, \ldots, r_{t-1}\), so the score at time \(t\) multiplied by those past rewards has zero expectation. Replacing the trajectory return with the suffix-from-\(t\) return therefore keeps the gradient unbiased while reducing variance.
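In code, the suffix returns for an episode come out of a single backward pass; a minimal sketch (the helper name is arbitrary):

```python
def suffix_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... for every t in one backward pass."""
    G, out = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

print(suffix_returns([1.0, 0.0, 2.0], gamma=0.5))   # [1.5, 1.0, 2.0]
```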
Baselines reduce variance further, still unbiased
Subtracting a baseline \(b(s)\) from \(G_t\) leaves the gradient unbiased:
\[ \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)] = b(s) \cdot \nabla_\theta \!\left(\sum_a \pi_\theta(a \mid s)\right) = 0, \]
because the policy's probabilities must sum to \(1\). But a well-chosen \(b(s)\) can dramatically reduce variance. The state-value function \(V_\phi(s)\) is the standard, near-optimal choice (the exactly variance-minimizing baseline weights returns by the squared norm of the score, but \(V\) comes close and is far simpler to learn), and the update becomes \(\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - V_\phi(s_t))\), the advantage form. This is the bridge to actor-critic.
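Both claims are easy to check numerically on the one-step bandit from the Walkthrough below. A rough sketch, assuming action \(0\) pays reward \(1\) and action \(1\) pays \(0\): with or without the baseline the per-sample gradient has the same mean, but the baseline shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
p0 = sigmoid(theta)                              # π(a=0); action 0 pays r=1, action 1 pays r=0

a = (rng.random(100_000) > p0).astype(int)       # sampled actions (1 means action 1 was taken)
r = (a == 0).astype(float)                       # reward 1 for action 0, else 0
score = np.where(a == 0, 1.0 - p0, -p0)          # ∇_θ log π_θ(a), as derived in the Walkthrough

for b in (0.0, 0.5):                             # no baseline vs. the average-reward baseline
    g = score * (r - b)
    print(f"b={b}: mean={g.mean():+.4f}  var={g.var():.4f}")
# both means are ≈ +0.25 (the true gradient at θ=0); the b=0.5 variance is far smaller (here ~0)
```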
Algorithm
```
initialize θ
for episode = 1, 2, ...:
    sample trajectory τ = (s_0, a_0, r_0, ..., s_T, a_T, r_T) by following π_θ
    for t = T down to 0:
        G_t = r_t + γ * G_{t+1}    # G_{T+1} = 0
    for t = 0 to T:
        θ += α * ∇_θ log π_θ(a_t | s_t) * G_t
```
The update is on-policy — the trajectory must be generated by the current \(\pi_\theta\) — and episodic: returns require an episode to finish before any update can be made.
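The pseudocode translates almost line for line into working code. Below is a minimal self-contained sketch using only numpy; the environment (a four-state corridor where action \(1\) steps right toward a rewarding goal state and action \(0\) steps left) and all helper names are invented for illustration, and the policy is a tabular softmax over per-state logits.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL, H = 4, 2, 3, 20    # hypothetical 4-state corridor, horizon 20
gamma, alpha = 0.99, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))       # tabular softmax policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout():
    """Sample one episode of (s_t, a_t, r_t) by following the current policy."""
    s, traj = 0, []
    for _ in range(H):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == GOAL else 0.0    # reward only for reaching the goal state
        traj.append((s, a, r))
        s = s_next
        if s == GOAL:
            break
    return traj

for episode in range(500):
    traj = rollout()
    # backward pass: suffix returns G_t, with G_{T+1} = 0
    G, returns = 0.0, []
    for _, _, r in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # forward pass: θ += α ∇_θ log π_θ(a_t | s_t) G_t at every visited pair
    for (s, a, _), G_t in zip(traj, returns):
        grad_log_pi = -softmax(theta[s])      # gradient w.r.t. the logits of state s ...
        grad_log_pi[a] += 1.0                 # ... is one_hot(a) - π_θ(· | s)
        theta[s] += alpha * G_t * grad_log_pi

print(softmax(theta[0]))   # action probabilities in the start state; index 1 ("right") should dominate
```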
Walkthrough
A one-step bandit
A degenerate single-step MDP (the episode ends after one action) with two actions \(\{0, 1\}\): reward \(r = 1\) for action \(0\) and \(r = 0\) for action \(1\). Parameterize the policy as \(\pi_\theta(a=0) = \sigma(\theta)\), where \(\sigma\) is the sigmoid. Start with \(\theta_0 = 0\), so \(\pi(0) = \pi(1) = 0.5\). Learning rate \(\alpha = 0.5\).
Score function: \(\nabla_\theta \log \pi_\theta(a=0) = 1 - \sigma(\theta)\) and \(\nabla_\theta \log \pi_\theta(a=1) = -\sigma(\theta)\).
| Episode | \(a\) sampled | \(r\) | \(G_0\) | Score \(\nabla \log \pi\) | Update \(\alpha G_0 \nabla \log \pi\) | New \(\theta\) |
|---|---|---|---|---|---|---|
| 1 | \(0\) (prob 0.5) | \(1\) | \(1\) | \(1 - 0.5 = 0.5\) | \(0.5 \cdot 1 \cdot 0.5 = 0.25\) | \(0.25\) |
| 2 | \(1\) (prob \(\sigma(-0.25) \approx 0.438\)) | \(0\) | \(0\) | \(-0.562\) | \(0\) | \(0.25\) |
| 3 | \(0\) (prob \(\sigma(0.25) \approx 0.562\)) | \(1\) | \(1\) | \(1 - 0.562 = 0.438\) | \(0.219\) | \(0.469\) |
After three episodes, \(\theta = 0.469\), so \(\pi(a=0) \approx 0.615\) — the policy has shifted toward the rewarded action. Notice episode \(2\) contributes nothing: when \(G = 0\), the update is zero. Bad actions are only implicitly discouraged by the policy summing to \(1\); the explicit signal comes from rewards.
With a baseline \(b = 0.5\) (the average reward), episode \(2\)’s update would be \(\alpha (0 - 0.5)(-0.562) = +0.140\) — explicitly increasing the probability of action \(0\) because action \(1\) underperformed. This is why baselines reduce variance: they let bad outcomes contribute information too.
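The table can be reproduced in a few lines; in this sketch the sampled actions are hard-coded to the ones in the table rather than drawn at random.

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
theta, alpha = 0.0, 0.5
for a, r in [(0, 1.0), (1, 0.0), (0, 1.0)]:       # the actions and rewards from the table
    score = (1 - sigmoid(theta)) if a == 0 else -sigmoid(theta)
    theta += alpha * r * score                    # G_0 = r in a one-step episode
    print(f"a={a}  score={score:+.3f}  theta={theta:.3f}  pi(0)={sigmoid(theta):.3f}")
# last line prints theta=0.469, pi(0)=0.615, matching the table
```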
Correctness
The policy-gradient theorem (proof) shows that the expected update direction is exactly the true gradient \(\nabla_\theta J(\theta)\), so in expectation REINFORCE performs gradient ascent on the expected return. With a Robbins–Monro learning-rate schedule (\(\sum_k \alpha_k = \infty\), \(\sum_k \alpha_k^2 < \infty\)) and bounded gradient noise, the algorithm converges to a stationary point of \(J\), generally a local optimum, since \(J\) is non-convex in \(\theta\).
The baseline subtraction preserves unbiasedness because \(\mathbb{E}_a[\nabla \log \pi_\theta(a \mid s)] = 0\) for any policy that sums to \(1\) — so \(b(s)\) contributes zero to the expected update regardless of its value.
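The same identity also justifies the causality argument from Key Ideas: for \(k < t\), the reward \(r_k\) is already determined by the time \(a_t\) is drawn, so conditioning on the history up to \(s_t\) gives
\[ \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_k\right] = \mathbb{E}\!\left[ r_k \cdot \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]\right] = 0, \qquad k < t, \]
which is why the pre-\(t\) reward terms can be dropped from the trajectory return without introducing bias.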
Complexity and Tradeoffs
Each episode costs one trajectory of environment interaction plus \(O(T)\) policy gradient evaluations. Sample efficiency is poor: each gradient step requires a full fresh on-policy trajectory.
Strengths.
- Conceptually simple: one episode in, one gradient step out.
- Unbiased gradient estimator (no function-approximation bias).
- Works on discrete or continuous action spaces with no algorithmic change.
- Clean foundation on which to build (baselines, advantage functions, importance sampling, trust regions).
Weaknesses.
- High variance — especially in long episodes.
- Strictly episodic — does not handle continuing problems without modification.
- On-policy and Monte Carlo: each gradient costs one full episode of fresh data.
- No mechanism to control step size; large updates can collapse the policy.
When to Use It
| Situation | Approach |
|---|---|
| Pedagogical / prototype RL setup | REINFORCE — simple and instructive. |
| Want lower variance, still on-policy | Add a baseline / advantage; promote to actor-critic. |
| Need controlled step size | PPO / TRPO — see Variants. |
| Off-policy data available | Off-policy actor-critic (IMPALA, V-trace) or value-based methods like Q-learning. |
| Continuous actions, deterministic policy | DDPG, SAC, TD3 (deterministic policy gradient family). |
| Discrete actions, small state space, model known | Policy iteration or value iteration. |
Variants
- REINFORCE with baseline. \(\nabla \log \pi \cdot (G_t - b(s_t))\). Same expected gradient, lower variance. Common baselines: running-average return, or a learned value function \(V_\phi(s)\); a sketch of this variant follows the list.
- Actor-critic. Train a value function \(V_\phi\) alongside the actor and replace \(G_t - V_\phi(s_t)\) with a TD-style advantage estimate: \(A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\). Lower variance, with a bias-variance dial (proof) via \(n\)-step or GAE.
- Trust region / clipping (TRPO, PPO). Constrain each update to keep the new policy close to the old, preventing destructive large updates that REINFORCE allows.
- Off-policy correction (V-trace, IMPALA). Importance-sampling weights let REINFORCE-like updates use trajectories from older policies, enabling replay buffers.
- Reparameterization (DDPG, SAC). For continuous actions, replace the score-function gradient with a pathwise gradient through a deterministic or stochastic reparameterized actor — much lower variance when applicable.
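As a rough sketch of the first variant above, a per-episode update for REINFORCE with a learned linear state-value baseline might look like the following; the trajectory format and the `policy_grad_log` hook are placeholders invented for the example, not an established API.

```python
import numpy as np

def reinforce_with_baseline_update(theta, w, trajectory, policy_grad_log,
                                   gamma=0.99, alpha_pi=0.01, alpha_v=0.1):
    """One episode's worth of updates for REINFORCE with a learned linear baseline.

    trajectory: list of (phi_s, a, r) tuples, where phi_s is a state feature vector.
    policy_grad_log(theta, phi_s, a) is assumed to return ∇_θ log π_θ(a | s).
    theta and w are numpy arrays, updated in place and returned.
    """
    G = 0.0
    for phi_s, a, r in reversed(trajectory):
        G = r + gamma * G                        # suffix return G_t
        advantage = G - w @ phi_s                # G_t - V_w(s_t), the centred value signal
        w += alpha_v * advantage * phi_s         # regress the baseline toward observed returns
        theta += alpha_pi * advantage * policy_grad_log(theta, phi_s, a)
    return theta, w
```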
The bare REINFORCE update remains conceptually central: every other policy-gradient algorithm starts from \(\nabla_\theta \log \pi_\theta(a \mid s) \cdot \text{(value signal)}\) and improves the value signal, the variance, or the step.