REINFORCE
Motivation
REINFORCE (Williams 1992) is the simplest policy gradient algorithm: a pure Monte Carlo gradient estimator that uses the actually-observed return \(G_t\) as the value signal. It is rarely the best choice in practice — its variance is high and modern methods (PPO, SAC) substantially outperform it — but it is the prototype every other policy-gradient algorithm modifies.
Problem
Given a parameterized stochastic policy \(\pi_\theta(a \mid s)\) and an MDP with unknown dynamics, find parameters \(\theta\) that locally maximize the expected return
\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \gamma^t r_t\right]. \]
The agent must learn from sampled trajectories without access to the model.
Key Ideas
Score-function gradient estimator
The policy gradient theorem (Sutton et al. 2000) gives
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\right]. \]
The factor \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\) is the score function — it points in the direction of parameter space that makes action \(a_t\) more likely at state \(s_t\). Multiplying by \(Q^{\pi_\theta}\) scales each push by how good the action turned out to be. Good actions get reinforced, bad actions discouraged.
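For intuition, the score has a simple closed form in the common linear-softmax case: it is the taken action's feature vector minus the policy-averaged feature vector, so its expectation under the policy is zero. The snippet below is an illustrative sketch; the feature matrix and helper names are invented for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()

def score_function(theta, phi, a):
    """∇_θ log π_θ(a | s) for a linear-softmax policy π_θ(a | s) ∝ exp(θ · φ(s, a)).

    phi: (num_actions, dim) feature matrix for the current state, one row per action.
    The score is φ(s, a) minus the policy-weighted average feature.
    """
    probs = softmax(phi @ theta)           # π_θ(· | s)
    return phi[a] - probs @ phi            # φ(s, a) - E_{a' ~ π}[φ(s, a')]

# sanity check: the expected score under the policy is zero (cf. the baseline argument below)
theta = np.array([0.3, -0.1])
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # hypothetical 3-action feature matrix
probs = softmax(phi @ theta)
print(sum(p * score_function(theta, phi, a) for a, p in enumerate(probs)))   # ≈ [0, 0]
```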
Replace \(Q\) with the observed return
\(Q^{\pi_\theta}\) is unknown, but Monte Carlo gives an unbiased estimate: \(\mathbb{E}[G_t \mid s_t, a_t] = Q^{\pi_\theta}(s_t, a_t)\). Substituting \(G_t = \sum_{k = t}^{T} \gamma^{k-t} r_k\) for \(Q^{\pi_\theta}\) gives the REINFORCE update:
\[ \theta \leftarrow \theta + \alpha \sum_{t = 0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t. \]
Each gradient is the score of the policy at the visited state-action pair, weighted by how good the trajectory turned out from that step onward.
Causality reduces variance for free
The use of \(G_t\) (the suffix return from \(t\)) rather than the full trajectory return \(G_0\) is justified by causality: the action \(a_t\) cannot affect rewards that were earned before it, \(r_0, \ldots, r_{t-1}\), so the score at time \(t\) multiplied by those past rewards has zero expectation. Replacing the trajectory return with the suffix-from-\(t\) return therefore keeps the gradient unbiased while reducing variance.
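In code, the suffix returns for an episode come out of a single backward pass; a minimal sketch (the helper name is arbitrary):

```python
def suffix_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... for every t in one backward pass."""
    G, out = 0.0, [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

print(suffix_returns([1.0, 0.0, 2.0], gamma=0.5))   # [1.5, 1.0, 2.0]
```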
Baselines reduce variance further, still unbiased
Subtracting a baseline \(b(s)\) from \(G_t\) leaves the gradient unbiased:
\[ \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)] = b(s) \cdot \nabla_\theta \!\left(\sum_a \pi_\theta(a \mid s)\right) = 0, \]
because the policy's probabilities must sum to \(1\). But a well-chosen \(b(s)\) can dramatically reduce variance. The state-value function \(V_\phi(s)\) is the standard, near-optimal choice (the exactly variance-minimizing baseline weights returns by the squared norm of the score, but \(V\) comes close and is far simpler to learn), and the update becomes \(\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - V_\phi(s_t))\), the advantage form. This is the bridge to actor-critic.
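Both claims are easy to check numerically on the one-step bandit from the Walkthrough below. A rough sketch, assuming action \(0\) pays reward \(1\) and action \(1\) pays \(0\): with or without the baseline the per-sample gradient has the same mean, but the baseline shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
p0 = sigmoid(theta)                              # π(a=0); action 0 pays r=1, action 1 pays r=0

a = (rng.random(100_000) > p0).astype(int)       # sampled actions (1 means action 1 was taken)
r = (a == 0).astype(float)                       # reward 1 for action 0, else 0
score = np.where(a == 0, 1.0 - p0, -p0)          # ∇_θ log π_θ(a), as derived in the Walkthrough

for b in (0.0, 0.5):                             # no baseline vs. the average-reward baseline
    g = score * (r - b)
    print(f"b={b}: mean={g.mean():+.4f}  var={g.var():.4f}")
# both means are ≈ +0.25 (the true gradient at θ=0); the b=0.5 variance is far smaller (here ~0)
```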
Algorithm
```
initialize θ
for episode = 1, 2, ...:
    sample trajectory τ = (s_0, a_0, r_0, ..., s_T, a_T, r_T) by following π_θ
    for t = T down to 0:
        G_t = r_t + γ * G_{t+1}    # G_{T+1} = 0
    for t = 0 to T:
        θ += α * ∇_θ log π_θ(a_t | s_t) * G_t
```
The update is on-policy — the trajectory must be generated by the current \(\pi_\theta\) — and episodic: returns require an episode to finish before any update can be made.
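The pseudocode translates almost line for line into working code. Below is a minimal self-contained sketch using only numpy; the environment (a four-state corridor where action \(1\) steps right toward a rewarding goal state and action \(0\) steps left) and all helper names are invented for illustration, and the policy is a tabular softmax over per-state logits.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL, H = 4, 2, 3, 20    # hypothetical 4-state corridor, horizon 20
gamma, alpha = 0.99, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))       # tabular softmax policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rollout():
    """Sample one episode of (s_t, a_t, r_t) by following the current policy."""
    s, traj = 0, []
    for _ in range(H):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == GOAL else 0.0    # reward only for reaching the goal state
        traj.append((s, a, r))
        s = s_next
        if s == GOAL:
            break
    return traj

for episode in range(500):
    traj = rollout()
    # backward pass: suffix returns G_t, with G_{T+1} = 0
    G, returns = 0.0, []
    for _, _, r in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # forward pass: θ += α ∇_θ log π_θ(a_t | s_t) G_t at every visited pair
    for (s, a, _), G_t in zip(traj, returns):
        grad_log_pi = -softmax(theta[s])      # gradient w.r.t. the logits of state s ...
        grad_log_pi[a] += 1.0                 # ... is one_hot(a) - π_θ(· | s)
        theta[s] += alpha * G_t * grad_log_pi

print(softmax(theta[0]))   # action probabilities in the start state; index 1 ("right") should dominate
```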
Walkthrough
A one-step bandit
A degenerate single-step MDP (the episode ends after one action) with two actions \(\{0, 1\}\): reward \(r = 1\) for action \(0\) and \(r = 0\) for action \(1\). Parameterize the policy as \(\pi_\theta(a=0) = \sigma(\theta)\), where \(\sigma\) is the sigmoid. Start with \(\theta_0 = 0\), so \(\pi(0) = \pi(1) = 0.5\). Learning rate \(\alpha = 0.5\).
Score function: \(\nabla_\theta \log \pi_\theta(a=0) = 1 - \sigma(\theta)\) and \(\nabla_\theta \log \pi_\theta(a=1) = -\sigma(\theta)\).
| Episode | \(a\) sampled | \(r\) | \(G_0\) | Score \(\nabla \log \pi\) | Update \(\alpha G_0 \nabla \log \pi\) | New \(\theta\) |
|---|---|---|---|---|---|---|
| 1 | \(0\) (prob 0.5) | \(1\) | \(1\) | \(1 - 0.5 = 0.5\) | \(0.5 \cdot 1 \cdot 0.5 = 0.25\) | \(0.25\) |
| 2 | \(1\) (prob \(\sigma(-0.25) \approx 0.438\)) | \(0\) | \(0\) | \(-0.562\) | \(0\) | \(0.25\) |
| 3 | \(0\) (prob \(\sigma(0.25) \approx 0.562\)) | \(1\) | \(1\) | \(1 - 0.562 = 0.438\) | \(0.219\) | \(0.469\) |
After three episodes, \(\theta = 0.469\), so \(\pi(a=0) \approx 0.615\) — the policy has shifted toward the rewarded action. Notice episode \(2\) contributes nothing: when \(G = 0\), the update is zero. Bad actions are only implicitly discouraged by the policy summing to \(1\); the explicit signal comes from rewards.
With a baseline \(b = 0.5\) (the average reward), episode \(2\)’s update would be \(\alpha (0 - 0.5)(-0.562) = +0.140\) — explicitly increasing the probability of action \(0\) because action \(1\) underperformed. This is why baselines reduce variance: they let bad outcomes contribute information too.
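The table can be reproduced in a few lines; in this sketch the sampled actions are hard-coded to the ones in the table rather than drawn at random.

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
theta, alpha = 0.0, 0.5
for a, r in [(0, 1.0), (1, 0.0), (0, 1.0)]:       # the actions and rewards from the table
    score = (1 - sigmoid(theta)) if a == 0 else -sigmoid(theta)
    theta += alpha * r * score                    # G_0 = r in a one-step episode
    print(f"a={a}  score={score:+.3f}  theta={theta:.3f}  pi(0)={sigmoid(theta):.3f}")
# last line prints theta=0.469, pi(0)=0.615, matching the table
```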
Correctness
The policy-gradient theorem (proof) shows that the expected update direction is exactly the true gradient \(\nabla_\theta J(\theta)\), so in expectation REINFORCE performs gradient ascent on the expected return. With a Robbins–Monro learning-rate schedule (\(\sum_k \alpha_k = \infty\), \(\sum_k \alpha_k^2 < \infty\)) and bounded gradient noise, the algorithm converges to a stationary point of \(J\), generally a local optimum, since \(J\) is non-convex in \(\theta\).
The baseline subtraction preserves unbiasedness because \(\mathbb{E}_a[\nabla \log \pi_\theta(a \mid s)] = 0\) for any policy that sums to \(1\) — so \(b(s)\) contributes zero to the expected update regardless of its value.
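The same identity also justifies the causality argument from Key Ideas: for \(k < t\), the reward \(r_k\) is already determined by the time \(a_t\) is drawn, so conditioning on the history up to \(s_t\) gives
\[ \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_k\right] = \mathbb{E}\!\left[ r_k \cdot \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\!\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]\right] = 0, \qquad k < t, \]
which is why the pre-\(t\) reward terms can be dropped from the trajectory return without introducing bias.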
Complexity and Tradeoffs
Each episode costs one trajectory of environment interaction plus \(O(T)\) policy gradient evaluations. Sample efficiency is poor: each gradient step requires a full fresh on-policy trajectory.
Strengths.
- Conceptually simple: one episode in, one gradient step out.
- Unbiased gradient estimator (no function-approximation bias).
- Works on discrete or continuous action spaces with no algorithmic change.
- Clean foundation on which to build (baselines, advantage functions, importance sampling, trust regions).
Weaknesses.
- High variance — especially in long episodes.
- Strictly episodic — does not handle continuing problems without modification.
- On-policy and Monte Carlo: each gradient costs one full episode of fresh data.
- No mechanism to control step size; large updates can collapse the policy.
When to Use It
| Situation | Approach |
|---|---|
| Pedagogical / prototype RL setup | REINFORCE — simple and instructive. |
| Want lower variance, still on-policy | Add a baseline / advantage; promote to actor-critic. |
| Need controlled step size | PPO / TRPO — see Variants. |
| Off-policy data available | Off-policy actor-critic (IMPALA, V-trace) or value-based methods like Q-learning. |
| Continuous actions, deterministic policy | DDPG, SAC, TD3 (deterministic policy gradient family). |
| Discrete actions, small state space, model known | Policy iteration or value iteration. |
Variants
- REINFORCE with baseline. \(\nabla \log \pi \cdot (G_t - b(s_t))\). Same expected gradient, lower variance. Common baselines: running-average return, or a learned value function \(V_\phi(s)\); a sketch of this variant follows the list.
- Actor-critic. Train a value function \(V_\phi\) alongside the actor and replace \(G_t - V_\phi(s_t)\) with a TD-style advantage estimate: \(A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\). Lower variance, with a bias-variance dial (proof) via \(n\)-step or GAE.
- Trust region / clipping (TRPO, PPO). Constrain each update to keep the new policy close to the old, preventing destructive large updates that REINFORCE allows.
- Off-policy correction (V-trace, IMPALA). Importance-sampling weights let REINFORCE-like updates use trajectories from older policies, enabling replay buffers.
- Reparameterization (DDPG, SAC). For continuous actions, replace the score-function gradient with a pathwise gradient through a deterministic or stochastic reparameterized actor — much lower variance when applicable.
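As a rough sketch of the first variant above, a per-episode update for REINFORCE with a learned linear state-value baseline might look like the following; the trajectory format and the `policy_grad_log` hook are placeholders invented for the example, not an established API.

```python
import numpy as np

def reinforce_with_baseline_update(theta, w, trajectory, policy_grad_log,
                                   gamma=0.99, alpha_pi=0.01, alpha_v=0.1):
    """One episode's worth of updates for REINFORCE with a learned linear baseline.

    trajectory: list of (phi_s, a, r) tuples, where phi_s is a state feature vector.
    policy_grad_log(theta, phi_s, a) is assumed to return ∇_θ log π_θ(a | s).
    theta and w are numpy arrays, updated in place and returned.
    """
    G = 0.0
    for phi_s, a, r in reversed(trajectory):
        G = r + gamma * G                        # suffix return G_t
        advantage = G - w @ phi_s                # G_t - V_w(s_t), the centred value signal
        w += alpha_v * advantage * phi_s         # regress the baseline toward observed returns
        theta += alpha_pi * advantage * policy_grad_log(theta, phi_s, a)
    return theta, w
```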
The bare REINFORCE update remains conceptually central: every other policy-gradient algorithm starts from \(\nabla_\theta \log \pi_\theta(a \mid s) \cdot \text{(value signal)}\) and improves the value signal, the variance, or the step.