Temporal Difference Learning
Motivation
In a Markov decision process with an unknown model, the agent cannot compute \(V^\pi\) by dynamic programming: policy evaluation and value iteration both require the transition kernel \(P\) and the reward function \(R\). The agent can, however, interact with the environment and observe transitions \((s, a, r, s')\). Temporal difference (TD) learning (Sutton and Barto 2018) estimates \(V^\pi\) from this stream of samples by combining two ideas: sampling, like Monte Carlo methods, and bootstrapping, like dynamic programming.
TD learning is the conceptual core of model-free reinforcement learning; Q-learning, SARSA, and actor-critic methods are all built on the same TD update.
Problem
Given a fixed policy \(\pi\) in an MDP with unknown dynamics, estimate \(V^\pi(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s,\, \pi\right]\) from a stream of observed transitions \(s_0, a_0, r_0, s_1, a_1, r_1, \ldots\) generated by following \(\pi\).
Key Ideas
TD error: sample one transition, bootstrap the rest
The expectation defining \(V^\pi(s)\) has a recursive Bellman form:
\[ V^\pi(s) = \mathbb{E}_\pi[r + \gamma V^\pi(s') \mid s]. \]
After observing one transition \((s_t, r_t, s_{t+1})\), the random quantity \(r_t + \gamma V(s_{t+1})\) is a one-sample estimate of the right-hand side; it is an unbiased estimate of \(V^\pi(s_t)\) exactly when \(V = V^\pi\). The TD error
\[ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
measures the discrepancy between the current estimate and the bootstrap target. Moving \(V(s_t)\) in the direction of \(\delta_t\) shrinks the gap.
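As a sketch, the whole idea fits in a few lines of Python (the dictionary-based value table and the parameter defaults here are illustrative assumptions, not from the original text):

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update on a tabular value estimate.

    V: dict mapping states to value estimates; terminal states are
    assumed to map to 0 (a convention, not from the text).
    """
    delta = r + gamma * V[s_next] - V[s]  # TD error: target minus current estimate
    V[s] += alpha * delta                 # shrink the gap
    return delta
```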
Sampling + bootstrapping
TD sits between two methods that each give up one of these features.
- Monte Carlo waits for the entire return \(G_t\), never bootstraps. Unbiased but high variance, and only usable after the episode ends.
- Dynamic programming uses the full expectation \(\mathbb{E}[r + \gamma V(s')]\), never samples. Needs a model.
TD samples and bootstraps: it uses one transition’s reward plus the current estimate of \(V(s_{t+1})\). Biased while \(V\) is wrong, but lower variance than MC and applicable on every single step — including non-episodic continuing tasks.
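A minimal sketch of the two targets, assuming a recorded list of per-step rewards and a value table (names illustrative):

```python
def mc_target(rewards, t, gamma):
    """Monte Carlo target: the full discounted return G_t.
    Only computable after the episode ends."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

def td_target(r, s_next, V, gamma):
    """TD(0) target: one sampled reward plus the current estimate V(s').
    Computable immediately after every step, episodic or not."""
    return r + gamma * V[s_next]
```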
Stochastic approximation of \(\mathcal{T}^\pi\)
The expectation of the TD target is exactly the Bellman expectation operator \(\mathcal{T}^\pi\) applied to \(V\):
\[ \mathbb{E}[r + \gamma V(s') \mid s] = (\mathcal{T}^\pi V)(s). \]
So TD(0) is a stochastic approximation of the fixed-point iteration \(V \leftarrow \mathcal{T}^\pi V\), with the expectation replaced by a single sample. The fixed point is \(V^\pi\).
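When the model is known, this identity can be checked numerically. In the sketch below, the `(probability, reward, next_state)` model encoding is an assumption for illustration; the sample average of TD targets approaches the exact operator value:

```python
import random

def bellman_op(V, s, model, gamma):
    """Exact (T^pi V)(s): model[s] lists (probability, reward, next_state)
    triples describing the transition distribution under pi."""
    return sum(p * (r + gamma * V[s2]) for p, r, s2 in model[s])

def mean_td_target(V, s, model, gamma, n=100_000):
    """Average of n sampled TD targets r + gamma*V(s'); by the law of
    large numbers this converges to bellman_op(V, s, model, gamma)."""
    probs = [p for p, _, _ in model[s]]
    outcomes = [(r, s2) for _, r, s2 in model[s]]
    total = 0.0
    for _ in range(n):
        r, s2 = random.choices(outcomes, weights=probs)[0]
        total += r + gamma * V[s2]
    return total / n
```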
Algorithm
```
initialize V(s) arbitrarily
for each episode:
    s = initial state
    while s is not terminal:
        a = π(s)
        take action a; observe r, s'
        V(s) += α * (r + γ * V(s') - V(s))
        s = s'
```
The key step is the TD update \(V(s) \leftarrow V(s) + \alpha \delta\), where \(\delta = r + \gamma V(s') - V(s)\) is the TD error.
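A runnable version of the same loop, under an assumed minimal interface (`env.reset() -> s`, `env.step(a) -> (r, s_next, done)`, and `policy` a callable mapping states to actions; this interface is hypothetical, not part of the original text):

```python
from collections import defaultdict

def td0(env, policy, alpha=0.1, gamma=1.0, episodes=1000):
    """Tabular TD(0) policy evaluation. States must be hashable."""
    V = defaultdict(float)  # arbitrary initialization: V(s) = 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            r, s_next, done = env.step(policy(s))
            target = r if done else r + gamma * V[s_next]  # terminal value is 0
            V[s] += alpha * (target - V[s])  # TD(0) update
            s = s_next
    return V
```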
Walkthrough
Random walk on five states
A classic toy example: states \(A, B, C, D, E\) in a line, with terminal states left of \(A\) (reward \(0\)) and right of \(E\) (reward \(1\)). The agent takes random left/right steps. The true value function under this random policy is \(V^\pi = (1/6, 2/6, 3/6, 4/6, 5/6)\).
Initialize \(V \equiv 0.5\) for all non-terminal states. Use \(\alpha = 0.1\), \(\gamma = 1\).
Suppose the first episode trajectory is \(C \to B \to A \to \text{terminal-left}\) with rewards \(0, 0, 0\).
| Step | \(s\) | \(r\) | \(s'\) | \(V(s')\) | TD error \(\delta\) | New \(V(s)\) |
|---|---|---|---|---|---|---|
| 1 | C | 0 | B | \(0.5\) | \(0 + 1 \cdot 0.5 - 0.5 = 0\) | \(0.5 + 0.1 \cdot 0 = 0.5\) |
| 2 | B | 0 | A | \(0.5\) | \(0\) | \(0.5\) |
| 3 | A | 0 | terminal-left | \(0\) | \(0 + 1 \cdot 0 - 0.5 = -0.5\) | \(0.5 + 0.1 \cdot (-0.5) = 0.45\) |
Only the last step in this episode produced a non-zero update. Monte Carlo would have updated every state on the trajectory by the same \(G - V\) amount; TD updates only \(A\), but the next episode that visits \(B\) will see \(V(A) = 0.45\) as its target — so the information propagates backward one state per episode. Over many episodes, the values converge to \(V^\pi\).
This is what “bootstrap” means in practice: TD’s first informative update happens only when the agent reaches a terminal state (or a state whose estimate already carries information), and that estimate then trickles back through subsequent visits.
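The convergence claim is easy to check with a short simulation of this exact chain (a sketch; the integer state encoding, and starting every episode in \(C\) as in the walkthrough trajectory, are my assumptions):

```python
import random

def random_walk_td0(episodes=2000, alpha=0.1, gamma=1.0):
    """TD(0) on the five-state random walk. States 1..5 encode A..E;
    0 and 6 are the left and right terminals. Reward is 1 only on
    entering the right terminal, else 0."""
    V = [0.0] + [0.5] * 5 + [0.0]  # terminals fixed at 0, others at 0.5
    for _ in range(episodes):
        s = 3  # start in C, as in the walkthrough
        while s not in (0, 6):
            s_next = s + random.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:6]

print(random_walk_td0())  # hovers near [1/6, 2/6, 3/6, 4/6, 5/6]
```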
Correctness
In the tabular case, TD(0) converges to \(V^\pi\) with probability 1 if every state is visited infinitely often and the learning rate \(\alpha_t\) satisfies the Robbins–Monro conditions \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\). The argument is the same one used for Q-learning convergence: TD(0) is a noisy fixed-point iteration of a \(\gamma\)-contraction.
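For example, the schedule \(\alpha_t = 1/t\) satisfies both conditions:
\[ \sum_{t \ge 1} \frac{1}{t} = \infty, \qquad \sum_{t \ge 1} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty. \]
A constant step size satisfies the first condition but not the second, so the iterates do not converge pointwise; they fluctuate in a neighborhood of \(V^\pi\) whose size scales with \(\alpha\).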
With function approximation (linear or neural), convergence guarantees become much more delicate. TD with linear function approximation converges only under on-policy sampling; divergence under off-policy or non-linear approximation is well-documented (Baird’s counterexample, the “deadly triad” of function approximation + bootstrapping + off-policy).
Complexity and Tradeoffs
Each TD(0) update costs \(O(1)\) operations in the tabular setting. Memory is one entry per state.
The bias-variance trade-off is fundamental:
| Method | Update target | Needs model? | Update timing | Bias | Variance |
|---|---|---|---|---|---|
| Dynamic programming | \(\mathbb{E}_\pi[r + \gamma V(s')]\) exactly | yes | full sweep | none | none |
| Monte Carlo | \(G_t\) (full return) | no | end of episode | none | high |
| TD(0) | \(r + \gamma V(s')\) (one-step bootstrap) | no | every step | yes (while \(V\) is wrong) | low |
TD’s lower variance per update usually translates to faster convergence in practice, even though Monte Carlo is unbiased. The bias is bootstrap bias: the target depends on the current \(V\), which is initially wrong.
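The variance gap is easy to exhibit empirically. The sketch below (reusing the random-walk encoding from the walkthrough code; all names are mine) compares the spread of the two targets from state \(C\) under a fixed value table:

```python
import random, statistics

def first_step_and_return():
    """One random-walk episode from C (state 3). Returns the first
    reward, the first successor state, and the total return
    (gamma = 1, so the return equals the final reward)."""
    s, ret, first = 3, 0.0, None
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0
        if first is None:
            first = (r, s_next)
        ret += r
        s = s_next
    return first[0], first[1], ret

V = [0.0] + [0.5] * 5 + [0.0]  # a fixed, imperfect value table
samples = [first_step_and_return() for _ in range(10_000)]
mc = [g for _, _, g in samples]           # Monte Carlo targets G
td = [r + V[s2] for r, s2, _ in samples]  # TD(0) targets r + V(s')
print(statistics.variance(mc), statistics.variance(td))
# The MC target is a coin flip (variance about 0.25); the TD target is
# constant here, because both neighbors of C share the estimate 0.5.
```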
When to Use It
| Situation | Approach |
|---|---|
| Estimate \(V^\pi\) for a fixed policy from samples | TD(0). |
| Want to control (improve policy), on-policy data | SARSA — see Variants. |
| Want to control, off-policy data | Q-learning. |
| Bias more harmful than variance (unbiased estimates matter more than per-step updates) | Monte Carlo or n-step TD with large \(n\). |
| Continuing (non-episodic) task | TD — Monte Carlo not applicable. |
| Model known | Use value iteration or policy iteration instead. |
Variants
- n-step TD. Use the target \(r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})\), interpolating between TD(0) (\(n = 1\)) and Monte Carlo (\(n = \infty\)).
- TD(\(\lambda\)) and eligibility traces. A geometrically-weighted average of \(n\)-step targets, parameterized by \(\lambda \in [0, 1]\). Eligibility traces give an efficient online implementation that updates all visited states in proportion to how recently they were visited.
- SARSA (on-policy control). Replace \(V\) with the action-value \(Q\): \(Q(s, a) \leftarrow Q(s, a) + \alpha[r + \gamma Q(s', a') - Q(s, a)]\), where \(a'\) is the action actually taken in \(s'\) under \(\pi\). Learns \(Q^\pi\) for the policy currently being followed; a sketch contrasting the three control updates follows this list.
- Q-learning (off-policy control). Replace \(Q(s', a')\) with \(\max_{a'} Q(s', a')\) to learn \(Q^*\) regardless of the behavior policy.
- Expected SARSA. Use \(\mathbb{E}_{a' \sim \pi}[Q(s', a')]\) instead of a sampled \(Q(s', a')\). Lower variance than SARSA at the cost of needing to evaluate the policy distribution.
- Actor-critic. Use TD to learn a value function (the critic) and use the TD error to update a separately parameterized policy (the actor). Combines TD’s low variance with policy-gradient flexibility.
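The three control updates differ only in their bootstrap term, which the following side-by-side sketch makes explicit (tabular \(Q\) keyed by \((s, a)\) pairs; the helper names are illustrative):

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: bootstrap from a2, the action actually taken in s2."""
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in s2, regardless
    of which action the behavior policy will actually take."""
    Q[s, a] += alpha * (r + gamma * max(Q[s2, b] for b in actions) - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s2, pi, actions, alpha, gamma):
    """Bootstrap from the expectation of Q under pi(. | s2): removes the
    sampling noise in a' at the cost of evaluating pi's distribution."""
    expected_q = sum(pi(s2, b) * Q[s2, b] for b in actions)
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```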