Q-Learning

Motivation

Value iteration and policy iteration require a known model — they sweep over states using \(P\) and \(R\) to compute expectations. Q-learning (Watkins and Dayan 1992) drops that requirement: it learns the optimal action-value function \(Q^*\) directly from sampled transitions, without ever building a model and without ever following the optimal policy.

Q-learning is the prototypical off-policy reinforcement learning algorithm. The agent can act according to any exploratory behavior policy and still converge to \(Q^*\), because the update target uses the greedy action rather than the action actually taken.

Problem

A finite MDP with unknown \(P\) and \(R\). The agent observes transitions \((s, a, r, s')\) generated by following some behavior policy \(\mu\) that visits every \((s, a)\) pair infinitely often. The goal is to estimate \(Q^*(s, a) = \max_\pi Q^\pi(s, a)\), from which the optimal policy follows as \(\pi^*(s) = \arg\max_a Q^*(s, a)\).

Key Ideas

Stochastic approximation of the Bellman operator

After each observed transition \((s_t, a_t, r_t, s_{t+1})\), the Q-learning update

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\!\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right] \]

pulls \(Q(s_t, a_t)\) toward a sample of the Bellman optimality target. The expectation of that target is exactly the Bellman optimality operator \(\mathcal{T}^*\) applied to \(Q\):

\[ (\mathcal{T}^* Q)(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q(s', a') \,\middle|\, s, a\right]. \]

So Q-learning is a stochastic approximation of \(Q \leftarrow \mathcal{T}^* Q\), using one sample per update. The fixed point is \(Q^*\).

Off-policy: learning the greedy policy while behaving exploratorily

The update target uses \(\max_{a'} Q(s', a')\), the value under the greedy policy, regardless of which action the agent actually takes next. This decouples learning from behavior: the agent can explore aggressively (e.g., \(\varepsilon\)-greedy) while still learning about the greedy policy.

The on-policy alternative, SARSA, replaces \(\max_{a'} Q(s', a')\) with \(Q(s', a')\) where \(a' \sim \mu\). SARSA learns \(Q^\mu\) and tracks the behavior policy; Q-learning learns \(Q^*\) regardless of \(\mu\). The off-policy property is what makes Q-learning compatible with replay buffers, batch RL, and learning from logged data.
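
The contrast is visible in a single transition. A minimal sketch of the two targets, with an illustrative Q table (the numbers are made up for the example):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative Q table: rows = states, columns = actions.
Q = np.array([[0.5, 0.0],
              [0.0, 0.2]])
gamma, r, s_next = 0.9, 0.0, 0

# Q-learning: bootstrap from the greedy action at s',
# no matter what the agent does next.
q_learning_target = r + gamma * Q[s_next].max()     # 0.45

# SARSA: bootstrap from the action the behavior policy
# actually takes at s' (here, uniform random).
a_next = int(rng.integers(Q.shape[1]))
sarsa_target = r + gamma * Q[s_next, a_next]        # 0.45 or 0.0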

Exploration is required, not optional

Q-learning’s convergence requires every \((s, a)\) to be visited infinitely often. A purely greedy policy quickly stops trying suboptimal-looking actions, so an explicit exploration mechanism (\(\varepsilon\)-greedy, Boltzmann, UCB) is necessary. The exploration–exploitation tradeoff is built into the algorithm.
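
Minimal sketches of two standard exploration mechanisms; the function names and signatures are illustrative, and each takes the row of the Q table for the current state:

import numpy as np

def epsilon_greedy(q_row, eps, rng):
    # With probability eps act uniformly at random, else act greedily.
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def boltzmann(q_row, temperature, rng):
    # Sample actions with probability proportional to exp(Q / temperature).
    logits = np.asarray(q_row) / temperature
    p = np.exp(logits - logits.max())   # shift by max for numerical stability
    return int(rng.choice(len(q_row), p=p / p.sum()))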

Algorithm

initialize Q(s, a) arbitrarily; Q(terminal, ·) = 0
for each episode:
    s = initial state
    while s is not terminal:
        a = behavior_policy(s)              # e.g., epsilon-greedy w.r.t. Q
        take action a; observe r, s'
        Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
        s = s'

Two design choices control behavior:

  • Behavior policy. Must visit every \((s, a)\) pair infinitely often. \(\varepsilon\)-greedy with respect to \(Q\) is the standard choice; the \(\varepsilon\) provides exploration, the greedy part provides exploitation.
  • Learning rate \(\alpha\). Must satisfy the Robbins–Monro conditions: \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\). Constant \(\alpha\) is common in practice; it violates the second condition, so \(Q\) fluctuates around \(Q^*\) rather than converging exactly, but that tracking behavior is useful in nonstationary or function-approximation settings.
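
Putting the pieces together, a runnable sketch of the full loop. The environment interface here (n_states, n_actions, reset() returning a state, step(a) returning (next_state, reward, done)) is an assumption for illustration, not a standard API:

import numpy as np

def q_learning(env, episodes, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    # Tabular Q-learning with an epsilon-greedy behavior policy.
    # Assumes env.reset() -> s and env.step(a) -> (s_next, r, done).
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))   # Q(terminal, .) stays 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q.
            if rng.random() < eps:
                a = int(rng.integers(env.n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target: bootstrap from the greedy action at s';
            # a terminal s' contributes no future value.
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q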

Walkthrough

Two-state MDP, one episode

Use the MDP from value iteration: \(S = \{0, 1\}\), \(A = \{\text{stay}, \text{switch}\}\), \(\gamma = 0.9\), and rewards giving optimal \(Q^*(0, \text{stay}) = 10\), \(Q^*(1, \text{switch}) = 9\).

Initialize \(Q \equiv 0\). Use \(\alpha = 0.5\) and behavior policy “uniform random” for clarity.

| Step | \(s\) | \(a\) | \(r\) | \(s'\) | \(\max_{a'} Q(s', a')\) | Target | New \(Q(s, a)\) |
|------|-------|-------|-------|--------|--------------------------|--------|------------------|
| 1 | 0 | stay | 1 | 0 | \(0\) | \(1 + 0.9 \cdot 0 = 1\) | \(0 + 0.5(1 - 0) = 0.5\) |
| 2 | 0 | switch | 0 | 1 | \(0\) | \(0 + 0.9 \cdot 0 = 0\) | \(0\) |
| 3 | 1 | switch | 0 | 0 | \(0.5\) | \(0 + 0.9 \cdot 0.5 = 0.45\) | \(0 + 0.5(0.45 - 0) = 0.225\) |
| 4 | 0 | stay | 1 | 0 | \(0.5\) | \(1 + 0.9 \cdot 0.5 = 1.45\) | \(0.5 + 0.5(1.45 - 0.5) = 0.975\) |

After just four transitions, \(Q(0, \text{stay})\) has climbed from \(0\) toward its true value of \(10\), and \(Q(1, \text{switch})\) is making progress toward \(9\). The interesting feature is step 3: the behavior policy is uniform random, so the agent may well take switch on its next visit to state \(0\), yet the target bootstraps from \(\max_{a'} Q(0, a') = 0.5\), the value of the greedy action stay. This is the off-policy bootstrap in action.
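
The four updates are easy to replay in code. The dynamics below are an assumption consistent with the stated \(Q^*\) values: stay keeps the current state, switch flips it, and only \((0, \text{stay})\) pays reward 1.

import numpy as np

def step(s, a):                  # a: 0 = stay, 1 = switch
    s_next = s if a == 0 else 1 - s
    r = 1.0 if (s == 0 and a == 0) else 0.0
    return r, s_next

gamma, alpha = 0.9, 0.5
Q = np.zeros((2, 2))
for s, a in [(0, 0), (0, 1), (1, 1), (0, 0)]:   # the four transitions above
    r, s_next = step(s, a)
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    print(f"Q({s}, {'stay' if a == 0 else 'switch'}) = {Q[s, a]:.3f}")
# Prints 0.500, 0.000, 0.225, 0.975, matching the table.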

Many thousands of transitions later, and with a suitably decaying \(\alpha\), \(Q\) converges to \(Q^*\).

Correctness

Tabular Q-learning converges to \(Q^*\) with probability 1 under the conditions above: every \((s, a)\) visited infinitely often, and \(\alpha_t\) satisfying Robbins–Monro (Watkins and Dayan 1992). (proof)

The proof is a stochastic-approximation argument: the Bellman optimality operator is a \(\gamma\)-contraction in the \(\ell^\infty\) norm (proof), and Q-learning is a noisy fixed-point iteration of that operator. The noise has zero mean conditional on the current \(Q\) and bounded variance, so standard stochastic-approximation convergence results apply.
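
One standard way to set up the argument is to split each update into the deterministic operator application plus a zero-mean noise term:

\[ Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \bigl[(\mathcal{T}^* Q_t)(s_t, a_t) + w_t\bigr], \]

\[ w_t = r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - (\mathcal{T}^* Q_t)(s_t, a_t), \qquad \mathbb{E}\!\left[w_t \mid Q_t, s_t, a_t\right] = 0. \]

Contraction drives the deterministic part toward \(Q^*\); the Robbins–Monro step sizes average the noise away.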

Complexity and Tradeoffs

Each update is \(O(|A|)\) time (the max over actions) and \(O(1)\) memory beyond the \(Q\) table. The total experience needed to reach a target accuracy is highly problem-dependent — see “Sample efficiency” below.

Sample efficiency. Tabular Q-learning needs each \((s, a)\) visited many times, so the sample complexity scales with \(|S| |A|\). Function approximation generalizes across similar states at the cost of stability.

Maximization bias. \(\max_a Q(s, a)\) overestimates \(\max_a Q^*(s, a)\) whenever \(Q\) is a noisy estimate: the max is convex, so by Jensen’s inequality the expected max of unbiased estimates exceeds the max of their expectations. Double Q-learning (and Double DQN) corrects this with two independent estimators, using one to select the action and the other to evaluate it.
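
A quick simulation makes the bias concrete. The setting is hypothetical: five actions whose true values are all \(0\), estimated with unit Gaussian noise.

import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 5
true_q = np.zeros(k)                       # max_a Q*(s, a) = 0

# Single estimator: the max of unbiased noisy estimates is biased upward.
noisy = true_q + rng.normal(size=(n, k))
print(noisy.max(axis=1).mean())            # ~1.16, not 0

# Double estimator: select the action with one independent sample,
# evaluate it with another. The bias disappears.
select = true_q + rng.normal(size=(n, k))
evaluate = true_q + rng.normal(size=(n, k))
picked = select.argmax(axis=1)
print(evaluate[np.arange(n), picked].mean())   # ~0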

Discrete action spaces only. \(\max_{a'} Q(s', a')\) requires enumerating actions, so Q-learning applies directly only when the action set is discrete and small enough to scan. Continuous action spaces call for actor-critic methods (e.g., DDPG, SAC, TD3) that approximate the maximization with a learned policy.

When to Use It

| Situation | Approach |
|-----------|----------|
| Small tabular MDP, samples available | Q-learning. |
| Want to learn from logged / off-policy data | Q-learning (off-policy by construction). |
| On-policy data only, want behavior to match learned policy | SARSA. |
| Large or continuous state, discrete actions | DQN (see Variants). |
| Continuous actions | DDPG, SAC, TD3, or other actor-critic methods. |
| Model is known | Value iteration or policy iteration. |

Variants

  • SARSA (on-policy TD control). Replace \(\max_{a'} Q(s', a')\) with \(Q(s', a')\) where \(a' \sim \mu\). Learns \(Q^\mu\). Safer in environments where exploration is costly (e.g., cliff-walking) because it values the actions actually taken.
  • Double Q-learning. Two Q-functions \(Q_1, Q_2\) trained alternately; the target uses \(Q_2(s', \arg\max_{a'} Q_1(s', a'))\) or vice versa, so action selection and action evaluation use independent estimates. This removes the maximization bias (see the sketch after this list).
  • Deep Q-Network (DQN). Replaces tabular \(Q\) with a neural network \(Q_\theta\). The combination of bootstrapping, off-policy learning, and function approximation is the “deadly triad” — DQN restores stability with a replay buffer (sample past transitions uniformly to break temporal correlation) and a target network (\(\theta^-\) updated slowly, used in the bootstrap target). The semi-gradient update is \[ \theta \leftarrow \theta + \alpha \bigl(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a)\bigr) \nabla_\theta Q_\theta(s, a). \] DQN’s success on Atari (Mnih et al. 2015) launched modern deep reinforcement learning.
  • Double DQN, dueling DQN, prioritized replay, distributional DQN. Successive refinements of the DQN template, each addressing one specific failure mode (overestimation, value/advantage decomposition, non-uniform sample importance, return distributions).
  • n-step Q-learning. Use the \(n\)-step return \(r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a'} Q(s_{t+n}, a')\) as the target. Larger \(n\) means less bootstrap bias but more variance; \(n = 1\) is plain Q-learning, \(n \to \infty\) is Monte Carlo.
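
A minimal sketch of the double Q-learning update from the list above, with \(Q_1, Q_2\) as tabular numpy arrays:

import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng):
    # Flip a fair coin to decide which estimator to update; the other
    # one evaluates the action that the updated one selects.
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q1[s_next]))          # select with Q1
        target = r + gamma * Q2[s_next, a_star]      # evaluate with Q2
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = int(np.argmax(Q2[s_next]))          # select with Q2
        target = r + gamma * Q1[s_next, a_star]      # evaluate with Q1
        Q2[s, a] += alpha * (target - Q2[s, a])

Acting typically uses an \(\varepsilon\)-greedy policy over the sum \(Q_1 + Q_2\).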

References

Watkins, Christopher J. C. H., and Peter Dayan. 1992. “Q-Learning.” Machine Learning 8 (3-4): 279–92. https://doi.org/10.1007/bf00992698.