Tabular Q-Learning Converges to Q*
Claim
Let \((S, A, P, R, \gamma)\) be a finite MDP with \(\gamma \in [0, 1)\) and bounded rewards. Run Q-learning with the update
\[ Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\!\left[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a')\right] \]
and \(Q_{t+1}(s, a) = Q_t(s, a)\) for \((s, a) \ne (s_t, a_t)\). Assume:
- Every \((s, a)\) pair is visited infinitely often.
- The learning rates satisfy the Robbins–Monro conditions for each \((s, a)\): \(\sum_t \alpha_t(s, a) = \infty\) and \(\sum_t \alpha_t(s, a)^2 < \infty\).
- Rewards are bounded: \(|R(s, a)| \le R_{\max}\).
Then \(Q_t \to Q^*\) with probability 1 (Watkins and Dayan 1992).
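As a concrete reference, here is a minimal sketch of the tabular algorithm the claim is about, assuming a hypothetical environment object `env` exposing `env.reset()` and `env.step(a)` returning `(s_next, r, done)` (these names are illustrative, not part of the theorem). The step size \(\alpha_t(s, a) = 1/n_t(s, a)\), where \(n_t\) counts visits, satisfies the Robbins–Monro conditions.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma, n_steps, eps=0.1, seed=0):
    """Tabular Q-learning with per-pair step sizes alpha = 1 / visit count."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions), dtype=int)

    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy behavior policy; any policy giving infinite visits works
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))

        s_next, r, done = env.step(a)          # hypothetical interface, see lead-in

        counts[s, a] += 1
        alpha = 1.0 / counts[s, a]                 # Robbins-Monro step size
        target = r + gamma * np.max(Q[s_next])     # r_t + gamma * max_a' Q_t(s', a')
        Q[s, a] += alpha * (target - Q[s, a])      # the update from the claim

        s = env.reset() if done else s_next
    return Q
```

The update line `Q[s, a] += alpha * (target - Q[s, a])` is algebraically the same convex combination as the displayed update.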
Setup
Let \(\mathcal{T}^*\) denote the Bellman optimality operator on \(Q\):
\[ (\mathcal{T}^* Q)(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[R(s, a) + \gamma \max_{a'} Q(s', a')\right]. \]
\(\mathcal{T}^*\) is a \(\gamma\)-contraction in the \(\ell^\infty\) norm with unique fixed point \(Q^*\).
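For completeness, the contraction is immediate from the nonexpansiveness of the max: for any \(Q_1, Q_2\) and every \((s, a)\),
\[ \bigl|(\mathcal{T}^* Q_1)(s, a) - (\mathcal{T}^* Q_2)(s, a)\bigr| = \gamma \Bigl| \mathbb{E}_{s'}\bigl[\max_{a'} Q_1(s', a') - \max_{a'} Q_2(s', a')\bigr] \Bigr| \le \gamma\, \mathbb{E}_{s'}\Bigl[\max_{a'} \bigl|Q_1(s', a') - Q_2(s', a')\bigr|\Bigr] \le \gamma \|Q_1 - Q_2\|_\infty, \]
so \(\|\mathcal{T}^* Q_1 - \mathcal{T}^* Q_2\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty\), and Banach's fixed-point theorem gives the unique fixed point \(Q^*\).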
Define the error process \(\Delta_t = Q_t - Q^*\). The Q-learning update can be rewritten as a stochastic-approximation recursion driven by \(\Delta_t\) and a martingale-difference noise term.
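To make the setup concrete in code, here is a sketch under illustrative array conventions (not from the text): `P[s, a, s2]` holds transition probabilities and `R[s, a]` expected rewards. It applies \(\mathcal{T}^*\) directly and iterates it to the fixed point \(Q^*\), the reference point for \(\Delta_t\).

```python
import numpy as np

def bellman_optimality(Q, P, R, gamma):
    """Apply T* to Q: (T*Q)(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    return R + gamma * P @ Q.max(axis=1)

def q_star(P, R, gamma, tol=1e-10):
    """Iterate T* to its unique fixed point Q* (Banach fixed-point iteration)."""
    Q = np.zeros_like(R)
    while True:
        Q_next = bellman_optimality(Q, P, R, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
```

The fixed-point iteration converges geometrically at rate \(\gamma\), by the contraction argument above.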
The Stochastic Approximation Lemma
The proof relies on the following standard result.
Lemma (asynchronous stochastic approximation). Let \(\Delta_t : S \times A \to \mathbb{R}\) evolve by
\[ \Delta_{t+1}(s, a) = (1 - \alpha_t(s, a))\, \Delta_t(s, a) + \alpha_t(s, a)\, F_t(s, a) \]
with respect to a filtration \(\mathcal{F}_t\). Suppose:
- (i) \(\sum_t \alpha_t(s, a) = \infty\) and \(\sum_t \alpha_t(s, a)^2 < \infty\) for every \((s, a)\), almost surely.
- (ii) There exists \(\gamma \in [0, 1)\) such that \(\bigl\|\mathbb{E}[F_t \mid \mathcal{F}_t]\bigr\|_\infty \le \gamma \|\Delta_t\|_\infty\).
- (iii) \(\operatorname{Var}\!\bigl[F_t(s, a) \mid \mathcal{F}_t\bigr] \le K\bigl(1 + \|\Delta_t\|_\infty^2\bigr)\) for some constant \(K\).
Then \(\Delta_t \to 0\) almost surely.
This is a standard result in stochastic approximation; see Jaakkola, Jordan, and Singh (1994) or Bertsekas and Tsitsiklis (1996) for the proof.
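A standard schedule that meets these conditions is the inverse visit count: if \(n_t(s, a)\) is the number of visits to \((s, a)\) up to time \(t\) and \(\alpha_t(s, a) = 1/n_t(s, a)\) at visit times (zero otherwise), then along the visits the two sums are
\[ \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty, \]
so hypothesis (i) holds whenever every pair is visited infinitely often.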
Proof of the Main Claim
Step 1: Express the update as a stochastic approximation.
Let \(\mathcal{F}_t\) be the \(\sigma\)-algebra of all events through time \(t\) (states, actions, rewards, and \(Q_t\)). On the visited pair \((s_t, a_t)\),
\[ Q_{t+1}(s_t, a_t) - Q^*(s_t, a_t) = (1 - \alpha_t)\bigl(Q_t(s_t, a_t) - Q^*(s_t, a_t)\bigr) + \alpha_t F_t(s_t, a_t), \]
where
\[ F_t(s, a) = r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q^*(s, a). \]
For unvisited pairs, \(\alpha_t(s, a) = 0\) and \(\Delta_{t+1}(s, a) = \Delta_t(s, a)\). So \(\Delta_t\) obeys the recursion of the lemma.
Step 2: Verify hypothesis (ii) — contraction in expectation.
The expectation of \(F_t(s, a)\) given \(\mathcal{F}_t\) is
\[ \mathbb{E}[F_t(s, a) \mid \mathcal{F}_t] = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\!\left[\max_{a'} Q_t(s', a')\right] - Q^*(s, a) = (\mathcal{T}^* Q_t)(s, a) - Q^*(s, a). \]
Since \(Q^* = \mathcal{T}^* Q^*\), this equals \((\mathcal{T}^* Q_t - \mathcal{T}^* Q^*)(s, a)\). The contraction property of \(\mathcal{T}^*\) gives
\[ \bigl|\mathbb{E}[F_t(s, a) \mid \mathcal{F}_t]\bigr| \le \|\mathcal{T}^* Q_t - \mathcal{T}^* Q^*\|_\infty \le \gamma \|Q_t - Q^*\|_\infty = \gamma \|\Delta_t\|_\infty. \]
Hypothesis (ii) holds.
Step 3: Verify hypothesis (iii) — bounded variance.
The reward is bounded by \(R_{\max}\), and \(\max_{a'} Q_t(s', a')\) is bounded by \(\|Q_t\|_\infty \le \|Q^*\|_\infty + \|\Delta_t\|_\infty\). Hence
\[ \operatorname{Var}[F_t(s, a) \mid \mathcal{F}_t] \le \mathbb{E}[F_t(s, a)^2 \mid \mathcal{F}_t] \le C\bigl(1 + \|\Delta_t\|_\infty^2\bigr) \]
for a constant \(C\) depending on \(R_{\max}\), \(\gamma\), and \(\|Q^*\|_\infty\). Hypothesis (iii) holds.
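In more detail: \(|r_t| \le R_{\max}\) and \(|Q^*(s, a)| \le \|Q^*\|_\infty\), so
\[ |F_t(s, a)| \le R_{\max} + \gamma\bigl(\|Q^*\|_\infty + \|\Delta_t\|_\infty\bigr) + \|Q^*\|_\infty, \]
and therefore \(F_t(s, a)^2 \le 2\bigl(R_{\max} + (1 + \gamma)\|Q^*\|_\infty\bigr)^2 + 2\gamma^2 \|\Delta_t\|_\infty^2 \le C\bigl(1 + \|\Delta_t\|_\infty^2\bigr)\) with \(C = 2\max\bigl\{\bigl(R_{\max} + (1 + \gamma)\|Q^*\|_\infty\bigr)^2,\ \gamma^2\bigr\}\).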
Step 4: Apply the lemma.
Hypothesis (i) is exactly the learning-rate assumption of the theorem; the infinite-visit assumption is what lets the per-pair sums \(\sum_t \alpha_t(s, a)\) diverge. The lemma now gives \(\Delta_t \to 0\) almost surely, i.e., \(Q_t \to Q^*\). \(\blacksquare\)
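A small numerical sanity check of the claim, under purely illustrative choices (a random 5-state, 3-action MDP with deterministic rewards, a uniformly random behavior policy, and \(\alpha_t(s, a) = 1/n_t(s, a)\)); the printed error \(\|Q_t - Q^*\|_\infty\) should decrease, though slowly, since the guarantee is only asymptotic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Illustrative random MDP: P[s, a, :] is a distribution over next states,
# rewards are deterministic in (s, a).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# Q* via fixed-point iteration of the Bellman optimality operator.
Q_star = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Q-learning driven by a uniformly random behavior policy (off-policy),
# with per-pair step sizes alpha = 1 / visit count.
Q = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))
s = 0
for t in range(1, 500_001):
    a = int(rng.integers(n_actions))
    s_next = int(rng.choice(n_states, p=P[s, a]))
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
    if t % 100_000 == 0:
        print(t, float(np.max(np.abs(Q - Q_star))))  # ||Q_t - Q*||_inf
```

Because the behavior policy never depends on \(Q_t\), the run also illustrates the off-policy remark below: only coverage of every \((s, a)\) matters.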
Remarks
Why the visit-every-pair assumption is essential. If some \((s, a)\) is never updated, \(Q_t(s, a)\) remains at its initial value indefinitely and there is no way to approach \(Q^*(s, a)\). Standard exploration schemes — \(\varepsilon\)-greedy with \(\varepsilon > 0\), soft policies — guarantee infinite visits in finite ergodic MDPs.
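For instance, a minimal sketch of a soft (Boltzmann) policy; `temperature` is an illustrative parameter, and any fixed positive temperature keeps every action's probability strictly positive, so each action is tried infinitely often in every recurrent state.

```python
import numpy as np

def soft_policy_action(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q(s, a) / temperature)."""
    prefs = q_values / temperature
    prefs -= prefs.max()          # shift for numerical stability; probabilities unchanged
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```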
Off-policy is invisible to the proof. The argument never used the fact that the actions \(a_t\) were chosen by any particular policy. As long as the visit assumption holds, the update converges. This is the formal statement of “Q-learning is off-policy.”
No rate of convergence. The theorem is asymptotic. Concrete rates require additional structure (mixing assumptions, geometric ergodicity); even constant-step-size Q-learning has only recently received tight finite-sample analyses.
Function approximation breaks the proof. Step 2 relied on the tabular update applying the Bellman backup exactly at the visited entry, so the expected update contracts in the \(\ell^\infty\) norm. With function approximation the backup is projected onto the approximator class; the composition of projection and \(\mathcal{T}^*\) is generally not an \(\ell^\infty\)-contraction, and the argument collapses. This is one face of the “deadly triad.”