Tabular Q-Learning Converges to Q*

Claim

Let \((S, A, P, R, \gamma)\) be a finite MDP with \(\gamma \in [0, 1)\) and bounded rewards. Run Q-learning with the update

\[ Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\!\left[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a')\right] \]

and \(Q_{t+1}(s, a) = Q_t(s, a)\) for \((s, a) \ne (s_t, a_t)\). Assume:

  1. Every \((s, a)\) pair is visited infinitely often.
  2. The learning rates satisfy the Robbins–Monro conditions for each \((s, a)\): \(\sum_t \alpha_t(s, a) = \infty\) and \(\sum_t \alpha_t(s, a)^2 < \infty\).
  3. Rewards are bounded: \(|R(s, a)| \le R_{\max}\).

Then \(Q_t \to Q^*\) with probability 1 (Watkins and Dayan 1992).
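
For concreteness, here is a minimal tabular implementation of this update. The environment interface (`env.reset() -> s`, `env.step(a) -> (s_next, r, done)`) and the step-size schedule \(\alpha_t(s, a) = 1/(1 + \text{visit count})\) are illustrative assumptions, not part of the theorem; any schedule satisfying assumption 2 works.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.99, episodes=10_000, eps=0.1, seed=0):
    """Tabular Q-learning sketch. The env.reset()/env.step(a) interface is an assumed
    convention; any behavior policy that keeps visiting every pair satisfies assumption 1."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions), dtype=int)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy: keeps every (s, a) visited (assumption 1)
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)

            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]  # satisfies the Robbins-Monro conditions (assumption 2)
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
```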

Setup

Let \(\mathcal{T}^*\) denote the Bellman optimality operator on \(Q\):

\[ (\mathcal{T}^* Q)(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[R(s, a) + \gamma \max_{a'} Q(s', a')\right]. \]

\(\mathcal{T}^*\) is a \(\gamma\)-contraction in the \(\ell^\infty\) norm with unique fixed point \(Q^*\): the contraction follows from \(\bigl|\max_{a'} x_{a'} - \max_{a'} y_{a'}\bigr| \le \max_{a'} |x_{a'} - y_{a'}|\), and uniqueness of the fixed point from the Banach fixed-point theorem.
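
As a sanity check on this claim, a small numerical sketch (assuming the model is given as arrays `P` of shape \((|S|, |A|, |S|)\) and `R` of shape \((|S|, |A|)\), a convention chosen here for illustration) applies \(\mathcal{T}^*\), iterates it to its fixed point, and verifies the contraction empirically.

```python
import numpy as np

def bellman_optimality_op(Q, P, R, gamma):
    """(T*Q)(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * max_{a'} Q(s', a').
    Assumed shapes: P is (S, A, S); R and Q are (S, A)."""
    return R + gamma * P @ Q.max(axis=1)

def q_value_iteration(P, R, gamma, tol=1e-10):
    """Iterate T* from Q = 0; the gamma-contraction gives geometric convergence to Q*."""
    Q = np.zeros_like(R)
    while True:
        Q_next = bellman_optimality_op(Q, P, R, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next

def check_contraction(P, R, gamma, trials=100, seed=0):
    """Empirically confirm ||T*Q1 - T*Q2||_inf <= gamma * ||Q1 - Q2||_inf."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        Q1, Q2 = rng.normal(size=(2, *R.shape))
        lhs = np.max(np.abs(bellman_optimality_op(Q1, P, R, gamma)
                            - bellman_optimality_op(Q2, P, R, gamma)))
        assert lhs <= gamma * np.max(np.abs(Q1 - Q2)) + 1e-12
```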

Define the error process \(\Delta_t = Q_t - Q^*\). The Q-learning update can be rewritten as a stochastic-approximation recursion driven by \(\Delta_t\) and a martingale-difference noise term.

The Stochastic Approximation Lemma

The proof relies on the following standard result.

Lemma (asynchronous stochastic approximation). Let \(\Delta_t : S \times A \to \mathbb{R}\) evolve by

\[ \Delta_{t+1}(s, a) = (1 - \alpha_t(s, a))\, \Delta_t(s, a) + \alpha_t(s, a)\, F_t(s, a) \]

with respect to a filtration \(\mathcal{F}_t\). Suppose:

  1. \(\sum_t \alpha_t(s, a) = \infty\), \(\sum_t \alpha_t(s, a)^2 < \infty\) for every \((s, a)\), almost surely.

  2. There exists \(\gamma \in [0, 1)\) such that \(\bigl\|\mathbb{E}[F_t \mid \mathcal{F}_t]\bigr\|_\infty \le \gamma \|\Delta_t\|_\infty\).

  3. \(\operatorname{Var}\!\bigl[F_t(s, a) \mid \mathcal{F}_t\bigr] \le K(1 + \|\Delta_t\|_\infty^2)\) for some constant \(K\).

Then \(\Delta_t \to 0\) almost surely.

This is a standard result in stochastic approximation; see Jaakkola, Jordan, and Singh (1994) or Bertsekas and Tsitsiklis (1996) for the proof.
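
To see the lemma in action, the toy simulation below tracks a single scalar coordinate \(\Delta_t\) with \(F_t = \gamma \Delta_t + \text{noise}\), so that hypotheses (2) and (3) hold by construction; the noise scale, \(\gamma\), and the schedule \(\alpha_t = 1/t\) are arbitrary illustrative choices.

```python
import numpy as np

# Scalar illustration of the lemma: Delta_{t+1} = (1 - alpha_t) Delta_t + alpha_t F_t,
# with E[F_t | past] = gamma * Delta_t (hypothesis 2) and unit-variance noise (hypothesis 3).
rng = np.random.default_rng(0)
gamma, delta = 0.9, 5.0
for t in range(1, 200_001):
    alpha = 1.0 / t                            # Robbins-Monro schedule (hypothesis 1)
    F = gamma * delta + rng.normal()           # conditional mean gamma * delta, variance 1
    delta = (1.0 - alpha) * delta + alpha * F
print(f"|Delta_T| after 2e5 steps: {abs(delta):.4f}")   # shrinks toward 0
```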

Proof of the Main Claim

Step 1: Express the update as a stochastic approximation.

Let \(\mathcal{F}_t\) be the \(\sigma\)-algebra of all events through time \(t\) (states, actions, rewards, and \(Q_t\)). On the visited pair \((s_t, a_t)\),

\[ Q_{t+1}(s_t, a_t) - Q^*(s_t, a_t) = (1 - \alpha_t)\bigl(Q_t(s_t, a_t) - Q^*(s_t, a_t)\bigr) + \alpha_t F_t(s_t, a_t), \]

where \(\alpha_t\) abbreviates \(\alpha_t(s_t, a_t)\) and

\[ F_t(s, a) = r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q^*(s, a). \]

For unvisited pairs, \(\alpha_t = 0\) and \(\Delta_{t+1} = \Delta_t\). So \(\Delta_t\) obeys the recursion of the lemma.

Step 2: Verify hypothesis (2) — contraction in expectation.

The expectation of \(F_t(s, a)\) given \(\mathcal{F}_t\) is

\[ \mathbb{E}[F_t(s, a) \mid \mathcal{F}_t] = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\!\left[\max_{a'} Q_t(s', a')\right] - Q^*(s, a) = (\mathcal{T}^* Q_t)(s, a) - Q^*(s, a). \]

Since \(Q^* = \mathcal{T}^* Q^*\), this equals \((\mathcal{T}^* Q_t - \mathcal{T}^* Q^*)(s, a)\). The contraction property of \(\mathcal{T}^*\) gives

\[ \bigl|\mathbb{E}[F_t(s, a) \mid \mathcal{F}_t]\bigr| \le \|\mathcal{T}^* Q_t - \mathcal{T}^* Q^*\|_\infty \le \gamma \|Q_t - Q^*\|_\infty = \gamma \|\Delta_t\|_\infty. \]

Hypothesis (2) holds.
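
The identity \(\mathbb{E}[F_t \mid \mathcal{F}_t] = \mathcal{T}^* Q_t - Q^*\) can also be checked numerically. The sketch below builds a small random MDP (sizes and construction chosen arbitrarily for illustration), computes \(Q^*\) by fixed-point iteration, and compares a Monte Carlo average of \(F_t(s, a)\) against \((\mathcal{T}^* Q_t - Q^*)(s, a)\).

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(-1.0, 1.0, size=(S, A))      # deterministic rewards, for simplicity

T = lambda Q: R + gamma * P @ Q.max(axis=1)  # Bellman optimality operator
Q_star = np.zeros((S, A))
for _ in range(2000):                        # fixed-point iteration converges to Q*
    Q_star = T(Q_star)

Q_t = rng.normal(size=(S, A))                # an arbitrary current Q-table
s, a = 1, 0
s_next = rng.choice(S, size=200_000, p=P[s, a])            # samples s' ~ P(. | s, a)
F = R[s, a] + gamma * Q_t[s_next].max(axis=1) - Q_star[s, a]
print(F.mean(), (T(Q_t) - Q_star)[s, a])     # the two values agree up to Monte Carlo error
```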

Step 3: Verify hypothesis (3) — bounded variance.

The reward is bounded by \(R_{\max}\), and \(\max_{a'} Q_t(s', a') \le \|Q_t\|_\infty \le \|Q^*\|_\infty + \|\Delta_t\|_\infty\), so \(|F_t(s, a)| \le R_{\max} + (1 + \gamma)\|Q^*\|_\infty + \gamma\|\Delta_t\|_\infty\). Squaring and taking expectations gives

\[ \operatorname{Var}[F_t(s, a) \mid \mathcal{F}_t] \le \mathbb{E}[F_t(s, a)^2 \mid \mathcal{F}_t] \le C\bigl(1 + \|\Delta_t\|_\infty^2\bigr) \]

for a constant \(C\) depending only on \(R_{\max}\), \(\gamma\), and \(\|Q^*\|_\infty\). Hypothesis (3) holds.

Step 4: Apply the lemma.

Hypothesis (1) of the lemma is supplied by the theorem's assumptions: the step sizes satisfy the Robbins–Monro conditions, and the infinite-visit assumption ensures each sum runs over infinitely many nonzero terms. The lemma now gives \(\Delta_t \to 0\) almost surely, i.e., \(Q_t \to Q^*\). \(\blacksquare\)

Remarks

Why the visit-every-pair assumption is essential. If some \((s, a)\) is never updated, \(Q_t(s, a)\) remains at its initial value indefinitely and there is no way to approach \(Q^*(s, a)\). Standard exploration schemes — \(\varepsilon\)-greedy with \(\varepsilon > 0\), soft policies — guarantee infinite visits in finite ergodic MDPs.

Off-policy is invisible to the proof. The argument never used the fact that the actions \(a_t\) were chosen by any particular policy. As long as the visit assumption holds, the update converges. This is the formal statement of “Q-learning is off-policy.”

No rate of convergence. The theorem is asymptotic. Concrete rates require additional structure (mixing assumptions, geometric ergodicity); even constant-step-size Q-learning has only recently received tight finite-sample analyses.

Function approximation breaks the proof. Step 2 relied on the update applying \(\mathcal{T}^*\) exactly at the visited table entry, with no projection. With function approximation, each update is effectively composed with a projection onto the approximator class, and the projected operator generally fails to be an \(\ell^\infty\)-contraction, so the argument collapses. Together with bootstrapping and off-policy sampling, this is one face of the “deadly triad.”

References

Bertsekas, Dimitri P., and John N. Tsitsiklis. 1996. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.

Jaakkola, Tommi, Michael I. Jordan, and Satinder P. Singh. 1994. “On the Convergence of Stochastic Iterative Dynamic Programming Algorithms.” Neural Computation 6 (6): 1185–1201.

Watkins, Christopher J. C. H., and Peter Dayan. 1992. “Q-Learning.” Machine Learning 8 (3-4): 279–92. https://doi.org/10.1007/bf00992698.