OFUL Achieves \(\tilde{O}(d\sqrt{T})\) Regret

Claim

Consider the linear bandit problem with actions \(\|a\| \leq L\), unknown \(\theta^*\) with \(\|\theta^*\| \leq S\), \(\sigma\)-sub-Gaussian noise, and regularization \(\lambda > 0\). Define

\[ \beta_t = \sigma\sqrt{2\log\!\tfrac{1}{\delta} + d\log\!\left(1 + \tfrac{tL^2}{\lambda d}\right)} + \sqrt{\lambda}\,S. \]

Then with probability at least \(1 - \delta\), the OFUL algorithm satisfies

\[ \text{Regret}(T) \leq 2\beta_T \sqrt{2Td\log\!\left(1 + \frac{TL^2}{\lambda d}\right)}. \]

This is \(\tilde{O}(d\sqrt{T})\) after choosing \(\lambda = 1\) and \(\delta = 1/T\).
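Before the proof, here is a minimal numerical sketch of OFUL on a toy instance, checking that the realized regret stays below the bound in the claim. The action-set size (20 fresh unit vectors per round), the seed, and all constants are illustrative assumptions, not part of the theorem.

```python
import numpy as np

def oful_round(A, b, actions, beta):
    """One OFUL round: play the action maximizing the upper confidence bound
    a^T theta_hat + beta * ||a||_{A^{-1}} over the current action set."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                                       # ridge estimate
    widths = np.sqrt(np.einsum('ij,jk,ik->i', actions, A_inv, actions))
    return actions[np.argmax(actions @ theta_hat + beta * widths)]

# Hypothetical toy instance: d = 2, unit-norm actions, theta* = (1, 0).
rng = np.random.default_rng(0)
d, T = 2, 500
lam, sigma, S, L, delta = 1.0, 0.1, 1.0, 1.0, 0.01
theta_star = np.array([1.0, 0.0])
A, b, regret = lam * np.eye(d), np.zeros(d), 0.0
for t in range(1, T + 1):
    # beta_t exactly as defined in the claim
    beta = sigma * np.sqrt(2 * np.log(1 / delta)
                           + d * np.log(1 + t * L**2 / (lam * d))) + np.sqrt(lam) * S
    actions = rng.normal(size=(20, d))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)   # ||a|| = L = 1
    a = oful_round(A, b, actions, beta)
    reward = a @ theta_star + sigma * rng.normal()              # sub-Gaussian noise
    A += np.outer(a, a)
    b += reward * a
    regret += np.max(actions @ theta_star) - a @ theta_star
# Regret bound from the claim, with beta = beta_T after the loop
bound = 2 * beta * np.sqrt(2 * T * d * np.log(1 + T * L**2 / (lam * d)))
```

On this instance the realized regret sits far below `bound`, which is typical: the theoretical constant is loose in practice.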

Step 1: Confidence Set Contains \(\theta^*\)

The ridge estimator satisfies the identity

\[ \hat{\theta}_t - \theta^* = A_t^{-1} \sum_{s=1}^t a_s \eta_s - \lambda A_t^{-1} \theta^*, \]

where \(A_t = \lambda I + \sum_{s=1}^t a_s a_s^\top\). The first term is a weighted sum of martingale differences. By the self-normalized martingale inequality (Abbasi-Yadkori et al. 2011, Theorem 1), with probability at least \(1 - \delta\), for all \(t \geq 0\) simultaneously:

\[ \left\| \sum_{s=1}^t a_s \eta_s \right\|_{A_t^{-1}} \leq \sigma\sqrt{2\log\!\tfrac{1}{\delta} + \log\!\tfrac{\det A_t}{\det(\lambda I)}}. \]

Combining with \(\lambda \|A_t^{-1} \theta^*\|_{A_t} \leq \sqrt{\lambda}\|\theta^*\| \leq \sqrt{\lambda}S\) and using \(\log(\det A_t / \det(\lambda I)) \leq d\log(1 + tL^2/(\lambda d))\) gives \(\|\hat{\theta}_t - \theta^*\|_{A_t} \leq \beta_t\). Hence \(\theta^* \in \mathcal{C}_t := \{\theta : \|\theta - \hat{\theta}_t\|_{A_t} \leq \beta_t\}\) for all \(t\) on a probability-\(1-\delta\) event, on which we condition for the rest of the proof.
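Both ingredients of Step 1 (the ridge decomposition and the bias bound \(\lambda\|A_t^{-1}\theta^*\|_{A_t} \leq \sqrt{\lambda}\|\theta^*\|\)) are exact algebraic facts and can be checked numerically; the dimensions, seed, and noise distribution below are illustrative.

```python
import numpy as np

# Check the ridge decomposition
#   theta_hat - theta* = A_t^{-1} sum_s a_s eta_s - lam * A_t^{-1} theta*
# and the bias bound lam * ||A_t^{-1} theta*||_{A_t} <= sqrt(lam) * ||theta*||.
rng = np.random.default_rng(1)
d, t, lam = 3, 50, 0.5
theta_star = rng.normal(size=d)
a = rng.normal(size=(t, d))                 # rows are the played actions a_s
eta = rng.normal(size=t)                    # noise eta_s
r = a @ theta_star + eta                    # observed rewards
A = lam * np.eye(d) + a.T @ a
theta_hat = np.linalg.solve(A, a.T @ r)     # ridge estimator
lhs = theta_hat - theta_star
rhs = np.linalg.solve(A, a.T @ eta) - lam * np.linalg.solve(A, theta_star)
assert np.allclose(lhs, rhs)
v = np.linalg.solve(A, theta_star)          # A_t^{-1} theta*
assert lam * np.sqrt(v @ A @ v) <= np.sqrt(lam) * np.linalg.norm(theta_star)
```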

Step 2: Optimism Bounds Per-Round Regret

Let \(a_t^* = \arg\max_{a \in \mathcal{A}_t} a^\top \theta^*\) be the optimal action at round \(t\). On the event of Step 1, \(\theta^* \in \mathcal{C}_{t-1}\), so

\[ \max_{\theta \in \mathcal{C}_{t-1}} (a_t^*)^\top \theta \geq (a_t^*)^\top \theta^*. \]

OFUL selects \(a_t\) to maximize the left-hand side over \(\mathcal{A}_t\), so

\[ \max_{\theta \in \mathcal{C}_{t-1}} a_t^\top \theta \geq (a_t^*)^\top \theta^*. \]

Maximizing \(a_t^\top \theta\) over the ellipsoid \(\mathcal{C}_{t-1}\) has the closed form \(\max_{\theta \in \mathcal{C}_{t-1}} a_t^\top \theta = a_t^\top \hat{\theta}_{t-1} + \beta_{t-1}\|a_t\|_{A_{t-1}^{-1}}\). The per-round regret therefore satisfies

\[ \Delta_t = (a_t^*)^\top \theta^* - a_t^\top \theta^* \leq a_t^\top \hat{\theta}_{t-1} + \beta_{t-1}\|a_t\|_{A_{t-1}^{-1}} - a_t^\top \theta^*. \]

By Cauchy–Schwarz in the \(A_{t-1}\)-inner product:

\[ a_t^\top(\hat{\theta}_{t-1} - \theta^*) \leq \|a_t\|_{A_{t-1}^{-1}} \|\hat{\theta}_{t-1} - \theta^*\|_{A_{t-1}} \leq \beta_{t-1} \|a_t\|_{A_{t-1}^{-1}}. \]

Combining:

\[ \Delta_t \leq 2\beta_{t-1} \|a_t\|_{A_{t-1}^{-1}}. \]
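The weighted Cauchy–Schwarz step, \(x^\top y = (A^{-1/2}x)^\top (A^{1/2}y) \leq \|x\|_{A^{-1}}\|y\|_A\), holds for any positive definite \(A\); a quick numerical check on hypothetical data:

```python
import numpy as np

# Sanity check of Cauchy-Schwarz in the A-weighted inner product:
#   x^T y <= ||x||_{A^{-1}} * ||y||_A   for positive definite A.
rng = np.random.default_rng(2)
d = 4
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)          # positive definite; plays the role of A_{t-1}
a_t = rng.normal(size=d)         # the chosen action
diff = rng.normal(size=d)        # stands in for theta_hat_{t-1} - theta*
A_inv = np.linalg.inv(A)
lhs = a_t @ diff
rhs = np.sqrt(a_t @ A_inv @ a_t) * np.sqrt(diff @ A @ diff)
assert lhs <= rhs + 1e-12        # tolerance for floating-point roundoff
```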

Step 3: Sum Using the Potential Lemma

Summing over rounds, and using that \(\beta_t\) is non-decreasing in \(t\):

\[ \text{Regret}(T) \leq 2\beta_T \sum_{t=1}^T \|a_t\|_{A_{t-1}^{-1}}. \]

Apply Cauchy–Schwarz:

\[ \sum_{t=1}^T \|a_t\|_{A_{t-1}^{-1}} \leq \sqrt{T} \cdot \sqrt{\sum_{t=1}^T \|a_t\|_{A_{t-1}^{-1}}^2}. \]

Potential lemma. By the matrix determinant lemma, \(\det A_t = \det A_{t-1} \cdot (1 + \|a_t\|_{A_{t-1}^{-1}}^2)\), so

\[ \prod_{t=1}^T \bigl(1 + \|a_t\|_{A_{t-1}^{-1}}^2\bigr) = \frac{\det A_T}{\det A_0} \leq \left(1 + \frac{TL^2}{\lambda d}\right)^d, \]

where the inequality uses \(\operatorname{tr}(A_T) \leq \lambda d + TL^2\) together with the AM–GM bound \(\det M \leq (\operatorname{tr}(M)/d)^d\) for positive semidefinite \(M\). Taking logarithms:

\[ \sum_{t=1}^T \log\bigl(1 + \|a_t\|_{A_{t-1}^{-1}}^2\bigr) \leq d\log\!\left(1 + \frac{TL^2}{\lambda d}\right). \]

Since \(\log(1+x) \geq x/2\) for \(x \in [0,1]\), and \(\|a_t\|_{A_{t-1}^{-1}}^2 \leq \|a_t\|^2/\lambda \leq L^2/\lambda \leq 1\) whenever \(\lambda \geq L^2\) (in general, bound each term by \(\min\{1, \|a_t\|_{A_{t-1}^{-1}}^2\}\) at the cost of a constant factor), we obtain \(\sum_{t=1}^T \|a_t\|_{A_{t-1}^{-1}}^2 \leq 2d\log(1 + TL^2/(\lambda d))\). Substituting into the Cauchy–Schwarz bound gives

\[ \text{Regret}(T) \leq 2\beta_T \sqrt{T} \cdot \sqrt{2d\log\!\left(1 + \frac{TL^2}{\lambda d}\right)} = 2\beta_T\sqrt{2Td\log\!\left(1 + \frac{TL^2}{\lambda d}\right)}. \]
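Both halves of the potential lemma (the rank-one determinant update and the resulting bound on \(\sum_t \|a_t\|_{A_{t-1}^{-1}}^2\)) can be verified numerically. The sketch below uses \(\lambda = L = 1\), so each \(\|a_t\|_{A_{t-1}^{-1}}^2 \leq 1\) and the \(\log(1+x) \geq x/2\) step applies; the seed and dimensions are illustrative.

```python
import numpy as np

# Check det(A_t) = det(A_{t-1}) * (1 + ||a_t||^2_{A_{t-1}^{-1}}) at each step,
# then the elliptical-potential bound on the sum of squared weighted norms.
rng = np.random.default_rng(3)
d, T, lam, L = 3, 200, 1.0, 1.0
A = lam * np.eye(d)
total = 0.0
for _ in range(T):
    a = rng.normal(size=d)
    a /= max(1.0, np.linalg.norm(a) / L)    # enforce ||a|| <= L
    w = a @ np.linalg.solve(A, a)           # ||a||^2_{A^{-1}}, here <= L^2/lam = 1
    A_new = A + np.outer(a, a)
    # matrix determinant lemma (rank-one update)
    assert np.isclose(np.linalg.det(A_new), np.linalg.det(A) * (1 + w))
    total += w
    A = A_new
# potential bound: sum_t ||a_t||^2 <= 2 d log(1 + T L^2 / (lam d))
assert total <= 2 * d * np.log(1 + T * L**2 / (lam * d))
```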

Remarks

  • The self-normalized bound (Step 1) is the key improvement over a naive Hoeffding argument, which would require the actions \(a_t\) to be fixed in advance. Here actions are chosen adaptively, creating a martingale structure that requires a more careful analysis.
  • Setting \(\lambda = 1\), \(S = 1\), \(\sigma = 1\), \(L = 1\), \(\delta = 1/T\) gives \(\beta_T = O(\sqrt{d \log T})\), so \(\text{Regret}(T) = O(d\sqrt{T\log T}) = \tilde{O}(d\sqrt{T})\).
  • The \(d\sqrt{T}\) rate is minimax optimal: the lower bound \(\Omega(d\sqrt{T})\) follows by embedding \(d\) independent one-dimensional bandit problems (Lattimore and Szepesvári 2020).

References

Abbasi-Yadkori, Yasin, Dávid Pál, and Csaba Szepesvári. 2011. “Improved Algorithms for Linear Stochastic Bandits.” Advances in Neural Information Processing Systems (NeurIPS) 24. https://proceedings.neurips.cc/paper/2011/hash/e1d5be1c7f2f456670de3d53c7b54f4a-Abstract.html.
Lattimore, Tor, and Csaba Szepesvári. 2020. Bandit Algorithms. Cambridge University Press. https://www.cambridge.org/core/books/bandit-algorithms/8E39FD004E6CE036680F90DD0C6F09FC.