Linear Bandits
Motivation
Multi-armed bandits treat each arm’s reward distribution as completely unrelated to the others. When the arms correspond to feature vectors in \(\mathbb{R}^d\) and rewards are linear in those features, the agent can transfer information from one arm to all others. Linear bandits formalize this: the action set may be infinite or continuous, the reward model is linear and parametric, and the exploration bonus has a natural geometric interpretation as an ellipsoidal confidence set around the unknown parameter.
This framework covers both the contextual bandit setting (discrete arms, per-arm feature vectors) and more general settings with arbitrary or continuous action sets.
Setup
At each round \(t = 1, 2, \ldots, T\):
- The agent observes an action set \(\mathcal{A}_t \subseteq \mathbb{R}^d\) (possibly the same set every round, possibly continuous).
- The agent selects \(a_t \in \mathcal{A}_t\).
- The environment returns \(r_t = a_t^\top \theta^* + \eta_t\), where \(\theta^* \in \mathbb{R}^d\) is an unknown parameter vector with \(\|\theta^*\| \leq S\), and \(\eta_t\) is \(\sigma\)-sub-Gaussian conditionally on the history \(a_1, \eta_1, \ldots, a_{t-1}, \eta_{t-1}, a_t\).
Assume actions are bounded: \(\|a\| \leq L\) for all \(a \in \mathcal{A}_t\). The regret is
\[ \text{Regret}(T) = \sum_{t=1}^{T} \max_{a \in \mathcal{A}_t} a^\top \theta^* - a_t^\top \theta^*. \]
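As a concrete illustration, here is a minimal simulation of this protocol. The fixed finite action set, the Gaussian noise, and the uniform-random placeholder policy are all illustrative assumptions, not part of the formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 5, 1000, 0.1
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)                   # ||theta*|| = S = 1
actions = rng.normal(size=(20, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)  # ||a|| = L = 1

best_value = (actions @ theta_star).max()
regret = 0.0
for t in range(T):
    a = actions[rng.integers(len(actions))]        # placeholder: pick uniformly
    r = a @ theta_star + rng.normal(scale=sigma)   # noisy linear reward
    regret += best_value - a @ theta_star          # accumulate pseudo-regret
```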
Ridge Regression and Confidence Sets
After \(t\) rounds, the regularized design matrix and ridge estimate are
\[ A_t = \lambda I + \sum_{s=1}^{t} a_s a_s^\top, \qquad \hat{\theta}_t = A_t^{-1} \sum_{s=1}^{t} r_s a_s, \]
where \(\lambda > 0\) is the regularization parameter. This is the minimizer of the regularized least-squares objective \(\sum_s (r_s - a_s^\top \theta)^2 + \lambda \|\theta\|^2\).
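A direct implementation of these two formulas takes a few lines of NumPy; this is a sketch with illustrative names, computing \(A_t\) and \(\hat{\theta}_t\) in batch from the played actions and observed rewards.

```python
import numpy as np

def ridge_estimate(played, rewards, lam=1.0):
    """Regularized design matrix A_t and ridge estimate theta_hat_t.

    `played` is a (t, d) array whose rows are the actions a_1, ..., a_t;
    `rewards` is the length-t vector of observed r_s.
    """
    played = np.asarray(played)
    rewards = np.asarray(rewards)
    d = played.shape[1]
    A_t = lam * np.eye(d) + played.T @ played              # lambda I + sum a_s a_s^T
    theta_hat = np.linalg.solve(A_t, played.T @ rewards)   # A_t^{-1} sum r_s a_s
    return A_t, theta_hat
```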
The ellipsoidal confidence set around \(\hat{\theta}_t\) is
\[ \mathcal{C}_t = \bigl\{ \theta \in \mathbb{R}^d : \|\hat{\theta}_t - \theta\|_{A_t} \leq \beta_t \bigr\}, \]
where \(\|\cdot\|_{A_t}\) denotes the norm \(\|v\|_{A_t} = \sqrt{v^\top A_t v}\) and
\[ \beta_t = \sigma \sqrt{2 \log\!\tfrac{1}{\delta} + \log\!\tfrac{\det A_t}{\det(\lambda I)}} + \sqrt{\lambda}\, S. \]
By a self-normalized martingale inequality (Abbasi-Yadkori et al. 2011), \(\theta^* \in \mathcal{C}_t\) for all \(t \geq 0\) simultaneously with probability at least \(1 - \delta\). This is the core confidence guarantee that drives both LinUCB and OFUL.
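The radius \(\beta_t\) is cheap to evaluate from the determinants. A sketch (argument names and default values are illustrative):

```python
import numpy as np

def confidence_radius(A_t, lam, sigma=1.0, S=1.0, delta=0.05):
    """Radius beta_t of the confidence ellipsoid C_t."""
    d = A_t.shape[0]
    _, logdet_A = np.linalg.slogdet(A_t)   # log det A_t (A_t is positive definite)
    logdet_reg = d * np.log(lam)           # log det(lambda I)
    return (
        sigma * np.sqrt(2 * np.log(1 / delta) + logdet_A - logdet_reg)
        + np.sqrt(lam) * S
    )
```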
LinUCB
LinUCB (Li et al. 2010) targets the contextual bandit setting with \(K\) discrete arms, where each arm \(a\) observes the shared context vector \(x_t \in \mathbb{R}^d\) but has its own parameter \(\theta_a^*\) (the "disjoint" model). It maintains one design matrix \(A_a\) per arm and selects
\[ a_t = \operatorname*{arg\,max}_{a \in \{1,\ldots,K\}} \left[ x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t} \right]. \]
The bonus \(\sqrt{x_t^\top A_a^{-1} x_t} = \|x_t\|_{A_a^{-1}}\) is the half-width of the confidence interval on \(x_t^\top \theta_a^*\). This implements optimism in the face of uncertainty: arms whose upper confidence bound is highest get pulled, whether because the estimate is high or the uncertainty is high.
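A minimal sketch of the selection step under the disjoint model, assuming the per-arm statistics \(A_a^{-1}\) and \(\hat{\theta}_a\) are maintained elsewhere (all names here are illustrative):

```python
import numpy as np

def linucb_select(x, A_inv_by_arm, theta_hat_by_arm, alpha):
    """Disjoint-model LinUCB step: one (A_a^{-1}, theta_hat_a) pair per arm."""
    ucbs = [
        x @ th + alpha * np.sqrt(x @ A_inv @ x)  # estimate + confidence half-width
        for A_inv, th in zip(A_inv_by_arm, theta_hat_by_arm)
    ]
    return int(np.argmax(ucbs))
```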
LinUCB corresponds to OFUL applied separately to each arm. Sharing a joint parameter via feature vectors \(\phi(x_t, a)\) and a single design matrix recovers the full OFUL generality.
OFUL
OFUL (Optimism in the Face of Uncertainty for Linear bandits; Abbasi-Yadkori et al. 2011) handles arbitrary action sets by maintaining a single design matrix over all played actions. At each round it solves the joint optimization
\[ a_t = \operatorname*{arg\,max}_{a \in \mathcal{A}_t} \max_{\theta \in \mathcal{C}_{t-1}} a^\top \theta. \]
The inner maximization over \(\theta \in \mathcal{C}_{t-1}\) has a closed form: because \(\mathcal{C}_{t-1}\) is an ellipsoid centered at \(\hat{\theta}_{t-1}\),
\[ \max_{\theta \in \mathcal{C}_{t-1}} a^\top \theta = a^\top \hat{\theta}_{t-1} + \beta_{t-1} \|a\|_{A_{t-1}^{-1}}. \]
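To see this, apply Cauchy–Schwarz in the \(A_{t-1}\)-inner product: for any \(\theta \in \mathcal{C}_{t-1}\),
\[ a^\top (\theta - \hat{\theta}_{t-1}) \leq \|a\|_{A_{t-1}^{-1}} \, \|\theta - \hat{\theta}_{t-1}\|_{A_{t-1}} \leq \beta_{t-1} \|a\|_{A_{t-1}^{-1}}, \]
with equality attained at \(\theta = \hat{\theta}_{t-1} + \beta_{t-1} A_{t-1}^{-1} a / \|a\|_{A_{t-1}^{-1}}\).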
So the selection rule is
\[ a_t = \operatorname*{arg\,max}_{a \in \mathcal{A}_t} \left[ a^\top \hat{\theta}_{t-1} + \beta_{t-1} \|a\|_{A_{t-1}^{-1}} \right]. \]
This is exactly the UCB principle in \(\mathbb{R}^d\): the bonus \(\beta_{t-1}\|a\|_{A_{t-1}^{-1}}\) is the maximum possible gap between the true and estimated payoff along direction \(a\), given the current confidence ellipsoid.
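For a finite (or discretized) action set the rule is a one-liner; a sketch with illustrative names follows. For genuinely continuous \(\mathcal{A}_t\) the argmax is in general a nonconvex program and needs problem-specific structure.

```python
import numpy as np

def oful_select(candidates, A_inv, theta_hat, beta):
    """One OFUL step over a finite candidate set (rows of `candidates`)."""
    # Quadratic form a^T A^{-1} a for each candidate row.
    bonus = np.sqrt(np.einsum("id,de,ie->i", candidates, A_inv, candidates))
    ucb = candidates @ theta_hat + beta * bonus   # optimistic value per action
    return candidates[int(np.argmax(ucb))]
```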
After selecting \(a_t\) and observing \(r_t\), update:
\[ A_t = A_{t-1} + a_t a_t^\top, \qquad \hat{\theta}_t = A_t^{-1}\!\left( A_{t-1}\hat{\theta}_{t-1} + r_t a_t \right). \]
The rank-1 update to \(A_t\) and the corresponding update to \(A_t^{-1}\) (via the Sherman–Morrison formula) cost \(O(d^2)\) per round; computing \(\|a\|_{A^{-1}}\) costs \(O(d^2)\) per candidate action.
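A sketch of that \(O(d^2)\) step, keeping the running vector \(b_t = \sum_s r_s a_s\) (an implementation convenience equivalent to the \(A_{t-1}\hat{\theta}_{t-1} + r_t a_t\) form above):

```python
import numpy as np

def rank1_step(A_inv, b, a, r):
    """O(d^2) update after playing action `a` and observing reward `r`."""
    Aa = A_inv @ a
    A_inv = A_inv - np.outer(Aa, Aa) / (1.0 + a @ Aa)  # Sherman-Morrison: (A + a a^T)^{-1}
    b = b + r * a                                      # running sum of r_s a_s
    theta_hat = A_inv @ b                              # = A_t^{-1} b_t
    return A_inv, b, theta_hat
```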
Regret Bound
With probability at least \(1 - \delta\), OFUL achieves
\[ \text{Regret}(T) \leq 2\beta_T \sqrt{2T\, d \log\!\left(1 + \frac{TL^2}{\lambda d}\right)}, \]
which is \(\tilde{O}(d\sqrt{T})\). The logarithmic factors come from the growth of \(\beta_T\) and the potential lemma: the total squared exploration bonus is bounded by \(O(d \log T)\) regardless of which actions are played.
The \(d\sqrt{T}\) rate is minimax optimal for linear bandits: no algorithm can achieve \(o(d\sqrt{T})\) regret in the worst case over \(\theta^*\) and the action sets.
The Potential Lemma
The key technical fact is that the design matrix cannot grow without limit in all directions simultaneously. Specifically, using the matrix determinant lemma \(\det A_t = \det A_{t-1} \cdot (1 + \|a_t\|_{A_{t-1}^{-1}}^2)\):
\[ \prod_{t=1}^{T} \bigl(1 + \|a_t\|_{A_{t-1}^{-1}}^2\bigr) = \frac{\det A_T}{\det A_0} \leq \left(1 + \frac{TL^2}{\lambda d}\right)^d, \]
where the last inequality uses \(\operatorname{tr}(A_T) \leq \lambda d + TL^2\) together with the AM–GM bound \(\det A_T \leq (\operatorname{tr}(A_T)/d)^d\). Taking logarithms:
\[ \sum_{t=1}^{T} \log\bigl(1 + \|a_t\|_{A_{t-1}^{-1}}^2\bigr) \leq d \log\!\left(1 + \frac{TL^2}{\lambda d}\right). \]
Since \(\log(1+x) \geq x/2\) for \(x \in [0,1]\), this bounds \(\sum_t \min(1, \|a_t\|_{A_{t-1}^{-1}}^2)\) and, by Cauchy–Schwarz, controls \(\sum_t \|a_t\|_{A_{t-1}^{-1}}\).
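Putting the pieces together recovers the stated regret bound. On the event \(\theta^* \in \mathcal{C}_{t-1}\), optimism gives the per-round bound \(\max_{a \in \mathcal{A}_t} a^\top \theta^* - a_t^\top \theta^* \leq 2\beta_{t-1} \|a_t\|_{A_{t-1}^{-1}}\). Assuming \(\lambda \geq L^2\), so that each \(\|a_t\|_{A_{t-1}^{-1}}^2 \leq 1\) and no clipping is needed,
\[ \text{Regret}(T) \leq 2\beta_T \sum_{t=1}^{T} \|a_t\|_{A_{t-1}^{-1}} \leq 2\beta_T \sqrt{T \sum_{t=1}^{T} \|a_t\|_{A_{t-1}^{-1}}^2} \leq 2\beta_T \sqrt{2T\, d \log\!\left(1 + \frac{TL^2}{\lambda d}\right)}, \]
using Cauchy–Schwarz in the middle step and \(x \leq 2\log(1+x)\) on \([0,1]\) with the potential lemma in the last.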
Comparison
|  | LinUCB | OFUL |
|---|---|---|
| Action set | Discrete arms, per-arm models | Arbitrary \(\mathcal{A}_t \subseteq \mathbb{R}^d\) |
| Parameters | \(K\) per-arm vectors \(\theta_a^*\) | Single vector \(\theta^*\) |
| Design matrix | One per arm | One shared |
| Confidence set | Per-arm ellipsoid | Global ellipsoid |
| Information sharing | None across arms | Full (one \(\theta^*\)) |
When arms have disjoint feature supports, LinUCB and OFUL with a joint feature map are equivalent. When arms genuinely share a common \(\theta^*\), OFUL's shared design matrix is more sample-efficient: every observation refines the single estimate, rather than only the played arm's.
References
- Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A Contextual-Bandit Approach to Personalized News Article Recommendation. WWW 2010.
- Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved Algorithms for Linear Stochastic Bandits. NIPS 2011.
- Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.