Linear Bandits
Motivation
Multi-armed bandits treat each arm’s reward distribution as completely unrelated to the others. When the arms correspond to feature vectors in \(\mathbb{R}^d\) and rewards are linear in those features, the agent can transfer information from one arm to all others. Linear bandits formalize this: the action set may be infinite or continuous, the reward model is linear and parametric, and the exploration bonus has a natural geometric interpretation as an ellipsoidal confidence set around the unknown parameter.
This framework covers both the contextual bandit setting (discrete arms, per-arm feature vectors) and more general settings with arbitrary or continuous action sets.
Setup
At each round \(t = 1, 2, \ldots, T\):
- The agent observes an action set \(\mathcal{A}_t \subseteq \mathbb{R}^d\) (possibly the same set every round, possibly continuous).
- The agent selects \(a_t \in \mathcal{A}_t\).
- The environment returns \(r_t = a_t^\top \theta^* + \eta_t\), where \(\theta^* \in \mathbb{R}^d\) is an unknown parameter vector with \(\|\theta^*\| \leq S\), and \(\eta_t\) is \(\sigma\)-sub-Gaussian conditionally on the history \(a_1, \eta_1, \ldots, a_{t-1}, \eta_{t-1}, a_t\).
Assume actions are bounded: \(\|a\| \leq L\) for all \(a \in \mathcal{A}_t\). The regret is
\[ \text{Regret}(T) = \sum_{t=1}^{T} \max_{a \in \mathcal{A}_t} a^\top \theta^* - a_t^\top \theta^*. \]
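As a concrete illustration, here is a minimal simulation of this protocol. The fixed finite action set, the Gaussian noise, and the uniform-random placeholder policy are all illustrative assumptions, not part of the formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 5, 1000, 0.1
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)                   # ||theta*|| = S = 1
actions = rng.normal(size=(20, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)  # ||a|| = L = 1

best_value = (actions @ theta_star).max()
regret = 0.0
for t in range(T):
    a = actions[rng.integers(len(actions))]        # placeholder: pick uniformly
    r = a @ theta_star + rng.normal(scale=sigma)   # noisy linear reward
    regret += best_value - a @ theta_star          # accumulate pseudo-regret
```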
Ridge Regression and Confidence Sets
After \(t\) rounds, the regularized design matrix and ridge estimate are
\[ A_t = \lambda I + \sum_{s=1}^{t} a_s a_s^\top, \qquad \hat{\theta}_t = A_t^{-1} \sum_{s=1}^{t} r_s a_s, \]
where \(\lambda > 0\) is the regularization parameter. This is the minimizer of the regularized least-squares objective \(\sum_s (r_s - a_s^\top \theta)^2 + \lambda \|\theta\|^2\).
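A direct implementation of these two formulas takes a few lines of NumPy; this is a sketch with illustrative names, computing \(A_t\) and \(\hat{\theta}_t\) in batch from the played actions and observed rewards.

```python
import numpy as np

def ridge_estimate(played, rewards, lam=1.0):
    """Regularized design matrix A_t and ridge estimate theta_hat_t.

    `played` is a (t, d) array whose rows are the actions a_1, ..., a_t;
    `rewards` is the length-t vector of observed r_s.
    """
    played = np.asarray(played)
    rewards = np.asarray(rewards)
    d = played.shape[1]
    A_t = lam * np.eye(d) + played.T @ played              # lambda I + sum a_s a_s^T
    theta_hat = np.linalg.solve(A_t, played.T @ rewards)   # A_t^{-1} sum r_s a_s
    return A_t, theta_hat
```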
The ellipsoidal confidence set around \(\hat{\theta}_t\) is
\[ \mathcal{C}_t = \bigl\{ \theta \in \mathbb{R}^d : \|\hat{\theta}_t - \theta\|_{A_t} \leq \beta_t \bigr\}, \]
where \(\|\cdot\|_{A_t}\) denotes the norm \(\|v\|_{A_t} = \sqrt{v^\top A_t v}\) and
\[ \beta_t = \sigma \sqrt{2 \log\!\tfrac{1}{\delta} + \log\!\tfrac{\det A_t}{\det(\lambda I)}} + \sqrt{\lambda}\, S. \]
By a self-normalized martingale inequality (Abbasi-Yadkori et al. 2011), \(\theta^* \in \mathcal{C}_t\) for all \(t \geq 0\) simultaneously with probability at least \(1 - \delta\). This is the core confidence guarantee that drives both LinUCB and OFUL.
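The radius \(\beta_t\) is cheap to evaluate from the determinants. A sketch (argument names and default values are illustrative):

```python
import numpy as np

def confidence_radius(A_t, lam, sigma=1.0, S=1.0, delta=0.05):
    """Radius beta_t of the confidence ellipsoid C_t."""
    d = A_t.shape[0]
    _, logdet_A = np.linalg.slogdet(A_t)   # log det A_t (A_t is positive definite)
    logdet_reg = d * np.log(lam)           # log det(lambda I)
    return (
        sigma * np.sqrt(2 * np.log(1 / delta) + logdet_A - logdet_reg)
        + np.sqrt(lam) * S
    )
```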
LinUCB
LinUCB (Li et al. 2010) targets the contextual bandit setting with \(K\) discrete arms, where each arm \(a\) observes the shared context vector \(x_t \in \mathbb{R}^d\) but has its own parameter \(\theta_a^*\) (the "disjoint" model). It maintains one design matrix \(A_a\) per arm and selects
\[ a_t = \operatorname*{arg\,max}_{a \in \{1,\ldots,K\}} \left[ x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t} \right]. \]
The bonus \(\sqrt{x_t^\top A_a^{-1} x_t} = \|x_t\|_{A_a^{-1}}\) is the half-width of the confidence interval on \(x_t^\top \theta_a^*\). This implements optimism in the face of uncertainty: arms whose upper confidence bound is highest get pulled, whether because the estimate is high or the uncertainty is high.
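A minimal sketch of the selection step under the disjoint model, assuming the per-arm statistics \(A_a^{-1}\) and \(\hat{\theta}_a\) are maintained elsewhere (all names here are illustrative):

```python
import numpy as np

def linucb_select(x, A_inv_by_arm, theta_hat_by_arm, alpha):
    """Disjoint-model LinUCB step: one (A_a^{-1}, theta_hat_a) pair per arm."""
    ucbs = [
        x @ th + alpha * np.sqrt(x @ A_inv @ x)  # estimate + confidence half-width
        for A_inv, th in zip(A_inv_by_arm, theta_hat_by_arm)
    ]
    return int(np.argmax(ucbs))
```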
LinUCB corresponds to OFUL applied separately to each arm. Sharing a joint parameter via feature vectors \(\phi(x_t, a)\) and a single design matrix recovers the full OFUL generality.
OFUL
OFUL (Optimism in the Face of Uncertainty for Linear bandits; Abbasi-Yadkori et al. 2011) handles arbitrary action sets by maintaining a single design matrix over all played actions. At each round it solves the joint optimization
\[ a_t = \operatorname*{arg\,max}_{a \in \mathcal{A}_t} \max_{\theta \in \mathcal{C}_{t-1}} a^\top \theta. \]
The inner maximization over \(\theta \in \mathcal{C}_{t-1}\) has a closed form: because \(\mathcal{C}_{t-1}\) is an ellipsoid centered at \(\hat{\theta}_{t-1}\),
\[ \max_{\theta \in \mathcal{C}_{t-1}} a^\top \theta = a^\top \hat{\theta}_{t-1} + \beta_{t-1} \|a\|_{A_{t-1}^{-1}}. \]
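To see this, apply Cauchy–Schwarz in the \(A_{t-1}\)-inner product: for any \(\theta \in \mathcal{C}_{t-1}\),
\[ a^\top (\theta - \hat{\theta}_{t-1}) \leq \|a\|_{A_{t-1}^{-1}} \, \|\theta - \hat{\theta}_{t-1}\|_{A_{t-1}} \leq \beta_{t-1} \|a\|_{A_{t-1}^{-1}}, \]
with equality attained at \(\theta = \hat{\theta}_{t-1} + \beta_{t-1} A_{t-1}^{-1} a / \|a\|_{A_{t-1}^{-1}}\).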
So the selection rule is
\[ a_t = \operatorname*{arg\,max}_{a \in \mathcal{A}_t} \left[ a^\top \hat{\theta}_{t-1} + \beta_{t-1} \|a\|_{A_{t-1}^{-1}} \right]. \]
This is exactly the UCB principle in \(\mathbb{R}^d\): the bonus \(\beta_{t-1}\|a\|_{A_{t-1}^{-1}}\) is the maximum possible gap between the true and estimated payoff along direction \(a\), given the current confidence ellipsoid.
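For a finite (or discretized) action set the rule is a one-liner; a sketch with illustrative names follows. For genuinely continuous \(\mathcal{A}_t\) the argmax is in general a nonconvex program and needs problem-specific structure.

```python
import numpy as np

def oful_select(candidates, A_inv, theta_hat, beta):
    """One OFUL step over a finite candidate set (rows of `candidates`)."""
    # Quadratic form a^T A^{-1} a for each candidate row.
    bonus = np.sqrt(np.einsum("id,de,ie->i", candidates, A_inv, candidates))
    ucb = candidates @ theta_hat + beta * bonus   # optimistic value per action
    return candidates[int(np.argmax(ucb))]
```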
After selecting \(a_t\) and observing \(r_t\), update:
\[ A_t = A_{t-1} + a_t a_t^\top, \qquad \hat{\theta}_t = A_t^{-1}\!\left( A_{t-1}\hat{\theta}_{t-1} + r_t a_t \right). \]
The rank-1 update to \(A_t\) and the corresponding update to \(A_t^{-1}\) (via the Sherman–Morrison formula) cost \(O(d^2)\) per round; computing \(\|a\|_{A^{-1}}\) costs \(O(d^2)\) per candidate action.
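A sketch of that \(O(d^2)\) step, keeping the running vector \(b_t = \sum_s r_s a_s\) (an implementation convenience equivalent to the \(A_{t-1}\hat{\theta}_{t-1} + r_t a_t\) form above):

```python
import numpy as np

def rank1_step(A_inv, b, a, r):
    """O(d^2) update after playing action `a` and observing reward `r`."""
    Aa = A_inv @ a
    A_inv = A_inv - np.outer(Aa, Aa) / (1.0 + a @ Aa)  # Sherman-Morrison: (A + a a^T)^{-1}
    b = b + r * a                                      # running sum of r_s a_s
    theta_hat = A_inv @ b                              # = A_t^{-1} b_t
    return A_inv, b, theta_hat
```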
Regret Bound
With probability at least \(1 - \delta\), OFUL achieves
\[ \text{Regret}(T) \leq 2\beta_T \sqrt{2T\, d \log\!\left(1 + \frac{TL^2}{\lambda d}\right)}, \]
which is \(\tilde{O}(d\sqrt{T})\). The logarithmic factors come from the growth of \(\beta_T\) and the potential lemma: the total squared exploration bonus is bounded by \(O(d \log T)\) regardless of which actions are played.
The \(d\sqrt{T}\) rate is minimax optimal for linear bandits: no algorithm can achieve \(o(d\sqrt{T})\) regret in the worst case over \(\theta^*\) and the action sets.
The Potential Lemma
The key technical fact is that the design matrix cannot grow without limit in all directions simultaneously. Specifically, using the matrix determinant lemma \(\det A_t = \det A_{t-1} \cdot (1 + \|a_t\|_{A_{t-1}^{-1}}^2)\):
\[ \prod_{t=1}^{T} \bigl(1 + \|a_t\|_{A_{t-1}^{-1}}^2\bigr) = \frac{\det A_T}{\det A_0} \leq \left(1 + \frac{TL^2}{\lambda d}\right)^d, \]
where the last inequality uses \(\operatorname{tr}(A_T) \leq \lambda d + TL^2\) together with the AM–GM bound \(\det A_T \leq (\operatorname{tr}(A_T)/d)^d\). Taking logarithms:
\[ \sum_{t=1}^{T} \log\bigl(1 + \|a_t\|_{A_{t-1}^{-1}}^2\bigr) \leq d \log\!\left(1 + \frac{TL^2}{\lambda d}\right). \]
Since \(\log(1+x) \geq x/2\) for \(x \in [0,1]\), this bounds \(\sum_t \min(1, \|a_t\|_{A_{t-1}^{-1}}^2)\) and, by Cauchy–Schwarz, controls \(\sum_t \|a_t\|_{A_{t-1}^{-1}}\).
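Putting the pieces together recovers the stated regret bound. On the event \(\theta^* \in \mathcal{C}_{t-1}\), optimism gives the per-round bound \(\max_{a \in \mathcal{A}_t} a^\top \theta^* - a_t^\top \theta^* \leq 2\beta_{t-1} \|a_t\|_{A_{t-1}^{-1}}\). Assuming \(\lambda \geq L^2\), so that each \(\|a_t\|_{A_{t-1}^{-1}}^2 \leq 1\) and no clipping is needed,
\[ \text{Regret}(T) \leq 2\beta_T \sum_{t=1}^{T} \|a_t\|_{A_{t-1}^{-1}} \leq 2\beta_T \sqrt{T \sum_{t=1}^{T} \|a_t\|_{A_{t-1}^{-1}}^2} \leq 2\beta_T \sqrt{2T\, d \log\!\left(1 + \frac{TL^2}{\lambda d}\right)}, \]
using Cauchy–Schwarz in the middle step and \(x \leq 2\log(1+x)\) on \([0,1]\) with the potential lemma in the last.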
Comparison
|  | LinUCB | OFUL |
|---|---|---|
| Action set | Discrete arms, per-arm models | Arbitrary \(\mathcal{A}_t \subseteq \mathbb{R}^d\) |
| Parameters | \(K\) per-arm vectors \(\theta_a^*\) | Single vector \(\theta^*\) |
| Design matrix | One per arm | One shared |
| Confidence set | Per-arm ellipsoid | Global ellipsoid |
| Information sharing | None across arms | Full (one \(\theta^*\)) |
When arms have disjoint feature supports, LinUCB and OFUL with a joint feature map are equivalent. When arms genuinely share a common \(\theta^*\), OFUL's shared design matrix is more sample-efficient: every observation refines the single estimate, rather than only the played arm's.
References
- Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A Contextual-Bandit Approach to Personalized News Article Recommendation. WWW 2010.
- Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved Algorithms for Linear Stochastic Bandits. NIPS 2011.
- Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.