Contextual Bandits
Motivation
In the standard multi-armed bandit setting, the reward distribution of each arm is fixed and the agent observes no side information. In many real applications, however, additional information — a context — is available before each decision. A recommendation system knows the user’s browsing history before suggesting an article; a clinical system knows patient demographics before assigning a treatment.
The contextual bandit problem extends multi-armed bandits by allowing the agent to observe a context \(x_t\) at each round before selecting an action. The agent must learn a policy that maps contexts to actions, rather than a single best action.
Setup
At each round \(t = 1, 2, \ldots, T\):
- The environment reveals a context \(x_t \in \mathbb{R}^d\).
- The agent selects an action \(a_t \in \{1, \ldots, K\}\).
- The environment returns a reward \(r_t = f(x_t, a_t) + \eta_t\), where \(f\) is an unknown mean-reward function and \(\eta_t\) is noise.
The agent aims to minimize contextual regret:
\[ \text{Regret}(T) = \sum_{t=1}^{T} \left[ \max_{a \in \{1,\ldots,K\}} f(x_t, a) - f(x_t, a_t) \right]. \]
This is the cumulative shortfall versus the best action for each specific context \(x_t\). Contextual bandits sit between multi-armed bandits (no context) and full reinforcement learning (state transitions). The absence of transitions keeps each round independent given the context, which simplifies analysis considerably.
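To make this interaction protocol concrete, the following minimal Python sketch simulates \(T\) rounds against a synthetic linear environment and tracks the contextual regret of a uniformly random policy. The environment, the policy, and every name in it are illustrative assumptions, not part of the formal setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 3, 1000                      # context dimension, number of arms, horizon

# Synthetic ground truth: one unknown parameter vector per arm (for simulation only).
theta_star = rng.normal(size=(K, d))

regret = 0.0
for t in range(T):
    x_t = rng.normal(size=d)              # environment reveals a context
    a_t = rng.integers(K)                 # placeholder policy: choose an arm uniformly at random
    means = theta_star @ x_t              # mean reward of every arm for this context
    r_t = means[a_t] + rng.normal(scale=0.1)   # noisy reward for the chosen arm
    regret += means.max() - means[a_t]    # per-round contextual regret

print(f"cumulative regret of the uniform policy after {T} rounds: {regret:.1f}")
```

Any contextual bandit algorithm can be dropped into this loop in place of the uniform choice; the algorithms below do exactly that.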
Linear Contextual Bandits and LinUCB
The most studied special case assumes the mean reward is linear in the context:
\[ f(x, a) = x^\top \theta_a^*, \]
where \(\theta_a^* \in \mathbb{R}^d\) is an unknown parameter vector for each arm \(a\). The full algorithmic treatment, including the OFUL variant for arbitrary action sets, is in linear bandits.
The LinUCB algorithm (Li et al. 2010) maintains for each arm \(a\):
- Design matrix: \(A_a = \lambda I + \sum_{\tau : a_\tau = a} x_\tau x_\tau^\top\).
- Target vector: \(b_a = \sum_{\tau : a_\tau = a} r_\tau x_\tau\).
- Ridge estimate: \(\hat{\theta}_a = A_a^{-1} b_a\).
At each round, LinUCB selects the arm with the highest upper confidence bound:
\[ a_t = \operatorname*{arg\,max}_{a} \left[ x_t^\top \hat{\theta}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t} \right], \]
where \(\alpha > 0\) controls the confidence level. The bonus \(\sqrt{x_t^\top A_a^{-1} x_t}\) is proportional to the standard deviation of the predicted reward in the direction \(x_t\) — large when \(x_t\) points in a direction where the estimate is uncertain. This is the natural multi-dimensional generalization of the scalar UCB bonus.
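The following is a compact sketch of these per-arm updates and the UCB selection rule, assuming the disjoint linear model above; the class and variable names are mine, and \(\lambda\) and \(\alpha\) are the regularizer and confidence width.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge regression per arm, UCB action selection."""

    def __init__(self, d, K, alpha=1.0, lam=1.0):
        self.alpha = alpha
        self.A = [lam * np.eye(d) for _ in range(K)]   # design matrices A_a
        self.b = [np.zeros(d) for _ in range(K)]       # target vectors b_a

    def select(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta_hat = A_inv @ b_a                    # ridge estimate for this arm
            bonus = np.sqrt(x @ A_inv @ x)             # uncertainty in the direction x
            scores.append(x @ theta_hat + self.alpha * bonus)
        return int(np.argmax(scores))

    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)                    # rank-one update of the design matrix
        self.b[a] += r * x                             # accumulate reward-weighted context
```

Plugged into a loop like the one sketched earlier, `select` replaces the action choice and `update` is called once the reward arrives. Inverting \(A_a\) every round is fine for small \(d\); maintaining \(A_a^{-1}\) incrementally (e.g., via Sherman–Morrison) avoids the repeated inversion.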
Regret Bound
Under the linear model and sub-Gaussian noise, an optimized confidence-set construction (Abbasi-Yadkori et al. 2011) achieves:
\[ \text{Regret}(T) = \tilde{O}\!\left( d\sqrt{T} \right), \]
where \(\tilde{O}\) suppresses logarithmic factors. The \(d\sqrt{T}\) rate reflects the cost of estimating a \(d\)-dimensional parameter. Each time the agent pulls a suboptimal arm, the ellipsoidal confidence set around \(\theta_a^*\) must have been wide in the direction \(x_t\); each such pull shrinks the ellipsoid, bounding how often this can happen.
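In slightly more detail, a sketch of the standard argument, written with \(x_t\) denoting the feature vector of the chosen action, a single shared design matrix \(V_t = \lambda I + \sum_{\tau \le t} x_\tau x_\tau^\top\) (the joint-model view of the next subsection), and confidence radius \(\beta_T = \tilde{O}(\sqrt{d})\):
\[ \text{Regret}(T) \;\le\; 2\beta_T \sum_{t=1}^{T} \min\!\left(1, \|x_t\|_{V_{t-1}^{-1}}\right) \;\le\; 2\beta_T \sqrt{T \sum_{t=1}^{T} \min\!\left(1, \|x_t\|_{V_{t-1}^{-1}}^2\right)} \;=\; \tilde{O}\!\left(d\sqrt{T}\right), \]
where \(\|x\|_{V^{-1}} = \sqrt{x^\top V^{-1} x}\), the second step is Cauchy–Schwarz, and the elliptic potential lemma bounds the final sum by \(O(d \log T)\).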
Joint Linear Model
An alternative formulation uses a single parameter vector \(\theta^*\) and feature vectors \(\phi(x, a) \in \mathbb{R}^d\) encoding both context and action:
\[ f(x, a) = \phi(x, a)^\top \theta^*. \]
This allows arms to share statistical strength and is necessary when the number of arms is large or the action set is continuous. The LinUCB update carries over directly, with \(x\) replaced by \(\phi(x, a)\) and a single design matrix shared across all arms.
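One simple instance of \(\phi\), assuming finitely many arms, is the block (or "disjoint") encoding sketched below: the joint model then reproduces the per-arm model exactly, while richer shared features let information flow across arms. The function name is illustrative.

```python
import numpy as np

def phi_disjoint(x, a, K):
    """Place the context in the block belonging to arm a (illustrative encoding).
    With this phi, the joint linear model coincides with the per-arm model."""
    d = x.shape[0]
    feat = np.zeros(K * d)
    feat[a * d:(a + 1) * d] = x
    return feat

# Example: d = 3, K = 4 gives a 12-dimensional joint feature vector.
x = np.array([0.2, -1.0, 0.5])
print(phi_disjoint(x, a=2, K=4))
```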
Thompson Sampling for Contextual Bandits
Rather than maintaining a confidence set, Thompson sampling (Agrawal and Goyal 2013) samples \(\tilde{\theta}_a\) from the posterior of each \(\theta_a^*\) and pulls \(a_t = \arg\max_a x_t^\top \tilde{\theta}_a\). Under a Gaussian prior the posterior is Gaussian (conjugate to the linear-Gaussian likelihood), so sampling is cheap. Thompson sampling achieves the same \(\tilde{O}(d\sqrt{T})\) regret rate and often performs better in practice.
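A minimal sketch of per-arm linear Thompson sampling under the Gaussian conjugate updates described above; the posterior-scale hyperparameter \(v\) and all names are assumptions of this sketch rather than a prescribed implementation.

```python
import numpy as np

class LinearThompsonSampling:
    """Per-arm linear Thompson sampling with Gaussian posteriors (sketch)."""

    def __init__(self, d, K, v=1.0, lam=1.0, seed=0):
        self.v = v                                      # posterior scale (hyperparameter)
        self.B = [lam * np.eye(d) for _ in range(K)]    # per-arm precision-like matrices
        self.f = [np.zeros(d) for _ in range(K)]        # reward-weighted context sums
        self.rng = np.random.default_rng(seed)

    def select(self, x):
        scores = []
        for B_a, f_a in zip(self.B, self.f):
            B_inv = np.linalg.inv(B_a)
            mu = B_inv @ f_a                            # posterior mean of theta_a
            theta_tilde = self.rng.multivariate_normal(mu, self.v ** 2 * B_inv)
            scores.append(x @ theta_tilde)              # sampled mean reward for this arm
        return int(np.argmax(scores))

    def update(self, a, x, r):
        self.B[a] += np.outer(x, x)
        self.f[a] += r * x
```

The `select`/`update` interface matches the LinUCB sketch above, so the two can be swapped in the same simulation loop.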
Relationship to Reinforcement Learning
Contextual bandits are a special case of reinforcement learning where the state is the context and the state transitions are trivial — each round begins from a fresh, independently drawn context. The absence of transitions eliminates credit assignment over time, making regret analysis considerably simpler than the full RL setting.
References
- Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. WWW.
- Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. NeurIPS.
- Agrawal, S., and Goyal, N. (2013). Thompson sampling for contextual bandits with linear payoffs. ICML.
- Lattimore, T., and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.