Date: Oct 2, 2020

Multi-armed bandit (MAB) experiments have recently received significant attention in data-centric enterprises, given their promise of reducing the opportunity cost of experimentation. In this talk we consider the well-known stochastic linear bandit problem, which includes as special cases standard MAB experiments and their personalized (contextual) counterparts. In this setting, a decision-maker sequentially chooses among a given set of actions in $\mathbb{R}^d$, observes their noisy rewards, and aims to maximize her cumulative expected reward over a horizon of length $T$.
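The objective above is often stated equivalently as minimizing cumulative regret against the best fixed action (a standard reformulation, not spelled out in the abstract): with an unknown parameter $\theta^* \in \mathbb{R}^d$, expected reward $\langle x, \theta^* \rangle$ for action $x$, action set $\mathcal{X}$, and $x_t$ the action chosen at round $t$,

$$
R_T \;=\; \sum_{t=1}^{T} \Big( \max_{x \in \mathcal{X}} \langle x, \theta^* \rangle \;-\; \langle x_t, \theta^* \rangle \Big),
$$

so maximizing cumulative expected reward over the horizon is the same as keeping $R_T$ small.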

A well-known algorithm for this problem that is also commonly used in practice is Linear Thompson Sampling (LinTS). Russo and Van Roy (2014) showed that LinTS achieves the best possible Bayesian regret up to logarithmic factors. In the frequentist setting, however, there is a factor-$\sqrt{d}$ gap between the minimax lower bound and the best provable upper bound for LinTS. This upper bound, due to Agrawal and Goyal (2013) and Abeille and Lazaric (2017), requires the posterior variance to be inflated by a factor of $d$ in each round of LinTS. In this talk we first show that the bound of Agrawal and Goyal (2013) and Abeille and Lazaric (2017) is in fact the best achievable. In the second half of the talk, we introduce a new family of algorithms with empirical performance comparable to LinTS, while enjoying the same theoretical guarantees as the optimism-in-the-face-of-uncertainty linear bandit (OFUL) algorithm.
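To make the variance-inflation step concrete, here is a minimal sketch of LinTS with NumPy. This is an illustrative implementation, not the speakers' code: the function name, the defaults, and the simulation setup are all assumptions. The one point taken from the abstract is the `inflation` parameter, which scales the posterior covariance by a factor on the order of $d$, as the analyses of Agrawal and Goyal (2013) and Abeille and Lazaric (2017) require.

```python
import numpy as np

def lin_ts(actions, theta_star, T, noise_sd=0.1, lam=1.0, inflation=None, rng=None):
    """One run of Linear Thompson Sampling (illustrative sketch).

    actions    : (K, d) array of candidate actions in R^d.
    theta_star : (d,) true parameter (used only to simulate rewards).
    inflation  : multiplier on the posterior covariance; the frequentist
                 analyses require it to be on the order of d.
    """
    rng = np.random.default_rng(rng)
    d = actions.shape[1]
    if inflation is None:
        inflation = float(d)          # inflate the posterior variance by a factor of d
    V = lam * np.eye(d)               # regularized design matrix  lam*I + sum x_t x_t^T
    b = np.zeros(d)                   # running sum of reward_t * x_t
    rewards = []
    for _ in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b         # ridge-regression estimate of theta_star
        # sample from the (inflated) posterior, then act greedily on the sample
        theta_tilde = rng.multivariate_normal(theta_hat, inflation * V_inv)
        x = actions[np.argmax(actions @ theta_tilde)]
        r = x @ theta_star + noise_sd * rng.standard_normal()
        V += np.outer(x, x)           # update sufficient statistics
        b += r * x
        rewards.append(r)
    return np.array(rewards)
```

In a small simulation (say five actions in $\mathbb{R}^2$), the per-round reward climbs toward the best action's value as the posterior concentrates; setting `inflation` well below $d$ trades the provable guarantee for the sharper empirical behavior that LinTS is known for in practice.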