University of Massachusetts Amhert Machine Learning and Friends Lunch

Mar 4, 2021

Thompson Sampling (TS) is an elegant Bayesian heuristic to address the exploration-exploitation trade-off in bandit problems. Despite its Bayesian nature, TS is known to achieve a minimax near-optimal regret bound in the multi-armed bandit setting. In the linear bandit problem, however, optimal bounds are only proved for its Bayesian regret. Agrawal and Goyal (2013b); Abeille et al. (2017) study the worst-case regret of a variant of TS which samples from a distribution whose variance is scaled by $\widetilde{O}(d)$ and prove a $\widetilde{O}(d\sqrt{dT})$ minimax regret bound which is far from optimal by a factor of $\widetilde{O}(\sqrt{d})$. In this talk, we show that this inflation in the variance of the posterior distribution is inevitable in general. Moreover, we introduce some conditions under which the inflation can be reduced to $\widetilde{O}(1)$; thereby, improving its regret bound.

This is joint work with Mohsen Bayati.

Nima Hamidi
Nima Hamidi
Ph.D. Student in Statistics

I’m a Ph.D. student at Stanford and my area of research includes the theory of bandit problems.