In this note, we introduce a randomized version of the well-known elliptical potential lemma that is widely used in the analysis of algorithms in sequential learning and decision-making problems such as stochastic linear bandits.
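For reference, a standard deterministic form of the elliptical potential lemma, as commonly stated in the literature (the notation $V_t$ and the regularizer $\lambda$ are supplied here for illustration): for any sequence $x_1, \dots, x_T \in \mathbb{R}^d$ with $\|x_t\|_2 \le 1$ and $V_t = \lambda I + \sum_{s=1}^{t} x_s x_s^\top$,

$$\sum_{t=1}^{T} \min\bigl\{1, \|x_t\|_{V_{t-1}^{-1}}^2\bigr\} \;\le\; 2d \log\Bigl(1 + \frac{T}{\lambda d}\Bigr),$$

where $\|x\|_{A}^2 = x^\top A x$.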
In this paper, we consider the worst-case regret of Linear Thompson Sampling (LinTS) for the linear bandit problem. Russo and Van Roy (2014) show that the Bayesian regret of LinTS is bounded above by $\widetilde{\mathcal{O}}(d\sqrt{T})$ where $T$ is the time horizon and $d$ is the number of parameters.
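For concreteness, one common formulation of the LinTS step, given as a sketch rather than as this paper's exact algorithm (the symbols $\hat{\theta}_t$, $V_t$, $\beta_t$, and $\mathcal{X}$ are our illustrative notation): with ridge estimate $\hat{\theta}_t$ and Gram matrix $V_t$, at round $t$ the algorithm samples a parameter from an inflated Gaussian posterior and acts greedily with respect to it,

$$\tilde{\theta}_t \sim \mathcal{N}\bigl(\hat{\theta}_t,\; \beta_t^2 V_t^{-1}\bigr), \qquad x_t \in \arg\max_{x \in \mathcal{X}} \; x^\top \tilde{\theta}_t,$$

where $\mathcal{X} \subset \mathbb{R}^d$ is the action set and $\beta_t$ an inflation parameter.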
We study the structure of regret-minimizing policies in the many-armed Bayesian multi-armed bandit problem: in particular, with $k$ the number of arms and $T$ the time horizon, we consider the case where $k \geq \sqrt{T}$.
In this paper, we study the well-known stochastic linear bandit problem where a decision-maker sequentially chooses among a set of given actions in $\mathbb{R}^d$, observes the corresponding noisy linear reward, and aims to maximize her cumulative expected reward over a horizon of length $T$.
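Under this protocol, the cumulative expected regret over the horizon is the standard benchmark (the symbols $\theta^\star$, $\mathcal{X}$, $x_t$, and $\eta_t$ are our illustrative notation): the reward observed at round $t$ is $y_t = x_t^\top \theta^\star + \eta_t$ with unknown parameter $\theta^\star \in \mathbb{R}^d$ and zero-mean noise $\eta_t$, and the regret is

$$R(T) \;=\; \sum_{t=1}^{T} \Bigl( \max_{x \in \mathcal{X}} x^\top \theta^\star \;-\; \mathbb{E}\bigl[x_t^\top \theta^\star\bigr] \Bigr),$$

where $\mathcal{X}$ is the action set and $x_t$ the action chosen at round $t$.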
We consider the $k$-armed stochastic contextual bandit problem with $d$-dimensional features, when both $k$ and $d$ can be large. To the best of our knowledge, all existing algorithms for this problem have regret bounds that scale as polynomials of degree at least two in $k$ and $d$.
In this paper, we study trace regression, in which a matrix of parameters $\mathbf{B}^\star$ is estimated via a convex relaxation of rank-penalized regression or via non-convex optimization. It is known that these estimators satisfy near-optimal error bounds under assumptions on the rank, coherence, or spikiness of $\mathbf{B}^\star$.
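As a point of reference, trace regression is usually written as $y_i = \operatorname{tr}(\mathbf{X}_i^\top \mathbf{B}^\star) + \varepsilon_i$, and a common convex relaxation of rank penalization is nuclear-norm penalized least squares (a sketch of the standard formulation, not necessarily this paper's exact estimator; $n$, $\lambda$, and $\widehat{\mathbf{B}}$ are our notation):

$$\widehat{\mathbf{B}} \in \arg\min_{\mathbf{B}} \; \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \operatorname{tr}(\mathbf{X}_i^\top \mathbf{B}) \bigr)^2 \;+\; \lambda \|\mathbf{B}\|_*,$$

where $\|\mathbf{B}\|_*$ denotes the nuclear norm (the sum of singular values) and $\lambda > 0$ is a regularization parameter.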