Minimax Policies for Adversarial and Stochastic Bandits

Jean-Yves Audibert
Imagine, Université Paris Est & Willow, CNRS/ENS/INRIA, Paris, France

Sébastien Bubeck
SequeL Project, INRIA Lille, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Abstract

We fill in a long open gap in the characterization of the minimax rate for the multi-armed bandit problem. Concretely, we remove an extraneous logarithmic factor in the previously known upper bound and propose a new family of randomized algorithms based on an implicit normalization, as well as a new analysis. We also consider the stochastic case, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.

1 Introduction

In the multi-armed bandit problem, at each stage, an agent (or decision maker) chooses one action (or arm), and receives a reward from it. The agent aims at maximizing his rewards. Since he does not know the process generating the rewards, he needs to explore (try) the different actions and yet exploit (concentrate his draws on) the seemingly most rewarding arms.
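To make the interaction protocol concrete, the following is a minimal, purely illustrative sketch (not taken from this paper) of a stochastic bandit loop driven by the standard UCB1 index of Auer et al. (2002), i.e. empirical mean plus sqrt(2 ln t / T_i). The arm means and horizon n below are hypothetical choices for the example, and the paper's own modification of UCB1 is not reproduced here.

    # Illustrative sketch, assuming Bernoulli arms; not the paper's modified policy.
    import math
    import random

    def ucb1(means, n):
        """Play n rounds of a stochastic bandit with Bernoulli arms given by `means`."""
        K = len(means)
        counts = [0] * K      # number of times each arm has been played
        sums = [0.0] * K      # cumulative reward collected on each arm

        for t in range(1, n + 1):
            if t <= K:
                arm = t - 1   # play each arm once to initialize the indices
            else:
                # UCB1 index: empirical mean + sqrt(2 ln t / T_i(t))
                arm = max(range(K),
                          key=lambda i: sums[i] / counts[i]
                                        + math.sqrt(2.0 * math.log(t) / counts[i]))
            reward = 1.0 if random.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward

        return counts, sums

    if __name__ == "__main__":
        counts, _ = ucb1(means=[0.4, 0.5, 0.6], n=10000)
        print("plays per arm:", counts)

The sketch only illustrates the explore/exploit trade-off described above: the confidence term forces occasional plays of seemingly inferior arms, while the empirical mean concentrates most draws on the best one.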