Minimax policies for adversarial and stochastic bandits

icon

10

pages

icon

English

icon

Documents

Écrit par

Publié par

Lire un extrait
Lire un extrait

Obtenez un accès à la bibliothèque pour le consulter en ligne En savoir plus

Découvre YouScribe et accède à tout notre catalogue !

Je m'inscris

Découvre YouScribe et accède à tout notre catalogue !

Je m'inscris
icon

10

pages

icon

English

icon

Documents

Lire un extrait
Lire un extrait

Obtenez un accès à la bibliothèque pour le consulter en ligne En savoir plus

Minimax policies for adversarial and stochastic bandits Jean-Yves Audibert Imagine, Universite Paris Est & Willow, CNRS/ENS/INRIA, Paris, France Sebastien Bubeck SequeL Project, INRIA Lille 40 avenue Halley, 59650 Villeneuve d'Ascq, France Abstract We fill in a long open gap in the characterization of the minimax rate for the multi-armed bandit prob- lem. Concretely, we remove an extraneous loga- rithmic factor in the previously known upper bound and propose a new family of randomized algorithms based on an implicit normalization, as well as a new analysis. We also consider the stochastic case, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002) achieves the distribution-free optimal rate while still having a distribution-dependent rate log- arithmic in the number of plays. 1 Introduction In the multi-armed bandit problem, at each stage, an agent (or decision maker) chooses one action (or arm), and re- ceives a reward from it. The agent aims at maximizing his rewards. Since he does not know the process generating the rewards, he needs to explore (try) the different actions and yet, exploit (concentrate its draws on) the seemingly most rewarding arms.

  • distribution over

  • reward vectors

  • nk

  • bandit problem

  • vectors g1

  • stochastic multi-armed

  • nk logk

  • armed bandit


Voir icon arrow

Publié par

Langue

English

Minimax policies for adversarial and stochastic bandits
Jean-Yves Audibert Imagine, Universite Paris Est & Willow, CNRS/ENS/INRIA, Paris, France audibert@certis.enpc.fr
Abstract
We fill in a long open gap in the characterization of the minimax rate for the multi-armed bandit prob-lem. Concretely, we remove an extraneous loga-rithmic factor in the previously known upper bound and propose a new family of randomized algorithms based on an implicit normalization, as well as a new analysis. We also consider the stochastic case, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002) achieves the distribution-free optimal rate while still having a distribution-dependent rate log-arithmic in the number of plays.
1 Introduction In the multi-armed bandit problem, at each stage, an agent (or decision maker) chooses one action (or arm), and re-ceives a reward from it. The agent aims at maximizing his rewards. Since he does not know the process generating the rewards, he needs to explore (try) the different actions and yet, exploit (concentrate its draws on) the seemingly most rewarding arms. The multi-armed bandit task has been first considered by Robbins (1952) and was originally motivated by a simplified view of clinical trials (in which an action con-sists in choosing a treatment and the reward depends on its efficiency on a patient). To set the notation, letK2be the number of actions (or arms) andnKbe the time horizon. In both stochastic and adversarialK-armed bandit problems, the game between the agent and the environment goes as follows: At each time stept∈ {1, . . . , n}, (i) the agent chooses a probability dis-tributionpton a finite set{1, . . . , K}, (ii) the environment K chooses a reward vectorgt= (g1,t, . . . , gK,t)[0,1], and simultaneously (independently), the agent draws the action (or arm)Itaccording to the distributionpt, (iii) the agent only gets to see his own rewardgIt,t. The goal of the deci-P n imize his cumulative reward sion maker is to maxt=1gIt,t. The stochasticK-armed bandit problem is parameter-ized by aK-tuple of probability distributions(ν1, . . . , νK) on[0,1]. In this model, the components of the reward vector are i.i.d. realizations of respectivelyν1, . . . , νK. Besides the reward vectors at different times are independent. The adversarialK-armed bandit problem is more gen-eral: The environment is much less constrained as it may
SebastienBubeck SequeL Project, INRIA Lille 40 avenue Halley, 59650 Villeneuve d’Ascq, France sebastien.bubeck@inria.fr
choose a reward vectorgtas a function of the past deci-sionsI1, . . . , It1(and possibly of an independent random-ization). A simple, but interesting, adversarial environment is obtained by considering deterministic reward vectors: in this case, known as the oblivious deterministic opponent, the environment is just parameterized by thenKreal numbers (gi,t)that in the adversarial environment, past gains. Note have no reason to be representative of future ones in contrast to the stochastic setting in which confidence bounds on the mean reward of the arms can be deduced from the rewards obtained so far. A policy is a strategy for choosing the drawing proba-bility distributionptbased on the history formed by the past plays and the associated rewards. So it is a function that maps any history to a probability distribution on{1, . . . , K}. We define the regret of a policy with respect to the best constant decision as n X   1 Rn= maxEgi,tgIt,t. i=1,...,K t=1 To compare to the best constant decision is a reasonable tar-get since it is well-known that (i) there exist randomized poli-cies ensuring thatRn/ntends to zero asngoes to infinity, (ii) this convergence property would not hold if the maxi-mum and the sum would be inverted in the definition ofRn. The minimax rate of the expected regret isinf supRn, where the infimum is taken over all policies and the supre-mum over allK-tuples of probability distributions on[0,1] for the stochastic case and over all adversarial environments with rewards in[0,1]Auer et al.for the adversarial case. (1995) proved an upper bound on this quantity in the adver-sarial case (and thus also in the stochastic case) and a lower bound in the stochastic case (and thus also in the adversarial case). We recall here the results of Auer et al. (1995, 2003). Theorem 1The EXP3 policy described in Fig.1 satisfies p supRn2.7nKlogK, 1 In the case of an oblivious deterministic opponent this regret is P  n equal to the strong regregi,v-t:Emaxi t=1tgIt,t. For obli ious stochastic opponents generating independently thenreward vectorsg1, . . . , gn, one can prove that the difference between the p regret and the strong regret is at most(nlogK)/2in both. So, cases, Theorem 4 can be extended to the strong regret. Besides, for the same reason, Theorem 5 holds for the strong regret provided p that(nlogK)/2is added to the bound.
Voir icon more
Alternate Text