Minimax Policies for Adversarial and Stochastic Bandits

Jean-Yves Audibert
Imagine, Université Paris Est & Willow, CNRS/ENS/INRIA, Paris, France

Sébastien Bubeck
SequeL Project, INRIA Lille, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Abstract

We fill in a long open gap in the characterization of the minimax rate for the multi-armed bandit problem. Concretely, we remove an extraneous logarithmic factor in the previously known upper bound and propose a new family of randomized algorithms based on an implicit normalization, as well as a new analysis. We also consider the stochastic case, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.

1 Introduction

In the multi-armed bandit problem, at each stage, an agent (or decision maker) chooses one action (or arm), and receives a reward from it. The agent aims at maximizing his rewards. Since he does not know the process generating the rewards, he needs to explore (try) the different actions and yet exploit (concentrate his draws on) the seemingly most rewarding arms.
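To make the interaction protocol concrete, the following is a minimal, purely illustrative sketch (not taken from this paper) of a stochastic bandit loop driven by the standard UCB1 index of Auer et al. (2002), i.e. empirical mean plus sqrt(2 ln t / T_i). The arm means and horizon n below are hypothetical choices for the example, and the paper's own modification of UCB1 is not reproduced here.

    # Illustrative sketch, assuming Bernoulli arms; not the paper's modified policy.
    import math
    import random

    def ucb1(means, n):
        """Play n rounds of a stochastic bandit with Bernoulli arms given by `means`."""
        K = len(means)
        counts = [0] * K      # number of times each arm has been played
        sums = [0.0] * K      # cumulative reward collected on each arm

        for t in range(1, n + 1):
            if t <= K:
                arm = t - 1   # play each arm once to initialize the indices
            else:
                # UCB1 index: empirical mean + sqrt(2 ln t / T_i(t))
                arm = max(range(K),
                          key=lambda i: sums[i] / counts[i]
                                        + math.sqrt(2.0 * math.log(t) / counts[i]))
            reward = 1.0 if random.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward

        return counts, sums

    if __name__ == "__main__":
        counts, _ = ucb1(means=[0.4, 0.5, 0.6], n=10000)
        print("plays per arm:", counts)

The sketch only illustrates the explore/exploit trade-off described above: the confidence term forces occasional plays of seemingly inferior arms, while the empirical mean concentrates most draws on the best one.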