Authors
Sébastien Bubeck, Aleksandrs Slivkins
Publication date
2012/6/16
Conference
Conference on Learning Theory
Pages
42.1-42.23
Description
We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the O(√n) worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB). Prior work on MAB treats them separately and does not attempt to jointly optimize for both. This result falls into the general agenda of designing algorithms that combine optimal worst-case performance with improved guarantees for “nice” problem instances.
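For context, the description contrasts the two baselines that SAO interpolates between. The following is a minimal Python sketch of those baselines, not of SAO itself and not taken from the paper; class names and the exploration parameter gamma=0.1 are illustrative assumptions.

```python
import math
import random

class UCB1:
    """UCB1 (Auer et al., 2002a): (poly)logarithmic regret for stochastic rewards."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms      # pulls per arm
        self.means = [0.0] * n_arms     # empirical mean reward per arm
        self.t = 0                      # total rounds played

    def select(self):
        self.t += 1
        for i, c in enumerate(self.counts):
            if c == 0:                  # play each arm once to initialize
                return i
        # pick the arm with the highest upper confidence bound:
        # empirical mean plus an exploration bonus that shrinks as the arm is pulled
        return max(range(len(self.counts)),
                   key=lambda i: self.means[i]
                   + math.sqrt(2 * math.log(self.t) / self.counts[i]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class Exp3:
    """Exp3 (Auer et al., 2002b): O(sqrt(n)) worst-case regret for adversarial rewards."""
    def __init__(self, n_arms, gamma=0.1):  # gamma is an assumed mixing parameter
        self.gamma = gamma
        self.weights = [1.0] * n_arms

    def _probs(self):
        total = sum(self.weights)
        k = len(self.weights)
        # mix the exponential-weights distribution with uniform exploration
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def select(self):
        return random.choices(range(len(self.weights)), weights=self._probs())[0]

    def update(self, arm, reward):
        # importance-weighted estimate of the observed reward
        x_hat = reward / self._probs()[arm]
        self.weights[arm] *= math.exp(self.gamma * x_hat / len(self.weights))
```

UCB1's deterministic confidence bounds exploit i.i.d. reward structure, while Exp3's randomized play is robust to an adversary; SAO's contribution, per the description, is achieving (essentially) both guarantees with a single algorithm.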
Total citations
[Citations-per-year histogram, 2012–2024]