Authors
Aurélien Garivier, Eric Moulines
Publication date
2011/10/5
Book
International Conference on Algorithmic Learning Theory
Pages
174-188
Publisher
Springer Berlin Heidelberg
Description
Many problems, such as cognitive radio, parameter control of a scanning tunnelling microscope, or internet advertisement, can be modelled as non-stationary bandit problems where the distributions of rewards change abruptly at unknown time instants. In this paper, we analyze two algorithms designed for this setting: discounted UCB (D-UCB) and sliding-window UCB (SW-UCB). We establish an upper bound on the expected regret by upper-bounding the expected number of times suboptimal arms are played. The proof relies on a Hoeffding-type inequality for self-normalized deviations with a random number of summands. We establish a lower bound on the regret in the presence of abrupt changes in the arms' reward distributions, and show that both discounted UCB and sliding-window UCB match this lower bound up to a logarithmic factor. Numerical simulations show that …
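The sliding-window idea is simple to sketch: compute each arm's empirical mean and exploration bonus using only the last τ plays, so statistics from before a change point age out of the window. Below is a minimal Python sketch, assuming Bernoulli rewards; the window length (500) and exploration constant ξ (0.6) are illustrative choices, not the paper's tuned values, and the index mirrors the paper's form (windowed mean plus a padding of order sqrt(ξ log(min(t, τ)) / N)).

import math
import random
from collections import deque

def sw_ucb(arms, horizon, window=500, xi=0.6):
    """Sliding-Window UCB sketch: all statistics use only the last `window` plays."""
    history = deque(maxlen=window)          # (arm, reward) pairs inside the window
    total_reward = 0.0
    for t in range(1, horizon + 1):
        counts = [0] * len(arms)
        sums = [0.0] * len(arms)
        for arm, r in history:
            counts[arm] += 1
            sums[arm] += r
        untried = [i for i, c in enumerate(counts) if c == 0]
        if untried:
            # Play any arm with no observations inside the current window.
            choice = untried[0]
        else:
            # Windowed empirical mean plus exploration padding.
            def index(i):
                pad = math.sqrt(xi * math.log(min(t, window)) / counts[i])
                return sums[i] / counts[i] + pad
            choice = max(range(len(arms)), key=index)
        reward = arms[choice](t)            # arms[i] maps the round t to a random reward
        history.append((choice, reward))
        total_reward += reward
    return total_reward

# Toy run with one abrupt change: arm 1 is best until round 2500, then arm 0 is.
random.seed(0)
arms = [
    lambda t: float(random.random() < (0.3 if t <= 2500 else 0.9)),
    lambda t: float(random.random() < 0.5),
]
print(sw_ucb(arms, horizon=5000))

Discounted UCB follows the same template but, instead of a hard window, weights past rewards geometrically by a discount factor, so old observations fade out gradually rather than dropping all at once.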
Total citations
[Citations-per-year chart, 2012–2024]