Authors: John Langford, Alexander Strehl, Jennifer Wortman
Publication date: 2008/7/5
Conference: Proceedings of the 25th International Conference on Machine Learning (ICML)
Pages: 528–535
Publisher: ACM
Description: We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose and prove the correctness of a principled method for policy evaluation which works when this is not the case, even when the exploration policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independent of side information, an assumption that is typically violated by commercial systems, we show how clever uses of the theory provide non-trivial and realistic applications. We also provide an empirical …
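The core idea described above, evaluating a target policy from logs collected by a different, context-independent exploration policy, can be sketched with an importance-weighted estimator that reweights logged rewards by the empirical frequency of each explored action. The function name `offline_evaluate` and the use of empirical action frequencies as propensities are illustrative assumptions for this sketch, not the paper's exact estimator.

```python
def offline_evaluate(logged, target_policy, n_actions):
    """Estimate the value of target_policy from logged data.

    logged: list of (context, action, reward) triples collected by an
    exploration policy that chose actions independent of context.
    Illustrative sketch: reweights rewards by empirical action
    frequencies, which requires each action to be explored
    sufficiently often.
    """
    T = len(logged)
    # Empirical frequency with which each action was explored.
    counts = [0] * n_actions
    for _, a, _ in logged:
        counts[a] += 1
    freq = [c / T for c in counts]

    # Importance-weight rewards on log entries where the target
    # policy would have chosen the logged action.
    total = 0.0
    for x, a, r in logged:
        if target_policy(x) == a:
            total += r / freq[a]
    return total / T
```

For example, with logs from a uniform exploration over two actions and a target policy that matches the context, the estimate recovers the target policy's expected reward. The estimator fails exactly in the regime the abstract warns about: if the exploration policy's action choice depends on the context, the empirical frequencies no longer reflect per-context propensities.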