Authors
Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert E. Schapire
Publication date
2014
Conference
Thirty-First International Conference on Machine Learning
Description
We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of $K$ actions in response to the observed context, and observes the reward only for that action. Our method assumes access to an oracle for solving fully supervised cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only $\tilde{O}(\sqrt{KT})$ oracle calls across all $T$ rounds. By doing so, we obtain the most practical contextual bandit learning algorithm amongst approaches that work for general policy classes. We conduct a proof-of-concept experiment which demonstrates the excellent computational and statistical performance of (an online variant of) our algorithm relative to several strong baselines.
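The interaction protocol described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm and involves no cost-sensitive classification oracle; it only simulates the feedback model, in which the learner sees a context, picks one of K actions, and observes the reward for that action alone. The environment (a context that directly names the rewarding action), the names `run_protocol`, `uniform_policy`, and `informed_policy`, and the values of `K` and `T` are all assumptions made for illustration.

```python
import random

K = 4    # number of actions (hypothetical toy value)
T = 100  # number of rounds (hypothetical toy value)


def run_protocol(policy, num_actions=K, num_rounds=T, seed=0):
    """Simulate bandit feedback; return a list of (context, action, reward)."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_rounds):
        # Toy environment (assumption): the context is the index of the
        # single action that pays reward 1 in this round.
        context = rng.randrange(num_actions)
        # Full reward vector; the learner never sees it in its entirety.
        rewards = [1.0 if a == context else 0.0 for a in range(num_actions)]
        action = policy(context, rng)
        # Bandit feedback: only the chosen action's reward is revealed.
        history.append((context, action, rewards[action]))
    return history


def uniform_policy(context, rng):
    """Ignore the context and pick an action uniformly at random."""
    return rng.randrange(K)


def informed_policy(context, rng):
    """Use the context; in this toy environment it names the best action."""
    return context


uniform_total = sum(r for _, _, r in run_protocol(uniform_policy))
informed_total = sum(r for _, _, r in run_protocol(informed_policy))
```

In this toy setting `informed_total` equals `T`, because the context reveals the rewarding action, while `uniform_total` depends on the random draws. A real learner would have to discover the mapping from contexts to good actions using only the logged `(context, action, reward)` triples.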
Total citations
[Chart: citations per year, 2014 to 2024]