Authors
Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos
Publication date
2017/7/17
Conference
International conference on machine learning
Pages
263-272
Publisher
PMLR
Description
We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new results is that when $T \geq H^3 S^3 A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use a careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).
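The abstract describes an optimistic modification of value iteration whose exploration bonuses are Bernstein-style terms built from the empirical variance of the estimated next-state values. The sketch below is a minimal illustration of that idea, assuming a tabular finite-horizon MDP and NumPy; the function names, bonus constants, and clipping choices are assumptions for illustration, not the authors' implementation.

import numpy as np

def bernstein_bonus(var_next_V, n_visits, H, delta=0.05):
    # Variance-dependent (Bernstein) term plus a lower-order 1/n term;
    # the constants here are placeholders, not the paper's exact values.
    log_term = np.log(1.0 / delta)
    n = np.maximum(n_visits, 1)
    return np.sqrt(2.0 * var_next_V * log_term / n) + 7.0 * H * log_term / (3.0 * n)

def optimistic_value_iteration(P_hat, R_hat, counts, H, delta=0.05):
    # Backward induction on the empirical MDP, with optimism added via bonuses.
    # P_hat: (S, A, S) empirical transition probabilities
    # R_hat: (S, A) empirical mean rewards in [0, 1]
    # counts: (S, A) visit counts used to scale the bonuses
    S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))           # V[H] = 0 (terminal values)
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        next_V = V[h + 1]                                       # (S,)
        mean_next = P_hat @ next_V                              # (S, A)
        var_next = np.maximum(P_hat @ (next_V ** 2) - mean_next ** 2, 0.0)
        bonus = bernstein_bonus(var_next, counts, H, delta)
        Q[h] = np.clip(R_hat + mean_next + bonus, 0.0, H)       # optimistic, capped at H
        V[h] = Q[h].max(axis=1)
    return Q, V

In an episodic loop, an agent would act greedily with respect to Q[h] at each step h, then update P_hat, R_hat, and counts from the observed transitions before recomputing the optimistic values for the next episode.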
Total citations
2017: 5 · 2018: 22 · 2019: 58 · 2020: 115 · 2021: 152 · 2022: 159 · 2023: 192 · 2024: 126
Scholar articles
MG Azar, I Osband, R Munos - International conference on machine learning, 2017