Authors
Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Publication date
2012/11
Journal
Journal of Machine Learning Research
Volume
13
Pages
3207-3245
Publisher
Microtome Publishing
Description
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in the policy update. This allows us to prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds in the presence of approximation/estimation error, which depend on the average accumulated error, as opposed to the standard bounds, which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with a limited number of samples per iteration, for which the average of the errors can be significantly smaller than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods.
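To make the idea of an incremental, softmax-weighted policy update concrete, here is a minimal tabular sketch in the spirit of the description above. It is an illustrative reading of the abstract, not the paper's exact algorithm: the function names (boltzmann_softmax, dpp_iteration), the parameter values (gamma, eta, iters), and the precise form of the update are assumptions for the example.

```python
# Illustrative tabular sketch of a DPP-style incremental update.
# Assumed structure: action preferences are adjusted gradually via a
# softmax-weighted backup rather than recomputed from a hard max.
import numpy as np

def boltzmann_softmax(psi_s, eta):
    """Softmax-weighted average of one state's action preferences."""
    w = np.exp(eta * (psi_s - psi_s.max()))  # shift for numerical stability
    w /= w.sum()
    return np.dot(w, psi_s)

def dpp_iteration(P, R, gamma=0.95, eta=5.0, iters=200):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards.
    Returns action preferences psi and the induced softmax policy."""
    S, A, _ = P.shape
    psi = np.zeros((S, A))
    for _ in range(iters):
        # softmax-aggregated value of each state under current preferences
        m = np.array([boltzmann_softmax(psi[s], eta) for s in range(S)])
        # incremental preference update: change preferences gradually
        # instead of overwriting them with a fresh greedy backup
        psi = psi - m[:, None] + R + gamma * P.dot(m)
    policy = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))
    policy /= policy.sum(axis=1, keepdims=True)
    return psi, policy
```

As eta grows, the softmax aggregation approaches a hard max and the induced policy approaches a greedy one; smaller eta keeps the policy change per iteration more gradual, which is the behavior the abstract emphasizes.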
Total citations
[Citations-per-year histogram, 2011–2024]
Scholar articles
MG Azar, V Gómez, HJ Kappen - The Journal of Machine Learning Research, 2012