Authors
Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Publication date
2012/11
Journal
Journal of Machine Learning Research
Volume
13
Pages
3207-3245
Publisher
Microtome Publishing
Description
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in the policy update. This allows us to prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds in the presence of approximation/estimation error, which depend on the average accumulated error, as opposed to the standard bounds, which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with a limited number of samples per iteration, for which the average of the errors can be significantly smaller than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods.
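To make the idea of an incremental, softmax-weighted policy update concrete, here is a minimal tabular sketch in the spirit of the description above. It is an illustrative reading of the abstract, not the paper's exact algorithm: the function names (boltzmann_softmax, dpp_iteration), the parameter values (gamma, eta, iters), and the precise form of the update are assumptions for the example.

```python
# Illustrative tabular sketch of a DPP-style incremental update.
# Assumed structure: action preferences are adjusted gradually via a
# softmax-weighted backup rather than recomputed from a hard max.
import numpy as np

def boltzmann_softmax(psi_s, eta):
    """Softmax-weighted average of one state's action preferences."""
    w = np.exp(eta * (psi_s - psi_s.max()))  # shift for numerical stability
    w /= w.sum()
    return np.dot(w, psi_s)

def dpp_iteration(P, R, gamma=0.95, eta=5.0, iters=200):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards.
    Returns action preferences psi and the induced softmax policy."""
    S, A, _ = P.shape
    psi = np.zeros((S, A))
    for _ in range(iters):
        # softmax-aggregated value of each state under current preferences
        m = np.array([boltzmann_softmax(psi[s], eta) for s in range(S)])
        # incremental preference update: change preferences gradually
        # instead of overwriting them with a fresh greedy backup
        psi = psi - m[:, None] + R + gamma * P.dot(m)
    policy = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))
    policy /= policy.sum(axis=1, keepdims=True)
    return psi, policy
```

As eta grows, the softmax aggregation approaches a hard max and the induced policy approaches a greedy one; smaller eta keeps the policy change per iteration more gradual, which is the behavior the abstract emphasizes.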
Total citations
[Citations-per-year histogram, 2011–2024]
Scholar articles
MG Azar, V Gómez, HJ Kappen - The Journal of Machine Learning Research, 2012