Authors
Hao Sun, Ziping Xu, Yuhang Song, Meng Fang, Jiechao Xiong, Bo Dai, Bolei Zhou
Publication date
2020/6/11
Journal
arXiv preprint arXiv:2006.06600
Description
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms exploit the learned value function only locally through first-order updates, which limits sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function globally via zeroth-order policy optimization while preserving the local exploitation of PG methods. This learning paradigm follows Q-learning but sidesteps the difficulty of efficiently computing the argmax over a continuous action space: it finds a maximal-valued action from a small number of samples. Policy learning in ZOSPI proceeds in two steps: first, it samples actions and evaluates them with a learned value estimator; then it learns to perform the highest-valued action through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on continuous control benchmarks with remarkable sample efficiency.
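The two-step policy improvement described above can be illustrated with a minimal sketch. The following is an assumption-laden illustration, not the paper's implementation: `policy` and `q_net` are hypothetical PyTorch modules (with `q_net` taking a state batch and an action batch), candidate actions are drawn uniformly from [-1, 1], and the supervised step uses plain MSE regression onto the best-scoring candidate.

```python
import torch
import torch.nn as nn

def zospi_policy_update(policy, q_net, states, optimizer,
                        num_samples=16, action_dim=2):
    """One ZOSPI-style policy improvement step (illustrative sketch only)."""
    batch = states.shape[0]

    # Step 1: sample candidate actions per state (uniform in [-1, 1] here;
    # sampling scheme is an assumption of this sketch).
    candidates = torch.rand(batch, num_samples, action_dim) * 2 - 1

    # Score every (state, candidate) pair with the learned value estimator.
    s_rep = states.unsqueeze(1).expand(-1, num_samples, -1)
    q_vals = q_net(s_rep.reshape(-1, states.shape[-1]),
                   candidates.reshape(-1, action_dim)).view(batch, num_samples)

    # Approximate the continuous argmax by the best sampled action per state.
    best = candidates[torch.arange(batch), q_vals.argmax(dim=1)]

    # Step 2: supervised regression of the policy onto the best actions.
    loss = nn.functional.mse_loss(policy(states), best.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the policy is fit by supervised regression on sampled targets rather than by a first-order value gradient, the same machinery extends naturally to multi-modal policy classes, as the abstract notes.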
Scholar articles
Zeroth-Order Supervised Policy Improvement. H Sun, Z Xu, Y Song, M Fang, J Xiong, B Dai, B Zhou - arXiv preprint arXiv:2006.06600, 2020