Authors
Long-Fei Li, Peng Zhao, Zhi-Hua Zhou
Publication date
2024/3/24
Journal
Proceedings of the AAAI Conference on Artificial Intelligence
Volume
38
Issue
12
Pages
13572-13580
Description
We study reinforcement learning (RL) in episodic MDPs with adversarial full-information losses and an unknown transition. Instead of the classical static regret, we adopt \emph{dynamic regret} as the performance measure, which benchmarks the learner's performance against a sequence of \emph{changing} policies, making it more suitable for non-stationary environments. The primary challenge is to handle the uncertainties of the unknown transition and the unknown non-stationarity of the environment simultaneously. We propose a general framework to decouple the two sources of uncertainty and show that the dynamic regret bound naturally decomposes into two terms: one due to constructing confidence sets to handle the unknown transition, and the other due to choosing sub-optimal policies under the unknown non-stationarity. To this end, we first employ a two-layer online ensemble structure, which is model-agnostic, to handle the adaptation error caused by the unknown non-stationarity. Subsequently, we instantiate the framework for three fundamental MDP models, namely tabular MDPs, linear MDPs, and linear mixture MDPs, and present corresponding approaches to control the exploration error caused by the unknown transition. We provide dynamic regret guarantees for each model and show they are optimal in terms of the number of episodes and the non-stationarity by establishing matching lower bounds. To the best of our knowledge, this is the first work that achieves dynamic regret that is optimal with respect to both the number of episodes and the non-stationarity \emph{without} prior knowledge about the non-stationarity of the environment, for adversarial MDPs with an unknown transition.
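For concreteness, dynamic regret in this setting is typically defined against an arbitrary sequence of comparator policies; the following is a minimal sketch in the standard occupancy-measure notation (the symbols $q^{\pi}$, $\ell_k$, and $\pi^c_k$ are illustrative and not necessarily the paper's exact notation):
\[
\text{D-Regret}\big(\pi^c_{1:K}\big) \;=\; \sum_{k=1}^{K}\big\langle q^{\pi_k},\,\ell_k\big\rangle \;-\; \sum_{k=1}^{K}\big\langle q^{\pi^c_k},\,\ell_k\big\rangle,
\]
where $\pi_k$ is the learner's policy in episode $k$, $q^{\pi}$ is the occupancy measure induced by policy $\pi$ under the true transition, $\ell_k$ is the adversarial loss of episode $k$, and $\pi^c_{1:K}$ is an arbitrary comparator sequence; fixing $\pi^c_1=\cdots=\pi^c_K$ recovers the classical static regret.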