- Title
Minimax weight learning for absorbing MDPs.
- Authors
Li, Fengying; Li, Yuqiang; Wu, Xianyi
- Abstract
Reinforcement learning policy evaluation problems are often modeled as finite-horizon or discounted/average-reward infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes under a given truncation level, we propose an algorithm (referred to as MWLA in the text) to directly estimate the expected return via the importance ratio of the state-action occupancy measure. A Mean Square Error (MSE) bound for the MWLA method is provided, and the dependence of the statistical error on the data size and the truncation level is analyzed. The performance of the algorithm is illustrated by computational experiments in an episodic taxi environment.
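The estimation step the abstract describes can be sketched in a few lines. This is a simplified, hypothetical illustration only: in MWLA the occupancy-ratio weights are learned by a minimax procedure, whereas here they are assumed to be already available, and the function names and data layout are invented for the example. For an absorbing MDP, the target policy's expected undiscounted return equals the behavior-data expectation of the weighted per-step rewards, where the weight is the ratio of target to behavior state-action occupancy measures.

```python
# Hypothetical sketch of occupancy-ratio off-policy evaluation.
# `weights` stands in for the ratio of target-policy to behavior-policy
# state-action occupancy measures, which MWLA estimates via minimax
# weight learning; here it is simply supplied as input.

def ope_estimate(episodes, weights):
    """Estimate the undiscounted expected return of the target policy.

    episodes: list of truncated trajectories, each a list of
              (state, action, reward) triples (i.i.d. episodes).
    weights:  dict mapping (state, action) to the estimated
              occupancy-measure importance ratio.
    """
    total = 0.0
    for ep in episodes:
        for s, a, r in ep:
            # Reweight each observed reward by the occupancy ratio.
            total += weights.get((s, a), 0.0) * r
    # Average the weighted cumulative reward over episodes.
    return total / len(episodes)
```

As a sanity check, when the target and behavior policies coincide the ratio is identically 1 and the estimator reduces to the empirical mean return of the episodes.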
- Subjects
STATISTICAL errors; STATISTICS; EXPECTED returns; MARKOV processes; REINFORCEMENT learning
- Publication
Statistical Papers, 2024, Vol. 65, Issue 6, p. 3545
- ISSN
0932-5026
- Publication type
Article
- DOI
10.1007/s00362-023-01491-4