Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation
Aaron Sonabend, Junwei Lu, Leo Anthony Celi, Tianxi Cai, Peter Szolovits
Offline Reinforcement Learning (RL) is a promising approach for learning optimal policies in environments where direct exploration is expensive or infeasible. However, the adoption of such policies in practice is often challenging, as they are hard to interpret within the application context and lack measures of uncertainty for the learned policy value and its decisions. To overcome these issues, we propose an Expert-Supervised RL (ESRL) framework which uses uncertainty quantification for offline policy learning. In particular, we make three contributions: 1) the method can learn safe and optimal policies through hypothesis testing, 2) ESRL allows for different levels of risk-averse implementation tailored to the application context, and finally, 3) we propose a way to interpret ESRL's policy at every state through posterior distributions, and use this framework to compute off-policy value function posteriors. We provide theoretical guarantees for our estimators and regret bounds consistent with Posterior Sampling for RL (PSRL). The sample efficiency of ESRL is independent of the chosen risk-aversion threshold and of the quality of the behavior policy.
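As a rough illustration of the kind of decision rule the abstract describes, the following minimal Python sketch chooses between a learned action and the behavior (expert) action via a posterior hypothesis test at a risk-aversion threshold. It is an assumption-laden sketch, not the paper's algorithm: the names (esrl_action, q_samples, behavior_action, alpha) and the use of Monte Carlo draws over action values are all illustrative.

# Hypothetical ESRL-style decision rule; all names and the posterior model
# are illustrative assumptions, not the paper's actual API or method.
import numpy as np

def esrl_action(q_samples, behavior_action, alpha=0.05):
    """Pick an action via a posterior hypothesis test against the expert.

    q_samples: array of shape (n_draws, n_actions) holding posterior draws
        of action values at the current state (e.g., from a Bayesian model).
    behavior_action: index of the expert/behavior policy's action.
    alpha: risk-aversion threshold; smaller alpha defers to the expert more.
    """
    # Greedy candidate under the posterior-mean action values.
    candidate = int(np.argmax(q_samples.mean(axis=0)))
    if candidate == behavior_action:
        return candidate
    # Posterior probability that the candidate is no better than the expert.
    p_not_better = np.mean(q_samples[:, candidate] <= q_samples[:, behavior_action])
    # Only override the expert when the evidence clears the risk threshold;
    # otherwise fall back to the safe behavior action.
    return candidate if p_not_better < alpha else behavior_action

# Usage: 1000 posterior draws over 3 actions, expert plays action 0.
rng = np.random.default_rng(0)
draws = rng.normal(loc=[0.0, 0.4, -0.2], scale=1.0, size=(1000, 3))
print(esrl_action(draws, behavior_action=0, alpha=0.05))

Lowering alpha makes the rule more conservative, matching the abstract's point that the level of risk aversion can be tailored to the application context without changing how the posterior itself is learned.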


