Safe and efficient inverse reinforcement learning

Brown, Daniel Sundquist
Journal Title
Journal ISSN
Volume Title

As robots and other autonomous agents enter our homes, hospitals, schools, and workplaces, it is important that they can safely and efficiently infer and adapt to human preferences. One common way to teach human preferences to robots and other autonomous agents is through imitation learning, where an agent learns by observing demonstrations of how to perform a task. Imitation learning has the potential to allow everyday users the ability to program and adapt the behavior of an autonomous agent simply by showing it how to perform a task. However, for imitation learning algorithms to be deployed in complex, possibly high-risk situations, it is important that these algorithms can provide practical, high-confidence bounds on performance.

If a robot is to reason effectively about its performance when learning from demonstrations, it needs to infer the goals and intent of the demonstrator. One common way to infer goals and intentions from demonstrations is through inverse reinforcement learning, where the goal is to infer the reward function of the demonstrator. However, most inverse reinforcement learning algorithms have limited real-world applicability because they do not provide practical assessments of safety, often require large numbers of demonstrations, and have high computational costs. This dissertation addresses these shortcomings by developing efficient inverse reinforcement learning algorithms that allow autonomous agents to provide high-confidence bounds on performance when learning from demonstrations.

We first formalize the problem of safe imitation learning via high-confidence performance bounds. We then present a general Bayesian framework for computing tight high-confidence performance bounds on any evaluation policy when the true reward function is unknown and must be inferred from demonstrations. The method we propose is sample-efficient, but is computationally inefficient for learning in complex, high-dimensional tasks. To address this computational inefficiency, we first introduce a computationally efficient algorithm for reward learning via ranked demonstrations. We show that preference rankings over demonstrations enable reward inference algorithms to scale to high-dimensional imitation learning tasks such as learning to play Atari games with no access to the score, but with access to a few suboptimal, ranked demonstrations. We also show that preference rankings allow for better-than-demonstrator performance and that rankings over demonstrations can sometimes be obtained automatically, without requiring access to explicit preference labels. Furthermore, we leverage the computational efficiency of reward learning via preferences to scale high-confidence policy evaluation to complex imitation learning settings with high-dimensional, visual demonstrations.

While our work on high-confidence policy evaluation gives efficient bounds on the performance of an imitation learning agent, it does not answer the question of what an agent should do to learn a policy that is safe with high probability. The final contributions of this dissertation are two different approaches for robust policy optimization for imitation learning. We first derive an algorithm that directly optimizes a policy to balance risk and expected return under a reward function posterior given a fixed set of demonstrations. Second, we address the problem of robust policy optimization via active learning. We present a sample-efficient, active inverse reinforcement learning algorithm that generates risk-aware queries that enable robust policy improvement via repeated interactions with a demonstrator.