We formalize the problem of learning constraints from expert demonstrations by extending inverse reinforcement learning, and we develop a multi-task version of inverse constraint learning to avoid selecting degenerate constraints. We validate our methods on high-dimensional continuous control tasks and show that we can match expert performance and recover ground-truth constraints.
Regardless of the particular task we want them to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task settings to learn a tighter set of constraints. We validate our method with simulation experiments on high-dimensional continuous control tasks.
We consider a setting where we have access to expert demonstrations of a task, along with the task's reward.
This allows us to compare the behavior of the expert against that of the reward-optimal policy for the task. Our first insight is that actions taken by the reward-optimal policy but not by the expert are likely to be forbidden.
We can extract a constraint in this way by formulating ICL (inverse constraint learning) as a two-player zero-sum game between a constraint player and a policy player:
$$\color{red}{\max_{c \in \mathcal{F}_c} \max_{\lambda > 0}}\color{green}{\min_{\pi \in \Pi}} J(\pi_E, r - \color{red}{\lambda c}) - J(\color{green}{\pi}, r - \color{red}{\lambda c})$$
For a fixed constraint, the policy player aims to maximize their reward while satisfying the constraint by solving constrained RL: $$ \color{red}{\max_{\lambda > 0}}\color{green}{\min_{\pi \in \Pi}} -J(\color{green}{\pi}, r - \color{red}{\lambda c})$$
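To make the policy player's subproblem concrete, here is a minimal Lagrangian dual-ascent sketch (not our released implementation). The helpers `rl_best_response`, `expected_value`, `reward_fn`, `cost_fn`, and the threshold `budget` are assumed placeholders, e.g., wrappers around an off-the-shelf RL algorithm and Monte Carlo rollouts:

```python
def constrained_rl(rl_best_response, expected_value, reward_fn, cost_fn,
                   lam=1.0, lr=0.05, iters=50, budget=0.0):
    """Lagrangian sketch of the policy player's problem for a fixed constraint c.

    Assumed placeholders: `rl_best_response(r)` returns an (approximately)
    optimal policy for reward function r, and `expected_value(pi, f)` estimates
    J(pi, f), e.g., via Monte Carlo rollouts. `budget` is an optional
    constraint threshold (0 by default).
    """
    pi = None
    for _ in range(iters):
        # Policy player: best response to the lambda-penalized reward r - lam * c.
        penalized = lambda s, a: reward_fn(s, a) - lam * cost_fn(s, a)
        pi = rl_best_response(penalized)
        # Dual ascent on lambda: raise the penalty while the learner still incurs
        # constraint cost above the budget, and relax it otherwise.
        lam = max(0.0, lam + lr * (expected_value(pi, cost_fn) - budget))
    return pi, lam
```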
For a fixed policy, the constraint player uses classification to pick the constraint that maximally penalizes the learner relative to the expert: $$\color{red}{\max_{c \in \mathcal{F}_c}} J(\pi_E, -\color{red}{\lambda c}) - J(\color{green}{\pi}, -\color{red}{\lambda c})$$
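Analogously, the constraint player's update can be sketched as a few gradient steps on a cost network (again an illustrative sketch rather than our exact training code; `c_net`, `expert_sa`, and `learner_sa` are assumed names for a bounded cost model and batches of expert and learner state-action features):

```python
import torch
import torch.nn as nn

def constraint_player_step(c_net: nn.Module, expert_sa: torch.Tensor,
                           learner_sa: torch.Tensor, lam: float,
                           opt: torch.optim.Optimizer, steps: int = 100):
    """Gradient sketch of max_c J(pi_E, -lam*c) - J(pi, -lam*c).

    `c_net` is assumed to map state-action features to a cost in [0, 1]
    (e.g., an MLP with a sigmoid output); `expert_sa` / `learner_sa` are
    batches of features sampled from the expert and the current learner.
    """
    for _ in range(steps):
        # Maximizing the objective above is equivalent to minimizing
        # lam * (E_expert[c] - E_learner[c]): push cost up on states the
        # learner visits but the expert avoids, and down on expert states.
        loss = lam * (c_net(expert_sa).mean() - c_net(learner_sa).mean())
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The full ICL loop then alternates these two updates: solve constrained RL against the current constraint, then refit the constraint on trajectories from the resulting learner.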
One potential failure mode of ICL is that it can lead to an overly conservative constraint which forbids all non-expert behavior. Such a constraint would fail to generalize to new tasks. To resolve this, we propose a multi-task version of inverse constraint learning, MT-ICL, which provides better coverage of the state space and learns a shared constraint across multiple tasks.
If we observe $K$ samples of the form $(r_k, \{\xi \sim \pi_E^k \})$, each pairing a task reward with expert demonstrations for that task, we can formulate the multi-task game by sharing a single constraint $c$ across tasks while giving each task its own policy and Lagrange multiplier; one natural form of the resulting objective is:
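$$\color{red}{\max_{c \in \mathcal{F}_c}} \sum_{k=1}^{K} \color{red}{\max_{\lambda_k > 0}}\color{green}{\min_{\pi_k \in \Pi}} J(\pi_E^k, r_k - \color{red}{\lambda_k c}) - J(\color{green}{\pi_k}, r_k - \color{red}{\lambda_k c})$$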
We further provide a statistical condition for generalization of the shared constraint and empirically show that MT-ICL generalizes better to new tasks.
We provide implementations of constrained reinforcement learning and inverse constraint learning and benchmark them on tasks from the PyBullet and MuJoCo suites. For a single task with a restricted function class of linear constraints, we show that ICL exactly recovers the ground-truth constraint, matches expert performance and constraint satisfaction, and is even robust to suboptimal expert demonstrations.
We consider the task of ant locomotion with a position constraint of staying above the line $y=0.5x$.
Over the course of ICL training, the learned position constraint (blue line) converges to the ground-truth constraint (red line), and the ant learns to escape the unsafe red region.
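For concreteness, the ground-truth cost for this experiment and a member of the linear constraint class that ICL searches over can be written as follows (an illustrative sketch with our own variable names, not the released code):

```python
import numpy as np

# Ground-truth position constraint for the ant: the region below the line
# y = 0.5x is unsafe (cost 1); staying above it is safe (cost 0).
def true_cost(x: float, y: float) -> float:
    return float(y < 0.5 * x)

# A linear constraint class over position features [x, y, 1]: ICL searches over
# weights w and flags a state as unsafe when w . [x, y, 1] > 0. The ground
# truth corresponds (up to scale) to w = [0.5, -1.0, 0.0].
def linear_cost(w: np.ndarray, x: float, y: float) -> float:
    return float(w @ np.array([x, y, 1.0]) > 0.0)
```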
We evaluate MT-ICL on the more challenging AntMaze setting, where navigating to different goals corresponds to different tasks. The constraint in this setting is to not walk through the walls of the maze. Within a single iteration, MT-ICL is able to learn policies that match expert performance and constraint violation, all without ever interacting with the ground-truth maze.
We visualize our trained policy and the output of our constraint network below.
We release all of our code at the link below.
Konwoo Kim*, Gokul Swamy*, Zuxin Liu, Ding Zhao, Sanjiban Choudhury, Zhiwei Steven Wu
@inproceedings{kim2023learning,
title={Learning Shared Safety Constraints from Multi-task Demonstrations},
author={Konwoo Kim and Gokul Swamy and Zuxin Liu and Ding Zhao and Sanjiban Choudhury and Steven Wu},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=8U31BCquNF}
}
This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project, and adapted to be mobile responsive by Jason Zhang. The code we built on can be found here.