When attempting to mimic an expert, a learner can proceed by (a) rolling out its policy and comparing the generated trajectories to expert trajectories, (b) producing actions on expert states and attempting to match the expert's action-conditionals, or (c) performing rollouts and attempting to match corrections provided by a queryable expert. For each of these settings, we provide bounds on how well the learner can do, reduction-based algorithms for efficiently finding strong policies, and simple yet competitive practical instantiations that scale to high-dimensional tasks.
We provide a unifying view of a large family of previous imitation learning algorithms through the lens of moment matching. At its core, our classification scheme is based on whether the learner attempts to match (1) reward or (2) action-value moments of the expert's behavior, with each option leading to differing algorithmic approaches. By considering adversarially chosen divergences between learner and expert behavior, we are able to derive bounds on policy performance that apply to all algorithms in each of these classes, the first such bounds to our knowledge. We also introduce the notion of moment recoverability, implicit in many previous analyses of imitation learning, which allows us to cleanly delineate how well each algorithmic family is able to mitigate compounding errors. We derive three novel algorithm templates (AdVIL, AdRIL, and DAeQuIL) with strong guarantees, simple implementation, and competitive empirical performance.
As introduced in the figure above, we formalize imitation learning as matching (a) reward, (b) off-policy Q-value, or (c) on-policy Q-value moments. These settings place different requirements on the learner and the expert, and lead to different performance bounds:
| Moment Matched | Env. Access | $\pi_E$ Queries | Upper Bound | Lower Bound | Sample Algorithms |
|---|---|---|---|---|---|
| Reward | ✓ | ✗ | $J(\pi_E) - J(\pi) \leq O(\epsilon T)$ | $J(\pi_E) - J(\pi) \geq \Omega(\epsilon T)$ | GAIL, MaxEnt IRL, SQIL |
| Off-Policy Q | ✗ | ✗ | $J(\pi_E) - J(\pi) \leq O(\epsilon T^2)$ | $J(\pi_E) - J(\pi) \geq \Omega(\epsilon T^2)$ | ValueDICE |
| On-Policy Q | ✓ | ✓ | $J(\pi_E) - J(\pi) \leq O(\epsilon H T)$ | $J(\pi_E) - J(\pi) \geq \Omega(\epsilon T)$ | DAgger, GPS, iFAIL |
At first glance, the reward-moment bounds might seem too good to be true: they don't require an interactive expert, yet they appear tighter than the on-policy Q-value bounds. However, this comes at the cost of a potentially exponentially harder optimization problem. Ordered by computational complexity, the tighter the bound, the more challenging the optimization that must be solved to achieve it. Importantly, these bounds apply to all algorithms in each of these classes.
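To give a flavor of where the reward-moment upper bound comes from, here is an informal sketch in our own words (not a quotation of the paper's proof): if the true reward lies in the class of functions whose moments are matched, and the per-timestep reward-moment error is at most $\epsilon$, the gap telescopes across the horizon:

$$J(\pi_E) - J(\pi) \;=\; \sum_{t=1}^{T} \Big( \mathbb{E}_{(s_t, a_t) \sim \pi_E}\big[r_t(s_t, a_t)\big] - \mathbb{E}_{(s_t, a_t) \sim \pi}\big[r_t(s_t, a_t)\big] \Big) \;\leq\; \sum_{t=1}^{T} \epsilon \;=\; \epsilon T.$$

The matching lower bound in the same row says that this linear-in-$T$ dependence cannot be improved in general.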
A natural question at this point might be what $H$ in the table above means. $H$ is the recoverability constant of a problem and is a property of both the expert's policy and the set of moments considered. Intuitively, it tells us how many timesteps it takes the expert to correct a mistake: a car drifting within its lane can be nudged back in a few steps, while a car that has already driven off a cliff cannot be recovered within the horizon. $H$ quantifies the difficulty of the problem and tells us how much on-policy corrections from the expert can help -- for $H$ that is $O(T)$, the expert can help very little, while for $H$ that is $O(1)$, on-policy corrections can help a lot.
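One informal way to see where $H$ enters the on-policy Q row (our own sketch via the performance difference lemma, not the paper's exact argument) is that the imitation gap decomposes into per-timestep expert-Q-value moment gaps evaluated on the learner's own state distribution,

$$J(\pi_E) - J(\pi) \;=\; \sum_{t=1}^{T} \mathbb{E}_{s_t \sim \pi}\Big[\, \mathbb{E}_{a \sim \pi_E(\cdot \mid s_t)} Q^{\pi_E}_t(s_t, a) \;-\; \mathbb{E}_{a \sim \pi(\cdot \mid s_t)} Q^{\pi_E}_t(s_t, a) \,\Big].$$

If the relevant differences in $Q^{\pi_E}$ have magnitude at most $H$ (recoverability) while each moment is matched to within $\epsilon$, every term is on the order of $\epsilon H$, yielding the $O(\epsilon H T)$ upper bound.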
So, how does one actually match moments? We formulate moment matching as a zero-sum game and compute equilibrium strategies for the policy player against adversarially chosen divergences that attempt to distinguish between learner and expert samples. We identify two flavors of such algorithms, depending on which player (the policy or the discriminator) follows a no-regret strategy while the other best-responds.
As we show in our paper, access to oracles that can perform no-regret and best-response updates allows one to match each of the three types of moments up to any $\epsilon > 0$ efficiently. One can then consult the above table to arrive at policy performance bounds depending on the class of moment that was matched. Thus, our framework gives us a family of reduction-based methods for closing the imitation gap.
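To make the structure of this reduction concrete, here is a tiny, self-contained toy example (ours, not code from the paper or our release): in a single-state problem, the learner matches the expert's action distribution against adversarially chosen bounded linear moments by running a no-regret (Hedge) policy player against a best-responding discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                  # number of actions (single-state problem)
p_expert = rng.dirichlet(np.ones(K))   # expert's action distribution

T = 2000                               # number of game rounds
eta = np.sqrt(np.log(K) / T)           # standard Hedge step size
log_w = np.zeros(K)                    # policy player's Hedge state
q_avg = np.zeros(K)                    # averaged iterate (what the reduction returns)

for _ in range(T):
    q = np.exp(log_w - log_w.max())
    q /= q.sum()                       # learner's current action distribution
    q_avg += q / T

    # Best response: the discriminator picks the bounded linear moment f that
    # maximally separates learner and expert, i.e. maximizes <f, q - p_expert>.
    f = np.sign(q - p_expert)

    # No-regret step: multiplicative-weights update against the loss <f, q>.
    log_w -= eta * f

print("moment gap of averaged policy:", float(np.abs(q_avg - p_expert).sum()))
```

The averaged iterate is the policy the reduction's guarantee applies to: its remaining moment gap shrinks at the policy player's no-regret rate.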
One of the advantages of a reductions approach is that it allows us to plug in any no-regret-style procedure and have the performance guarantees still hold. We introduce three novel algorithms that can be seen as realizations of the above no-regret reduction:
- AdRIL: A reward-moment-matching algorithm in which the replay buffer acts as the discriminator, assigning dynamically updated rewards to samples that the learner then uses to compute updates via maximum entropy RL.
- AdVIL: An off-$Q$ algorithm in which the learner and discriminator take the form of an IPM-GAN.
- DAeQuIL: An on-$Q$ algorithm that can be seen as an extension of DAgger in which, at each round, the discriminator adversarially chooses a loss function and the learner minimizes the sum of chosen losses over the aggregated data (see the sketch below).

We implemented all three of these algorithms in PyTorch and tested them on the PyBullet suite. AdVIL is competitive with the state of the art, while AdRIL and DAeQuIL significantly outperform baselines on some tasks. We release our code at the link below in the hope that the community will build on and refine our implementations.
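Below is a schematic sketch of the DAeQuIL-style control flow (our simplification: the callables `rollout_states`, `query_expert`, `pick_adversarial_loss`, and `fit_policy` are hypothetical placeholders for problem-specific components; the actual algorithm lives in the paper and our released code).

```python
from typing import Any, Callable, List, Tuple

State, Action, Policy = Any, Any, Any  # problem-specific types in practice


def daequil_sketch(
    policy: Policy,
    rollout_states: Callable[[Policy], List[State]],  # hypothetical: roll out learner, record visited states
    query_expert: Callable[[State], Action],           # hypothetical: interactive expert, as in DAgger
    pick_adversarial_loss: Callable[[Policy, List[Tuple[State, Action]]], Callable],  # hypothetical discriminator
    fit_policy: Callable[[List[Tuple[State, Action]], List[Callable]], Policy],       # hypothetical learner update
    num_rounds: int,
) -> Policy:
    """Schematic DAgger-style loop with an adversarially chosen loss per round."""
    aggregated: List[Tuple[State, Action]] = []
    chosen_losses: List[Callable] = []
    for _ in range(num_rounds):
        # 1. Roll out the current policy and ask the expert for corrections
        #    on the states the learner actually visits.
        states = rollout_states(policy)
        aggregated += [(s, query_expert(s)) for s in states]

        # 2. The discriminator adversarially picks the loss that best
        #    distinguishes the learner's actions from the expert's.
        chosen_losses.append(pick_adversarial_loss(policy, aggregated))

        # 3. Follow-the-leader step: the learner minimizes the sum of all
        #    chosen losses over the aggregated data.
        policy = fit_policy(aggregated, chosen_losses)
    return policy
```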
Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, Zhiwei Steven Wu
ICML, 2021.
@InProceedings{swamy2021moments,
title = {Of Moments and Matching: A Game-Theoretic Framework for Closing the Imitation Gap},
author = {Gokul Swamy and Sanjiban Choudhury and J. Andrew Bagnell and Zhiwei Steven Wu},
booktitle = {Proceedings of the 38th International Conference on Machine Learning},
year = {2021},
}
This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project, and adapted to be mobile responsive by Jason Zhang. The code we built on can be found here.