A Smooth Sea Never Made a Skilled $\texttt{SAILOR}$:
Robust Imitation via Learning to Search
β›΅


Arnav Kumar JainπŸ‘¨β€βœˆοΈ, πŸ›Ÿ
Vibhakar MohtaπŸ‘¨β€βœˆοΈ, πŸ”±
Subin Kimβš“
Atiksh Bhardwajβš“
Juntao Renβš“
Yunhai Fengβš“
Sanjiban Choudhuryβš“
Gokul SwamyπŸ”±


πŸ‘¨β€βœˆοΈ Equal Contribution
πŸ›Ÿ Mila-Quebec AI Institute,
UniversitΓ© de MontrΓ©al
πŸ”± Carnegie Mellon University
βš“ Cornell University




$\texttt{SAILOR}$ β›΅: Searching Across Imagined Latents Online for Recovery


TLDR: We introduce $\texttt{SAILOR}$, a method for learning to search from expert demonstrations. By learning world and reward models from demonstration data, we endow the agent with the ability to reason, at test time, about how to recover from mistakes. $\texttt{SAILOR}$ consistently outperforms Diffusion Policies trained via behavioral cloning on 5-10x as much data.

$\texttt{SAILOR}$ Avoids Mistakes Without Additional Human Feedback

Across a dozen complex visual manipulation tasks, we find that $\texttt{SAILOR}$ avoids mistakes that the base Diffusion Policy makes, without requiring any additional demonstrations, DAgger-style corrections, or ground-truth reward labels. $\texttt{SAILOR}$ outperforms Diffusion Policies trained via behavioral cloning across all twelve tasks and three dataset scales considered.



We see that $\texttt{SAILOR}$ avoids mistakes that the base Diffusion Policy makes by searching for self-corrections at test time.

Abstract

The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake that takes them out of the support of the demonstrations, they often don't know how to recover from it. In this sense, BC is akin to giving the agent the fish -- giving them dense supervision across a narrow set of states -- rather than teaching them to fish: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach $\texttt{SAILOR}$ consistently outperforms state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10$\times$ still leaves a performance gap. We also find that $\texttt{SAILOR}$ can identify nuanced failures and is robust to reward hacking.

$\texttt{SAILOR}$ Architecture: Learning to Search

The $\texttt{SAILOR}$ architecture consists of three key components: a base policy $\pi^{\textsf{base}}$, a learned world model $\texttt{WM}$, and a learned reward model $\texttt{RM}$ (with an associated value estimate $\texttt{V}$).

At test time, $\texttt{SAILOR}$ performs a local search around the nominal plan of the base policy $\pi^{\textsf{base}}$. This search is carried out inside the learned world model $\texttt{WM}$ and scored by the learned reward model $\texttt{RM}$. We use the Model-Predictive Path Integral (MPPI) algorithm to perform the search before executing the first step of the resulting plan, MPC-style. More formally, we solve the following problem at each timestep:

$$ \Delta^{*}_{t:t+k} = \arg\max_{\Delta_{t:t+k}} \mathbb{E}_{z_t \sim \texttt{WM}(z_{t-1}, a_{t-1})}\left[\sum_{h=0}^{k-1} \gamma^h \texttt{RM}(z_{t+h}) +\gamma^k\texttt{V}(z_{t+k}) \bigg \vert o_t, a_{t:t+k} = a^{\textsf{base}}_{t:t+k} + \Delta_{t:t+k} \right].$$

We then execute the first of these actions, $a^{\textsf{base}}_{t}+\Delta_{t}^{*}$, in the environment before re-planning on top of the fresh observation $o_{t+1}$. In short, $\texttt{SAILOR}$ learns to search without any additional human feedback (e.g. DAgger-style corrections) or ground-truth reward labels. It then searches at test time to correct errors that the base policy would have made. One can then distill these corrected plans into the base policy in an expert-iteration-style procedure.
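As a concrete illustration, here is a minimal NumPy sketch of this residual MPPI search. The interfaces it calls (`wm.encode`/`wm.step`, `rm.reward`/`rm.value`, `base_policy.plan`) are hypothetical names chosen for readability, not the released implementation.

```python
import numpy as np

# Assumed (illustrative) interfaces:
#   wm.encode(obs)            -> latent z_t
#   wm.step(z, a)             -> next latent z'
#   rm.reward(z), rm.value(z) -> learned reward / value estimates
#   base_policy.plan(obs, k)  -> nominal action chunk a_{t:t+k}, shape (k, act_dim)

def mppi_residual_search(obs, wm, rm, base_policy, horizon=8, n_samples=256,
                         n_iters=3, sigma=0.1, gamma=0.99, temperature=0.5):
    """Local MPPI search for a residual Delta on top of the base policy's plan."""
    a_base = base_policy.plan(obs, horizon)       # nominal plan from the base policy
    mu = np.zeros_like(a_base)                    # mean residual, initialized at zero
    z0 = wm.encode(obs)                           # latent state for the current observation

    for _ in range(n_iters):
        # Sample candidate residuals around the current mean.
        deltas = mu + sigma * np.random.randn(n_samples, *a_base.shape)
        returns = np.zeros(n_samples)

        for i, delta in enumerate(deltas):
            z, ret = z0, 0.0
            for h in range(horizon):              # imagined rollout in the world model
                z = wm.step(z, a_base[h] + delta[h])
                ret += (gamma ** h) * rm.reward(z)
            ret += (gamma ** horizon) * rm.value(z)  # bootstrap with the learned value
            returns[i] = ret

        # Exponentially re-weight candidates by imagined return (the MPPI update).
        weights = np.exp((returns - returns.max()) / temperature)
        weights /= weights.sum()
        mu = (weights[:, None, None] * deltas).sum(axis=0)

    # MPC-style: return only the first corrected action; re-plan at the next step.
    return a_base[0] + mu[0]
```

The temperature controls how sharply the update concentrates on the highest-return candidates; only the first corrected action is executed before re-planning on the next observation.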

$\texttt{SAILOR}$ Can Identify And Recover from Nuanced Failures

We now explore qualitatively what our learned $\texttt{RM}$ is detecting and how robust $\texttt{SAILOR}$ is. To do so, we roll out the base policy until it fails at the given task (purple line), reset the agent to a state where failure is not yet guaranteed, and let $\texttt{SAILOR}$ complete the rest of the episode (orange line). The values plotted below are from our learned $\texttt{RM}$, which was trained without any ground-truth rewards.
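For concreteness, a rough sketch of this probe is given below, assuming a gym-style simulator that exposes state snapshot/restore. The `env.set_state`/`env.step` interfaces and helper names are assumptions for illustration, not $\texttt{SAILOR}$'s released code.

```python
def rollout_with_rm(env, start_state, act_fn, wm, rm, max_steps=200):
    """Roll out act_fn from a saved simulator state, logging the learned RM at each step."""
    obs = env.set_state(start_state)     # assumed: restoring a state returns the observation
    rm_trace = []
    for _ in range(max_steps):
        rm_trace.append(float(rm.reward(wm.encode(obs))))
        obs, _, done, _ = env.step(act_fn(obs))
        if done:
            break
    return rm_trace

# Purple line: the base policy rolled out until it fails.
# Orange line: SAILOR resumed from a pre-failure state along the same trajectory.
# base_trace   = rollout_with_rm(env, s_start,   base_policy.act, wm, rm)
# sailor_trace = rollout_with_rm(env, s_prefail, sailor_act,      wm, rm)
```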




$\texttt{SAILOR}$ Outperforms Diffusion Policies Trained on 5-10x the Data

A natural question after viewing the preceding results is whether more demonstrations could close the performance gap between $\texttt{SAILOR}$ and Diffusion Policies. To answer this, we scale up the number of demonstrations used for behavioral cloning by 5-10$\times$. We find that even with this significant increase in data, Diffusion Policies still do not match the performance of $\texttt{SAILOR}$. While Diffusion Policies do improve with more data, their performance plateaus before reaching $\texttt{SAILOR}$'s.

This reflects the fact that learning-to-search algorithms like $\texttt{SAILOR}$ can take advantage of generation-verification gaps for more sample-efficient learning, as they learn reward models (i.e. verifiers) in addition to policies (i.e. generators).

$\texttt{SAILOR}$ Scales Robustly at Test-Time

A perennial concern with test-time scaling approaches is reward hacking -- the possibility that the agent optimizes the learned proxy reward to the point that it no longer correlates with the true task objective. To test how robust $\texttt{SAILOR}$ is to reward hacking, we perform an experiment where we scale up the number of samples used for MPPI, which is analogous to scaling $N$ in Best-of-$N$ sampling.
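To make the analogy concrete, here is a sketch of the Best-of-$N$ view of this experiment: sample $N$ perturbed plans around the base policy, score each with the learned $\texttt{WM}$/$\texttt{RM}$/$\texttt{V}$, and keep the best. Note that $\texttt{SAILOR}$'s MPPI softly re-weights candidates rather than taking a hard argmax; the interface names follow the sketch above and are illustrative, not the actual implementation.

```python
import numpy as np

def score_plan(z0, actions, wm, rm, gamma=0.99):
    """Imagined return of an action chunk under the learned WM, RM, and value."""
    z, ret = z0, 0.0
    for h, a in enumerate(actions):
        z = wm.step(z, a)
        ret += (gamma ** h) * rm.reward(z)
    return ret + (gamma ** len(actions)) * rm.value(z)

def best_of_n(obs, wm, rm, base_policy, n=256, horizon=8, sigma=0.1):
    """Sample n perturbed plans around the base policy and keep the highest-scoring one."""
    z0 = wm.encode(obs)
    a_base = base_policy.plan(obs, horizon)
    candidates = a_base + sigma * np.random.randn(n, *a_base.shape)
    scores = [score_plan(z0, c, wm, rm) for c in candidates]
    # Reward-hacking risk grows with n if the learned RM is a poor proxy for the task.
    return candidates[int(np.argmax(scores))]
```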


Put together, these results suggest that an interesting direction for future work building on the $\texttt{SAILOR}$ stack is to adaptively allocate compute at test time, reasoning about recovery behavior only when the base policy is likely to make a mistake.

Paper πŸ“„

A Smooth Sea Never Made a Skilled $\texttt{SAILOR}$: Robust Imitation via Learning to Search

Arnav Kumar Jain*, Vibhakar Mohta*, Subin Kim, Atiksh Bhardwaj, Juntao Ren, Yunhai Feng, Sanjiban Choudhury, and Gokul Swamy

@article{jain2025sailor,
  title={A Smooth Sea Never Made a Skilled πš‚π™°π™Έπ™»π™Ύπš: Robust Imitation via Learning to Search},
  author={Arnav Kumar Jain and Vibhakar Mohta and Subin Kim and Atiksh Bhardwaj and Juntao Ren and Yunhai Feng and Sanjiban Choudhury and Gokul Swamy},
  journal={CoRR},
  volume={abs/2506.05294},
  year={2025}
}


This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project, and adapted to be mobile responsive by Jason Zhang. The code we built on can be found here.