Self-Play and Self-Describe for Unseen Tasks and Environments
Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models
Recent progress in vision-language foundation models has brought significant advances toward building general-purpose robots.
While this is encouraging, the learned policy still fails in most cases when given an unseen task or an unseen environment.
To adapt the policy to unseen tasks and environments, we explore a new paradigm of leveraging pre-trained foundation models with self-play and self-describe (SPLAYD).
When deploying the trained policy to a new task or a new environment, we first let the policy self-play with randomly generated instructions to record the demonstrations.
While the execution could be wrong, we can accurately self-describe (i.e., re-label or classify) the demonstrations.
This automatically provides new pairs of demonstration-instruction data for policy fine-tuning.
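The overall adaptation loop can be summarized in a short sketch. The code below is a minimal illustration rather than the authors' released implementation; the Policy, environment, and vision-language-model interfaces (act, step, retrieve_instruction, finetune) are hypothetical placeholders for whatever policy network, simulator, and foundation model are actually used.

```python
# Minimal sketch of a SPLAYD-style adaptation loop (illustrative only).
# policy, env, and vlm are hypothetical interfaces standing in for the
# actual policy network, simulator/robot, and vision-language foundation model.

import random
from dataclasses import dataclass, field

@dataclass
class Demonstration:
    observations: list = field(default_factory=list)  # visual observations per step
    actions: list = field(default_factory=list)       # policy actions per step
    instruction: str = ""                             # (re-)labeled language instruction

def self_play(policy, env, candidate_instructions, num_episodes):
    """Roll out the current policy on randomly sampled instructions and record demos."""
    demos = []
    for _ in range(num_episodes):
        instruction = random.choice(candidate_instructions)
        demo = Demonstration(instruction=instruction)
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs, instruction)
            demo.observations.append(obs)
            demo.actions.append(action)
            obs, done = env.step(action)
        demos.append(demo)
    return demos

def self_describe(vlm, demos, candidate_instructions):
    """Re-label each demo with the instruction the foundation model matches to its frames."""
    for demo in demos:
        demo.instruction = vlm.retrieve_instruction(demo.observations, candidate_instructions)
    return demos

def adapt(policy, vlm, env, candidate_instructions, num_episodes=100):
    """One round of adaptation: self-play, then self-describe, then fine-tune on the new pairs."""
    demos = self_play(policy, env, candidate_instructions, num_episodes)
    demos = self_describe(vlm, demos, candidate_instructions)
    policy.finetune(demos)  # e.g. behavior cloning on the re-labeled instruction-demo pairs
    return policy
```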
We evaluate our method on a broad range of experiments with a focus on generalization to unseen objects, unseen tasks, unseen environments, and sim-to-real transfer.
We show that our method outperforms the baselines by a large margin in all cases.
Authors
Yuying Ge, Annabella Macaluso, Li Erran Li, Ping Luo, Xiaolong Wang
In this paper, we propose to leverage pre-trained vision-language foundation models to correct or adapt policies at test time on unseen tasks and environments.
Specifically, we introduce a self-play and self-describe policy adaptation pipeline.
When adapting a trained policy to a new task or new environment, we first let the policy self-play, that is, the model continuously generates and performs actions given a series of language instructions in the new task, and we record the demonstrations, including the visual observations and the model's actions.
Of course, the instructions and the resulting demonstrations will often not match in the out-of-distribution environment.
By taking the visual observations of the recorded demonstrations as inputs, the foundation model can retrieve the corresponding accurate language instructions.
We then make the correction by performing self-describe, that is, the recorded demonstrations are automatically re-labeled by the pre-trained vision-language foundation model.
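As a concrete illustration of this self-describe step, the snippet below re-labels a recorded demonstration by scoring candidate instructions against its frames with an off-the-shelf CLIP model. This is a simplified sketch assuming frame-level image-text matching with a publicly available checkpoint; the actual foundation model and scoring scheme used in the paper may differ.

```python
# Hedged sketch: re-label a recorded demo via CLIP-style image-text retrieval.
# The CLIP checkpoint and the averaging over frames are illustrative assumptions.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_instruction(frames, candidate_instructions):
    """Return the candidate instruction that best matches the demo's frames.

    frames: list of PIL images (the visual observations of one recorded demo).
    Frame-level similarities are averaged as a simple rollout-level score.
    """
    inputs = processor(text=candidate_instructions, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (num_instructions, num_frames)
    scores = outputs.logits_per_text.mean(dim=1)
    return candidate_instructions[int(scores.argmax())]
```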
We carefully design a broad range of language-conditioned robotic adaptation experiments to evaluate policy adaptation across object compositions, tasks, and environments, including transfer from simulation to the real world.
Result
In this work, we propose a self-play and self-describe policy adaptation pipeline (SPLAYD), which leverages a pre-trained vision-language foundation model to automatically collect data for fine-tuning the policy on unseen tasks and environments.
We evaluate our method on a broad range of language-conditioned policy adaptation experiments, including compositional generalization, out-of-distribution generalization, and sim-to-real transfer, and show that it outperforms the baselines by a large margin.