Given robot images, RoVi-Aug uses state-of-the-art diffusion models to augment the data and generate synthetic images with different robots and viewpoints. A policy trained on the augmented dataset can be deployed on the target robots zero-shot or further fine-tuned, exhibiting robustness to camera pose changes. Credit: Chen et al.

New data augmentation algorithm could facilitate the transfer of skills across robots

by Tech Xplore

In recent years, roboticists have developed a wide range of systems designed to tackle various real-world tasks, ranging from completing household chores to delivering packages or finding target objects in delineated environments.

A key objective in the field has been to develop algorithms that allow the reliable transfer of specific skills across robots with different bodies and characteristics, which would help to rapidly train robots on new tasks, broadening their capabilities.

Researchers at UC Berkeley have developed RoVi-Aug, a new computational framework designed to augment robotic data and facilitate the transfer of skills across different robots. Their proposed approach, outlined in a paper pre-published on arXiv and set to be presented at the 2024 Conference on Robot Learning (CoRL), utilizes state-of-the-art generative models to augment image data and create synthesized visual task demonstrations with varying camera views for distinct robots.

"The success of modern machine learning systems, particularly generative models, demonstrates impressive generalizability and motivated robotics researchers to explore how to achieve similar generalizability in robotics," Lawrence Chen (Ph.D. Candidate, AUTOLab, EECS & IEOR, BAIR, UC Berkeley) and Chenfeng Xu (Ph.D. Candidate, Pallas Lab & MSC Lab, EECS & ME, BAIR, UC Berkeley), told Tech Xplore.

"We have been investigating the problem of cross-viewpoint and cross-robot generalization since the start of this year."

When conducting their previous research, Chen, Xu and their colleagues identified some of the challenges to the generalization of learning across different robots. Specifically, they found that when the scenes in robotics datasets are unevenly distributed, for instance containing far more of certain robot visuals and camera angles than others, the datasets become less effective for teaching different robots the same skills.

Interestingly, the researchers found that many existing robot training datasets are unbalanced, including some of the most well-established. For instance, even the Open-X Embodiment (OXE) dataset, which is widely used for training robotics algorithms and contains demonstrations of different robots completing varying tasks, over-represents some robots, such as the Franka and xArm manipulators.

"Such biases in the dataset make the robot policy model tend to overfit to specific robot types and viewpoints," said Chen and Xu.

"To mitigate this issue, in February 2024, we proposed a test-time adaptation algorithm, Mirage, that uses 'cross-painting' to transform an unseen target robot into the source robot seen during training, creating the illusion that the source robot is performing the task at test time."

Mirage, the algorithm that the researchers introduced in their previous paper, was found to achieve zero-shot transfer of skills to unseen target robots. Nonetheless, it comes with various limitations.

Firstly, to work well, Mirage requires precise robot models and camera matrices. In addition, the algorithm does not support fine-tuning of robot policies and is limited to images with only small changes in camera pose, as it is prone to errors when reprojecting image depth.

"In our latest work we present an alternative algorithm called RoVi-Aug," said Chen and Xu. "The aim of this algorithm is to overcome the limitations of Mirage by enhancing the robustness and generalizability of policies during training, focusing on handling diverse robot visuals and camera poses, rather than relying on the test-time cross-painting approach with stringent assumptions on the known camera poses and robot URDFs (unified robot description formats)."

RoVi-Aug, the new robot data augmentation framework introduced by the researchers, is based on state-of-the-art diffusion models. These are computational models that can augment images of a robot's trajectories, generating synthetic images showing different robots completing tasks, viewed from varying viewpoints.

Overview of the RoVi-Aug pipeline. Given an input robot image, we first segment the robot out using a fine-tuned SAM model, then use a ControlNet to transform the robot into another robot. After pasting the synthetic robot back into the background, we use ZeroNVS to generate novel views. Credit: Chen et al.
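The "pasting the synthetic robot back into the background" step described in the caption above amounts to a masked composite. The following is a minimal sketch of that one step, assuming the segmentation model has already produced a boolean robot mask and the image-to-image model has already rendered the target robot; the function name and toy data are hypothetical, not the authors' code.

```python
import numpy as np

def paste_synthetic_robot(frame, robot_mask, synthetic_robot):
    """Replace the original robot's pixels with the synthesized robot.

    frame:           (H, W, 3) array, the original camera image
    robot_mask:      (H, W) boolean array from a segmentation model such as SAM
    synthetic_robot: (H, W, 3) array from an image-to-image model such as ControlNet
    """
    out = frame.copy()
    out[robot_mask] = synthetic_robot[robot_mask]  # overwrite only the masked pixels
    return out

# Toy data standing in for real model outputs
rng = np.random.default_rng(0)
frame = rng.random((64, 64, 3))
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True  # pretend the robot occupies this region
synthetic = rng.random((64, 64, 3))
augmented = paste_synthetic_robot(frame, mask, synthetic)
```

The background pixels are left untouched, so the scene context the policy sees stays identical while only the robot's appearance changes.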

The researchers used their framework to compile a dataset containing a wide range of synthetic robot demonstrations and then trained robot policies on it. This in turn allows skills to transfer to new robots that were never exposed to the tasks in the demonstrations, a setting known as zero-shot learning.

Notably, the robot policies can also be fine-tuned to achieve increasingly better performance on a given task. In addition, unlike the Mirage model introduced in the team's previous paper, the new algorithm can handle drastic changes in camera angle.

"Unlike test-time adaptation methods like Mirage, RoVi-Aug doesn't require any extra processing during deployment, doesn't rely on knowing camera angles in advance, and supports policy fine-tuning," explained Chen and Xu. "It also goes beyond traditional co-training on multi-robot, multi-task datasets by actively encouraging the model to learn the full range of robots and skills across the datasets."

The RoVi-Aug model has two distinct components, namely the robot augmentation (Ro-Aug) and the viewpoint augmentation (Vi-Aug) modules. The first of these components is designed to synthesize demonstration data featuring different robotic systems, while the second can produce demonstrations viewed from different angles.

"Ro-Aug has two key features: a fine-tuned SAM model to segment the robot and a fine-tuned ControlNet to replace the original robot with a different one," said Chen and Xu. "Meanwhile, Vi-Aug leverages ZeroNVS, a state-of-the-art novel view synthesis model, to generate new perspectives of the scene, making the model adaptable to various camera viewpoints."

As part of their study, the researchers used their model to produce an augmented robot dataset and then tested the effectiveness of this dataset for training policies and transferring skills across different robots. Their initial findings highlight the potential of RoVi-Aug, as the algorithm was found to enable training of policies that generalize well across different robots and camera setups.

"Its key innovation lies in applying generative models—such as image-to-image generation and novel view synthesis—to the challenge of cross-embodiment robot learning," explained Chen and Xu.

"While previous work has used generative augmentation to improve policy robustness in the face of distracting objects and backgrounds, RoVi-Aug is the first to show how this approach can facilitate skill transfer between different robots."

This recent work by Chen and Xu could contribute to the advancement of robotics by helping researchers easily broaden their systems' skill sets. In the future, it could be used by other teams to transfer skills between different robots or develop more effective general-purpose robotic policies.

"For instance, imagine a scenario where a researcher has spent significant effort collecting data and training a policy on a Franka robot to perform a task, but you only have a UR5 robot," said Chen and Xu.

"RoVi-Aug allows you to repurpose the Franka data and deploy the policy on the UR5 robot without additional training. This is particularly useful because robot policies are often sensitive to camera viewpoint changes, and setting up identical camera angles across different robots is challenging. RoVi-Aug eliminates the need for such precise setups."

As collecting large amounts of robot demonstrations in the real world can be very expensive and time-consuming, RoVi-Aug could be a cost-effective alternative for easily compiling reliable robot training datasets.

While the images in these datasets would be synthetic (i.e., generated by AI), they could still prove useful for producing reliable robot policies. The researchers are currently working with colleagues at Toyota Research Labs and other institutes on applying and extending their approach to other robot datasets.

"We now aim to further refine RoVi-Aug by incorporating recent developments in generative modeling techniques, such as video generation in place of image generation," added Chen and Xu.

"We also plan to apply RoVi-Aug to existing datasets like the Open-X Embodiment (OXE) dataset, and we are excited about the potential to enhance the performance of generalist robot policies trained on this data. Expanding RoVi-Aug's capabilities could significantly improve the flexibility and robustness of these policies across a wider range of robots and tasks."

More information: Lawrence Yunliang Chen et al, RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning, arXiv (2024). DOI: 10.48550/arxiv.2409.03403
Journal information: arXiv