One Policy to Run Them All: an End-to-end Learning
Approach to Multi-Embodiment Locomotion
1Technical University of Darmstadt
2Poznan University of Technology
Conference on Robot Learning (CoRL) 2024
TLDR: We propose the Unified Robot Morphology Architecture (URMA), capable of learning a single general locomotion policy for any legged robot embodiment and morphology.
Interactive Simulation
Test our single multi-embodiment policy trained with URMA interactively in your browser. We provide 9 out of the 16 robots from the training set for you to try out. Videos of all robots and the real-world deployment can be found below.
Unified Robot Morphology Architecture
To handle observations of any morphology, URMA splits observations into robot-specific and general parts. The robot-specific observations are the joint (and foot) observations: they share the same structure but their number varies from robot to robot. Since standard neural networks require fixed-length input vectors, we need a mechanism that can route an arbitrary set of joint observations into a single latent vector that holds the information of all joints. Similar joints from different robots should map to similar regions of this latent vector.
To perform this routing, we need a "language" that describes any given joint, so that the network can figure out where to place each joint observation in the latent vector. URMA uses joint description vectors, composed of multiple characteristic joint properties, to describe any given joint.
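As a minimal sketch of what such a description vector could look like: the properties below (joint position in the base frame, rotation axis, and position/velocity/torque limits) are illustrative assumptions, not necessarily the exact set used in the paper.

```python
import numpy as np

# Hypothetical joint properties; the paper's exact set may differ.
def joint_description(pos, axis, pos_limit, vel_limit, torque_limit):
    """Stack characteristic joint properties into one fixed-length vector."""
    return np.concatenate([pos, axis, [pos_limit, vel_limit, torque_limit]])

knee = joint_description(pos=np.array([0.2, -0.1, -0.3]),   # position in base frame
                         axis=np.array([0.0, 1.0, 0.0]),    # rotation axis
                         pos_limit=2.6, vel_limit=21.0, torque_limit=33.5)
# knee is a 9-dimensional vector describing this joint
```

The key point is that every joint of every robot gets a vector of the same length, so the network can compare joints across embodiments.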
Attention Encoder
In practice, URMA implements the observation routing with a simple attention encoder where the joint description vectors act as the keys and the joint observations as the values in the attention mechanism.
Core Network
The same attention encoding is used for the foot observations, and the resulting joint and foot latent vectors are concatenated with the general observations and passed to the policy's core network.
Universal Decoder
Finally, our universal morphology decoder takes the output of the core network and pairs it with the batch of joint descriptions and single-joint latents to produce the final action for every joint.
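The decoder step can be sketched as follows. Again this is a simplified stand-in with random weights and hypothetical dimensions: the shared core output is paired with each joint's description and latent, and a small per-joint network (applied identically to every joint) emits one action per joint, for any number of joints.

```python
import numpy as np

rng = np.random.default_rng(1)

D_CORE, D_DESC, D_JOINT, HIDDEN = 32, 9, 16, 64  # hypothetical sizes
W1 = rng.normal(size=(D_CORE + D_DESC + D_JOINT, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, 1)) * 0.1

def decode(core_out, descriptions, joint_latents):
    """Produce one action per joint, for any number of joints J."""
    J = descriptions.shape[0]
    # Pair the shared core output with each joint's description and latent
    inp = np.concatenate([np.repeat(core_out[None, :], J, axis=0),
                          descriptions, joint_latents], axis=1)   # (J, .)
    return np.tanh(inp @ W1) @ W2                 # (J, 1): one action per joint

actions = decode(rng.normal(size=D_CORE),
                 rng.normal(size=(5, D_DESC)),
                 rng.normal(size=(5, D_JOINT)))
# actions.shape == (5, 1): five joints in, five actions out
```

Because the same weights are applied to every (core output, description, latent) triple, the decoder is agnostic to how many joints a robot has.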
Training in Simulation
The URMA policy is trained in simulation on 16 robots simultaneously: 9 quadrupeds, 6 bipeds/humanoids, and 1 hexapod. The policy is trained with Proximal Policy Optimization (PPO), implemented in RL-X, for 100M simulation steps per robot. In simulation, the trained policy outperforms classic multi-task RL approaches and shows strong robustness and zero-shot capabilities.
Deployment in the Real World
After training in simulation, the policy is deployed on two quadruped robots from the training set in the real world. Extensive domain randomization during training allows the policy to transfer directly to real robots without any further adaptation.
Deployment on Unseen Robots
Thanks to the large variety of robots in the training set, the randomization of their properties, and the morphology-agnostic URMA architecture, the policy can generalize to robots never seen during training.
Acknowledgments
This project was funded by National Science Centre, Poland under the OPUS call in the Weave programme UMO-2021/43/I/ST6/02711, and by the German Science Foundation (DFG) under grant number PE 2315/17-1. Part of the calculations were conducted on the Lichtenberg high performance computer at TU Darmstadt.
Citation
@article{bohlinger2024onepolicy,
title={One Policy to Run Them All: an End-to-end Learning Approach to Multi-Embodiment Locomotion},
author={Bohlinger, Nico and Czechmanowski, Grzegorz and Krupka, Maciej and Kicki, Piotr and Walas, Krzysztof and Peters, Jan and Tateo, Davide},
journal={Conference on Robot Learning},
year={2024}
}
This website was inspired by Kevin Zakka's and Brent Yi's project websites.
Code
Paper
Video
Research Talk