Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present XR-1, a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. At its core, XR-1 introduces the Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 12,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as π0 and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes.
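To make the UVMC idea concrete, the following is a minimal, hypothetical PyTorch sketch of a dual-branch VQ-VAE: one branch quantizes visual dynamics (features of the current and a future observation), the other quantizes robot motion, and the two discrete codes are fused into a unified latent. All module names, dimensions, loss weights, and the concatenation-based fusion are illustrative assumptions, not the actual XR-1 architecture.

```python
# Hypothetical sketch of a dual-branch VQ-VAE for UVMC-style codes.
# Names and dimensions are illustrative assumptions, not the released XR-1 code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients."""

    def __init__(self, num_codes=512, code_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                                # z: (B, code_dim)
        d = torch.cdist(z, self.codebook.weight)         # distances to all codes
        idx = d.argmin(dim=-1)                           # discrete code indices
        z_q = self.codebook(idx)
        # codebook + commitment losses (standard VQ-VAE objective)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                     # straight-through estimator
        return z_q, idx, loss


class DualBranchVQVAE(nn.Module):
    """Toy dual-branch VQ-VAE: one branch for visual dynamics, one for robot motion."""

    def __init__(self, obs_dim=1024, act_dim=14, code_dim=256):
        super().__init__()
        self.vision_enc = nn.Sequential(nn.Linear(2 * obs_dim, 512), nn.GELU(),
                                        nn.Linear(512, code_dim))
        self.motion_enc = nn.Sequential(nn.Linear(act_dim, 512), nn.GELU(),
                                        nn.Linear(512, code_dim))
        self.vq_vision = VectorQuantizer(code_dim=code_dim)
        self.vq_motion = VectorQuantizer(code_dim=code_dim)
        self.vision_dec = nn.Linear(2 * code_dim, obs_dim)   # predict future features
        self.motion_dec = nn.Linear(2 * code_dim, act_dim)   # reconstruct actions

    def forward(self, obs_t, obs_tk, action):
        # Visual-dynamics branch: encode the change between the observation
        # features at time t and at a future time t+k.
        z_v, _, loss_v = self.vq_vision(self.vision_enc(torch.cat([obs_t, obs_tk], -1)))
        # Motion branch: encode the robot motion (for action-free human videos
        # this branch would be masked; that handling is omitted here).
        z_m, _, loss_m = self.vq_motion(self.motion_enc(action))
        uvmc = torch.cat([z_v, z_m], dim=-1)  # unified vision-motion code (fusion by concat)
        recon = F.mse_loss(self.vision_dec(uvmc), obs_tk) + \
                F.mse_loss(self.motion_dec(uvmc), action)
        return uvmc, recon + loss_v + loss_m
```

In the actual framework the encoders would operate on image and trajectory sequences rather than single feature vectors; the sketch only illustrates the dual-branch quantization and joint reconstruction objective.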
Overview of XR-1. In XR-1, we introduce the Unified Vision-Motion Codes (UVMC), a discrete latent representation that jointly encodes visual dynamics and robotic motion. XR-1 adopts a three-stage training paradigm to enable precise low-level control across diverse robots and tasks.
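As a rough outline, the three stages could be orchestrated as sketched below, reusing the DualBranchVQVAE sketch above. The loader keys, the policy interface (policy(obs, instruction) returning an action and a UVMC prediction), and the use of UVMC as an auxiliary regression target are assumptions for illustration; the actual guidance mechanism in XR-1 may differ.

```python
# Hypothetical outline of the three-stage training recipe described above.
import torch
import torch.nn.functional as F


def uvmc_guided_step(policy, vqvae, batch, optimizer):
    """One policy update where the frozen VQ-VAE provides UVMC targets (assumed interface)."""
    with torch.no_grad():
        uvmc, _ = vqvae(batch["obs_t"], batch["obs_tk"], batch["action"])
    pred_action, pred_uvmc = policy(batch["obs_t"], batch["instruction"])
    loss = F.mse_loss(pred_action, batch["action"]) + F.mse_loss(pred_uvmc, uvmc)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def train_xr1(vqvae, policy, loaders, optim_vq, optim_policy):
    # Stage 1: self-supervised UVMC learning on heterogeneous data
    # (robot trajectories; masking for action-free human videos omitted).
    for batch in loaders["stage1"]:
        _, loss = vqvae(batch["obs_t"], batch["obs_tk"], batch["action"])
        optim_vq.zero_grad()
        loss.backward()
        optim_vq.step()

    vqvae.eval()  # UVMC codes are frozen from here on

    # Stage 2: UVMC-guided pretraining on cross-embodiment robot datasets.
    for batch in loaders["stage2"]:
        uvmc_guided_step(policy, vqvae, batch, optim_policy)

    # Stage 3: task-specific post-training on the target embodiment and tasks.
    for batch in loaders["stage3"]:
        uvmc_guided_step(policy, vqvae, batch, optim_policy)
```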
Experimental Setup. We evaluate XR-1 across six robot embodiments (Tien Kung 1.0/2.0, Single-/Dual-Arm UR-5e, Dual-Arm Franka, and AgileX Cobot Magic 2.0), covering more than 120 manipulation tasks with over 12,000 rollouts.
Representative task videos (DUR = Dual-Arm UR-5e, TK2 = Tien Kung 2.0, TK1 = Tien Kung 1.0, DFR = Dual-Arm Franka, AGX = AgileX Cobot Magic 2.0, SUR = Single-Arm UR-5e): DUR-StackBowls, DUR-SweepTrash, DUR-FindTapeBasket; TK2-CloseDoorKnob, TK2-CollectScrews, TK2-TakeBasketTea; TK1-PlaceFlipButton, TK1-PickWipeTowel, TK1-MoveChopstickCup; DFR-StackBowls, DFR-SweepRubbish, DFR-TransferCup; AGX-MeshStackCup, AGX-HangScissors, AGX-PlaceScrewdriver; SUR-PackEggBox, SUR-PourTubeBeaker, SUR-StackCubes.
Generalization test videos, two task groups: (1) Base, Unseen Dustpan, Unseen Rub, Unseen Rub & Dynamic Interference; (2) Base, Unseen Background, Unseen Cup & Static Interference, Unseen Cup & Background & Light.
We conducted evaluations on bimanual collaboration, dexterous manipulation, fluid/deformable object handling, contact-rich interactions, and dynamic environments, comparing XR-1 against baseline methods including RDT, π0, GR00T-N1.5, and UniVLA. The strongest baseline (π0) is used in the comparative demonstration videos shown below:
Paired demonstration videos: Baseline (π0) vs. Ours (XR-1), one pair for each of the six evaluation categories above.
We conducted few-shot learning experiments on the Dual-Arm UR-5e and Tien Kung 1.0 robots, comparing against the single-task baselines ACT and DP. The results and comparative videos are presented below:
Few-shot comparison videos: ACT, DP, Ours (XR-1).
Main experiment: six embodiments with 20 tasks each, compared against the RDT, π0, GR00T-N1.5, and UniVLA baselines.
Per-embodiment result charts: Dual-Arm UR-5e, Tien Kung 2.0, Tien Kung 1.0, Dual-Arm Franka, AgileX Cobot Magic V2.0, Single-Arm UR-5e.