XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

We introduce XR-1, a versatile and scalable vision-language-action framework. XR-1 supports robust multi-task learning across diverse robot embodiments and environments.

Abstract

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multimodal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. At its core, XR-1 introduces the Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To exploit UVMC effectively, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments, with more than 14,000 rollouts on six robot embodiments spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $\pi_{0.5}$, $\pi_0$, RDT, UniVLA, and GR00T-N1.5, while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes.



Overview

Overview of XR-1. In XR-1, we introduce the Unified Vision-Motion Codes (UVMC), a discrete latent representation that jointly encodes visual dynamics and robotic motion. XR-1 adopts a three-stage training paradigm to enable precise low-level control across diverse robots and tasks.



Stage-1: Learning Unified Vision-Motion Codes

In Stage-1, we learn the Unified Vision-Motion Codes (UVMC) through a dual-branch VQ-VAE architecture. The visual encoder and motion encoder jointly encode visual observations and action sequences into a shared discrete codebook. This unified representation bridges the semantic gap between human demonstrations and robot actions, enabling effective knowledge transfer across diverse robot embodiments and environments.
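The shared-codebook quantization at the heart of this dual-branch design can be sketched as follows. This is a minimal NumPy illustration under our own assumptions, not the paper's implementation: the visual and motion encoders are replaced by random latents, and VQ-VAE training details (straight-through estimator, commitment loss) are omitted.

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbour vector quantization: map each continuous latent
    to its closest codebook entry under squared L2 distance."""
    # latents: (N, D); codebook: (K, D)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)        # discrete code ids (UVMC-style tokens)
    return indices, codebook[indices]     # ids and their quantized latents

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))       # illustrative: K=64 shared codes, dim 8
vision_latents = rng.normal(size=(5, 8))  # stand-in for visual-dynamics encoder output
motion_latents = rng.normal(size=(5, 8))  # stand-in for motion encoder output

v_idx, v_quant = quantize(vision_latents, codebook)
m_idx, m_quant = quantize(motion_latents, codebook)
```

Because both branches quantize against the same codebook, visual dynamics and robot motion land in a common discrete vocabulary, which is what lets heterogeneous sources (human video, different robots) share representation.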

Stage-2: UVMC-Guided Pretraining for Generalist Policy

In Stage 2, after extracting the Unified Vision-Motion Codes (UVMC) from the dual-branch VQ-VAE as supervision signals, we integrate this representation into policy learning to enhance low-level action prediction. We pretrain the policy on XR-D, our large-scale cross-embodiment robotic dataset (158k trajectories, 69.1M frames).
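One plausible form of this UVMC-guided supervision is an auxiliary code-prediction term added to the action loss. The sketch below is an assumption for illustration (the page does not give the exact loss); the weight `lam`, the 64-code vocabulary, and the 7-DoF action shape are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """Mean cross-entropy between code logits (N, K) and target code ids (N,)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def uvmc_guided_loss(pred_actions, true_actions, code_logits, uvmc_targets, lam=0.1):
    """Action regression loss plus an auxiliary UVMC code-prediction term."""
    action_loss = ((pred_actions - true_actions) ** 2).mean()
    code_loss = softmax_cross_entropy(code_logits, uvmc_targets)
    return action_loss + lam * code_loss

rng = np.random.default_rng(0)
loss = uvmc_guided_loss(
    pred_actions=rng.normal(size=(4, 7)),      # e.g. 7-DoF actions (illustrative)
    true_actions=rng.normal(size=(4, 7)),
    code_logits=rng.normal(size=(4, 64)),      # policy's logits over 64 UVMC codes
    uvmc_targets=rng.integers(0, 64, size=4),  # codes from the frozen Stage-1 VQ-VAE
)
```

The auxiliary term forces the policy to explain the same discrete vision-motion dynamics the VQ-VAE extracted, so UVMC acts as an intermediate target between observations and actions.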



Stage-3: Post-Training for Deployment

In Stage 3, after large-scale UVMC-guided pretraining has equipped the model with strong capabilities for extracting unified vision-motion knowledge and generating precise low-level actions, we perform task-specific post-training to adapt the policy to each target embodiment and task for deployment.



Experiment Setup

Experimental Setup. We evaluate XR-1 across six robot embodiments (Tien Kung 1.0/2.0, Single-/Dual-Arm UR-5e, Dual-Arm Franka, and AgileX Cobot Magic 2.0), covering more than 120 manipulation tasks with over 14,000 rollouts.



XR-1 Inference Video Samples

Dual-Arm UR-5e

DUR-StackBowls

DUR-SweepTrash

DUR-FindTapeBasket

Tien Kung 2.0

TK2-CloseDoorKnob

TK2-CollectScrews

TK2-TakeBasketTea

Tien Kung 1.0

TK1-PlaceFlipButton

TK1-PickWipeTowel

TK1-MoveChopstickCup

Dual-Arm Franka

DFR-StackBowls

DFR-SweepRubbish

DFR-TransferCup

AgileX Cobot Magic V2.0

AGX-MeshStackCup

AGX-HangScissors

AGX-PlaceScrewdriver

Single-Arm UR-5e

SUR-PackEggBox

SUR-PourTubeBeaker

SUR-StackCubes



Generalization Setup

DFR-SweepTrash

Base

Unseen Dustpan

Unseen Rub

Unseen Rub & Dynamic Interference

DFR-TransferCup

Base

Unseen Background

Unseen Cup & Static Interference

Unseen Cup & Background & Light



Comparison with Baselines

Representative Tasks Comparison

We conducted evaluations on bimanual collaboration, dexterous manipulation, fluid/deformable object handling, contact-rich interactions, and dynamic environments, comparing XR-1 against the baselines RDT, π0.5, π0, GR00T-N1.5, and UniVLA. The videos below show the strongest baseline (π0.5) alongside XR-1:

Bimanual Collaboration: DUR-TransCupHolder

Baseline

Ours

Dexterous Manipulation: DUR-CloseDoorKnob

Baseline

Ours

Fluid Object Handling: SUR-PourTubeBeaker

Baseline

Ours

Deformable Object Handling: DFR-HangTowelRack

Baseline

Ours

Contact-Rich Interactions: DFR-SweepRubbish

Baseline

Ours

Dynamic Environments: DUR-TransButtons

Baseline

Ours

Few-shot Comparison

We conducted few-shot learning experiments on the Dual-Arm UR-5e and Tien Kung 2.0 robots, comparing against the single-task baselines ACT and Diffusion Policy (DP). The results are shown below:

TK2-PlaceCircuit

ACT

DP

Ours



Experiment Results

Main experiment: six embodiments with 20 tasks each, comparing XR-1 against the RDT, π0.5, π0, GR00T-N1.5, and UniVLA baselines.

Dual-Arm UR-5e results

Tien Kung 2.0 results

Tien Kung 1.0 results

Dual-Arm Franka results

AgileX Cobot Magic V2.0 results

Single-Arm UR-5e results



Training Dynamics of UVMC and Limitations

Description: We provide pre-training logs to demonstrate the robustness of our unified representation.

Stable convergence: both the vision reconstruction loss and the action reconstruction loss decrease rapidly early in training and then converge to a stable plateau. Despite the high diversity of the heterogeneous pre-training data (Ego4D, OXE, RoboMIND, and XR-D), this steady convergence indicates that UVMC effectively captures high-level dynamics.

Codebook vitality: we further monitor the Active Code Number, defined as the number of unique codes activated within each batch on each GPU. The vision codes stabilize at approximately 30, while the motion codes reach a healthy plateau of approximately 50. This shows that pre-training avoids codebook collapse and preserves strong representational capacity; the absence of sudden drops in code usage further confirms that the discrete latent space is consistently utilized to encode complex cross-modal features.
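The Active Code Number metric can be computed directly from the code indices emitted for a batch. A minimal sketch (batch size and codebook size are illustrative, not the training configuration):

```python
import numpy as np

def active_code_number(code_indices):
    """Count the unique codebook entries activated in one batch
    (logged per batch on each GPU in the curves above)."""
    return int(np.unique(code_indices).size)

rng = np.random.default_rng(0)
# e.g. a batch of 256 quantized tokens drawn from a 64-entry codebook
batch_codes = rng.integers(0, 64, size=256)
n_active = active_code_number(batch_codes)
```

A collapsed codebook would show this count dropping toward 1, which is exactly the failure mode the monitoring rules out.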

Limitations: One limitation of UVMC is that it uses images sampled every k frames, which may miss intermediate process information. Using all consecutive frames could provide richer vision-dynamic features, but would also introduce substantial computational overhead due to the large number of visual tokens. A promising future direction is therefore to learn finer-grained latent features while selectively retaining the most informative tokens and removing redundant ones.
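The computational trade-off behind sampling every k frames can be made concrete with a simple token count. The clip length and tokens-per-frame below are illustrative assumptions, not the paper's values:

```python
def visual_token_count(num_frames, tokens_per_frame, k=1):
    """Number of visual tokens when keeping every k-th frame of a clip."""
    kept_frames = len(range(0, num_frames, k))  # frames 0, k, 2k, ...
    return kept_frames * tokens_per_frame

dense = visual_token_count(64, 256)        # all consecutive frames
sparse = visual_token_count(64, 256, k=8)  # every 8th frame, as UVMC subsamples
```

With these illustrative numbers, dense encoding costs 8x more visual tokens than every-8th-frame sampling, which is the overhead that motivates subsampling despite the lost intermediate dynamics.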

Vision Reconstruction Loss

Action Reconstruction Loss

Vision code number (per batch per GPU)

Motion code number (per batch per GPU)



Visualization of Future-Frame Prediction

Description: To validate that the joint representation remains effective after fine-tuning, we provide qualitative future-frame reconstructions for tasks such as DUR-FindTapeBasket and DUR-SweepTrash on our project website. The key result is that the predicted UVMC tokens are essential for reconstruction quality. Our UVMC decoder is intentionally 4x smaller than the encoder (Table 5), so successful reconstruction cannot rely on decoder memorization and must instead come from the information carried by the vision-motion codes. We therefore compare reconstruction using predicted vision codes, zeroed codes, and randomized codes. We find that zeroed or randomized codes completely fail to recover scene dynamics, while the predicted UVMC tokens enable the lightweight decoder to reconstruct high-fidelity future states. This directly confirms that the predicted tokens preserve the joint visual-dynamic information required for sequential manipulation.
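The predicted-vs-zeroed-vs-randomized ablation can be sketched in miniature. This toy version is our own construction, not the paper's decoder: a linear map stands in for the lightweight UVMC decoder, and reconstruction error is measured against a feature target rather than pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 8))   # toy UVMC codebook (32 codes, dim 8)
decoder_w = rng.normal(size=(8, 16))  # stand-in for the lightweight decoder

def decode(codes):
    """Toy linear 'decoder': map code vectors to a future-frame feature."""
    return codes @ decoder_w

true_idx = rng.integers(0, 32, size=10)  # codes the VQ-VAE would assign
target = decode(codebook[true_idx])      # ground-truth future-frame feature

def recon_error(codes):
    return float(((decode(codes) - target) ** 2).mean())

err_pred = recon_error(codebook[true_idx])                       # predicted codes
err_zero = recon_error(np.zeros_like(codebook[true_idx]))        # zeroed codes
err_rand = recon_error(codebook[rng.integers(0, 32, size=10)])   # randomized codes
```

Even in this toy setting the pattern in the figures holds: only the correct codes let a small decoder recover the target, so reconstruction quality is evidence about the codes rather than the decoder.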

DUR-FindTapeBasket

source frame

future frame

reconstruction frame (vision code)

reconstruction frame (zero code)

reconstruction frame (random code)

DUR-SweepTrash

source frame

future frame

reconstruction frame (vision code)

reconstruction frame (zero code)

reconstruction frame (random code)