- Table of Contents
- Overview
- Updates
- Requirements
- Installation
- Model Checkpoints
- Running Inference for a Pre-Trained Model
- Fine-Tuning Models on Your Own Data
- Troubleshooting
RDT2, the sequel to RDT-1B, is the first foundation model that can achieve zero-shot deployment on unseen embodiments for simple open-vocabulary tasks like picking, placing, shaking, wiping, etc. This milestone was made possible by multifaceted efforts:
- We redesigned the UMI hardware by applying higher-strength materials and more precise tracking methods, ensuring its reliability for large-scale data collection.
- We collected 10,000+ hours of human manipulation videos in 100+ different indoor scenes, covering the majority of household tasks that a gripper can do.
Currently, this repo contains the following models:
- RDT2-VQ, an autoregressive vision-language-action (VLA) model that employs Residual VQ as the action tokenizer. It is adapted from Qwen2.5-VL-7B-Instruct with our UMI dataset, enabling superior zero-shot instruction-following capability.
- RDT2-FM, an improved RDT model serving as an action expert trained with a flow-matching objective, which runs with much lower inference latency.
For all models, we provide checkpoints and examples for using them out of the box or fine-tuning them on your own datasets. Currently, we have verified the efficacy of our models on platforms including Bimanual UR5e and Bimanual Franka Research 3, and we are optimistic that they can be deployed successfully on more platforms by following our guidelines.
- [Sept 2025] We released RDT2-VQ & RDT2-FM, the sequel to RDT-1B with better open-world generalization and zero-shot deployment on unseen embodiments.
To run the models in this repository, you will need an NVIDIA GPU with at least the following specifications. These estimates assume a single GPU, but you can also use multiple GPUs with model parallelism or offload weights to the CPU to reduce per-GPU memory (see the sketch after the table). Since RDT2 is based on Qwen2.5-VL-7B, you basically need to follow the hardware requirements for Qwen2.5-VL-7B:
Mode | RAM Required | VRAM Required | Example GPU |
---|---|---|---|
Inference | > 32 GB | ~ 16 GB | RTX 4090 |
Fine-Tuning RDT2-FM (RDT Expert) | - | ~ 16 GB | RTX 4090 |
Fine-Tuning RDT2-VQ (LoRA) | - | > 32 GB | A100 (40GB) |
Fine-Tuning RDT2-VQ (Full) | - | > 80 GB | A100 (80GB) / H100 / B200 |
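For example, if a single GPU cannot hold the full backbone in bfloat16, you can let Hugging Face Accelerate shard it across GPUs or spill to CPU RAM via device_map. This is a minimal sketch, not an officially verified recipe; the per-device memory limits are illustrative and should be tuned to your machine.

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Shard the RDT2-VQ backbone across visible GPUs and offload the remainder to CPU RAM.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "robotics-diffusion-transformer/RDT2-VQ",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "16GiB", "cpu": "48GiB"},  # illustrative limits, adjust to your hardware
).eval()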
As for zero-shot deployment, you need to purchase the designated end effector and camera, and 3D print the corresponding camera stand and flange according to Hardware Setup and Calibration.
The repo has been tested on Ubuntu 24.04; we do not currently support other operating systems.
Clone this repo and create a conda environment:
# clone the repo
git clone https://github.com/thu-ml/RDT2.git
cd RDT2
# create a conda environment
conda create -n rdt2 python=3.10 -y
conda activate rdt2
# install torch (cuda12.8)
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
# install flash-attn
pip install flash-attn --no-build-isolation
# install other dependencies
pip install -r requirements.txt
# upgrade nvidia-nccl-cu12
pip install --upgrade --force-reinstall nvidia-nccl-cu12==2.27.5
# Double-check that the correct transformers version (4.51.3) is installed
pip list | grep transformers
# to deploy on UR5e
pip install -r requirements/ur5e.txt
# to deploy on Franka Research 3
pip install -r requirements/franka_research_3.txt
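As a quick sanity check of the environment above (run inside the rdt2 conda env), you can verify the key packages from Python. The expected versions follow the pinned install commands above.

import torch
import transformers
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())  # expect 2.7.1 and True
print("transformers:", transformers.__version__)  # expect 4.51.3
print("flash-attn:", flash_attn.__version__)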
We provide multiple VLA model checkpoints for deployment on various robot platforms and simple open-vocabulary tasks. If you want to deploy on your own robot platform with other end effectors and cameras, you can fine-tune from the base model. To pre-download the checkpoints, see the sketch after the table.
Model | Use Case | Description | Checkpoint Path |
---|---|---|---|
normalizer | Inference & Fine-Tuning (Freeze) | Normalizer for action normalization | umi_normalizer_wo_downsample_indentity_rot.pt |
Residual VQ | Inference & Fine-Tuning (Freeze) | Residual VQ (RVQ) as the action tokenizer | robotics-diffusion-transformer/RVQActionTokenizer |
RDT2-VQ | Inference & Fine-Tuning | Auto-regressive VLA with Residual VQ as the action tokenizer | robotics-diffusion-transformer/RDT2-VQ |
RDT2-FM | Inference & Fine-Tuning | Auto-regressive VLA (RDT2-VQ) with Flow-Matching Action Expert | robotics-diffusion-transformer/RDT2-FM |
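If you prefer to pre-download the Hugging Face checkpoints above instead of relying on the on-demand cache, here is a minimal sketch with huggingface_hub. The local directory layout is an illustrative choice; the normalizer is downloaded separately from the URL given in the inference example below.

from huggingface_hub import snapshot_download

for repo_id in [
    "robotics-diffusion-transformer/RVQActionTokenizer",
    "robotics-diffusion-transformer/RDT2-VQ",
    "robotics-diffusion-transformer/RDT2-FM",
]:
    # download each checkpoint into a local folder named after the repo
    snapshot_download(repo_id=repo_id, local_dir=f"checkpoints/{repo_id.split('/')[-1]}")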
1. Acquire deployment hardware according to our Hardware Guide.
2. Set up robots.

2.1 Set up UR5e
- Obtain the IP address and update configs/robots/eval_bimanual_ur5e_config.yaml/robots/robot_ip.
- In Installation > Payload:
  - Set mass to 0.82 kg
  - Set Inertia Matrix to [0.001106, 0, 0, 0, 0.001106, 0, 0, 0, 0.001106]
- Set speed to 30% (recommended)

2.2 Set up Franka FR3
- Obtain the IP address and update configs/robots/eval_bimanual_fr3_config.yaml/robots/robot_ip.
- On the Franka interface website:
  - Set gripper mass to 1.9 kg
  - Set Inertia Tensor to [0.001, 0, 0, 0, 0.001, 0, 0, 0, 0.001]

3. Set up the camera.
- Download the SDK from the HikRobot website and install all the .deb files.
- Run cd /opt/MVS/bin && ./MVS.sh. Select your camera, and set Acquisition Control -> Exposure Time to 20000.

4. Calibrate your robot to the tracker's TCP space.
- Follow the Setup Instructions For Calibration in the Hardware Guide.
- Set up the Vive Tracker according to this tutorial: Software Setup Tutorial -> VIVE tracker setup.
- Run the following code to calibrate the robot's TCP space to the tracker's space.
- IMPORTANT: This script makes the robot perform small-amplitude sinusoidal motions; before running it, please ensure the robot is in a safe position and the workspace is free of obstacles.

# if using Franka Research 3
python deploy/calibration/calibrate_franka.py --franka_ip <your_franka_server_ip> --franka_port <your_franka_server_port>
# or, if using UR5e
python deploy/calibration/calibrate_ur5e.py --ur5e_ip <your_ur5e_ip>

- After calibration, run the following script to obtain the calibration matrix:

python deploy/calibration/compute_calibration_matrix.py

Then paste the calibration matrix into eval_bimanual_ur5e_config.yaml/tx_tracker_to_tcp (or eval_bimanual_fr3_config.yaml/tx_tracker_to_tcp if using FR3).
Our pre-trained model checkpoints can be run with a few lines of code (shown here for our RDT2-VQ model):
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer
from utils import batch_predict_action
# assuming we use GPU 0
device = "cuda:0"
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"robotics-diffusion-transformer/RDT2-VQ"
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=device
).eval()
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)
valid_action_id_length = (
vae.pos_id_len + vae.rot_id_len + vae.grip_id_len
)
# TODO: modify to your own downloaded normalizer path
# download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")
result = batch_predict_action(
model,
processor,
vae,
normalizer,
examples=[
{
"obs": {
# NOTE: following the setting of UMI, camera0_rgb for right arm, camera1_rgb for left arm
"camera0_rgb": ..., # right arm RGB image in np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8
"camera1_rgb": ..., # left arm RGB image in np.ndarray of shape (1, 384, 384, 3) with dtype=np.uint8
},
"meta": {
"num_camera": 2
}
},
..., # we support batch inference, so you can pass a list of examples
],
valid_action_id_length=valid_action_id_length,
apply_jpeg_compression=True,
# Since the model is trained mostly on JPEG images, we suggest keeping this on for better performance
instruction="Pick up the apple."
# We suggest instructions in the format "verb + object", with a capitalized first letter and a trailing period
)
# get the predict action from example 0
action_chunk = result["action_pred"][0] # torch.FloatTensor of shape (24, 20) with dtype=torch.float32
# action_chunk (T, D) with T=24, D=20
# T=24: our action_chunk predicts the future 0.8s in fps=30, i.e. 24 frames
# D=20: following the setting of UMI, we predict the action for both arms from right to left
# - [0-2]: RIGHT ARM end effector position in x, y, z (unit: m)
# - [3-8]: RIGHT ARM end effector rotation in 6D rotation representation
# - [9]: RIGHT ARM gripper width (unit: m)
# - [10-12]: LEFT ARM end effector position in x, y, z (unit: m)
# - [13-18]: LEFT ARM end effector rotation in 6D rotation representation
# - [19]: LEFT ARM gripper width (unit: m)
# rescale gripper width from [0, 0.088] to [0, 0.1]
for robot_idx in range(2):
action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
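The camera0_rgb / camera1_rgb placeholders above expect (1, 384, 384, 3) uint8 RGB arrays. Below is a minimal sketch of preparing such inputs, assuming frames are read from disk with OpenCV; the file paths are hypothetical, and during deployment the frames come from the wrist cameras instead.

import cv2
import numpy as np

def load_wrist_image(path: str) -> np.ndarray:
    # read BGR image, convert to RGB, resize to 384x384, and add a leading time axis
    bgr = cv2.imread(path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (384, 384), interpolation=cv2.INTER_AREA)
    return rgb[None].astype(np.uint8)  # (1, 384, 384, 3)

example = {
    "obs": {
        "camera0_rgb": load_wrist_image("right_wrist.jpg"),  # hypothetical path
        "camera1_rgb": load_wrist_image("left_wrist.jpg"),   # hypothetical path
    },
    "meta": {"num_camera": 2},
}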
You can also test RDT2-FM with the following code:
# Run under the root directory of our repo
import numpy as np
import torch
import yaml
from models.rdt_inferencer import RDTInferencer
with open("configs/rdt/post_train.yaml", "r") as f:
model_config = yaml.safe_load(f)
model = RDTInferencer(
config=model_config,
pretrained_path="robotics-diffusion-transformer/RDT2-FM",
# TODO: modify `normalizer_path` to your own downloaded normalizer path
# download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ", # use RDT2-VQ as the VLM backbone
device="cuda:0",
dtype=torch.bfloat16,
)
result = model.step(
observations={
'images': {
'left_stereo': ..., # left arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
'right_stereo': ..., # right arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
},
# currently, pass a zero vector as the state input;
# the input interface is preserved for future fine-tuning
'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32)
},
instruction="Pick up the apple." # Language instruction
# We suggest instructions in the format "verb + object", with a capitalized first letter and a trailing period
)
# relative action chunk in np.ndarray of shape (24, 20) with dtype=np.float32
# with the same format as RDT2-VQ
action_chunk = result.detach().cpu().numpy()
# rescale gripper width from [0, 0.088] to [0, 0.1]
for robot_idx in range(2):
action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
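The 6D rotation entries in the 20-dimensional action can be converted back into rotation matrices with the standard Gram-Schmidt construction. This is a minimal sketch; whether the two 3-vectors correspond to rows or columns of the rotation matrix is an assumption here, so verify the exact convention against the repo's utilities.

import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    # Gram-Schmidt: orthonormalize the two 3-vectors, then complete the basis with a cross product
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=0)  # assumed row-stacked; check the repo's convention

# example: rotation of the right arm at the first predicted frame
R_right = rot6d_to_matrix(action_chunk[0, 3:9])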
We provide detailed step-by-step examples for running inference with our pre-trained checkpoints on the Bimanual UR5e and Bimanual Franka Research 3 robots.
IMPORTANT: If the inference success rate is still low after checking all the settings, configurations, and calibrations, you can refer to the deployment tips for help.
As a running example of how to fine-tune a base model on your own data, we will fine-tune the RDT2 models on the example dataset collected with Bimanual UR5e. We will explain three steps:
- Convert your data to WebDataset shards (which we use for training because of their highly efficient IO)
- Define training configs
- Run training
You should convert your data into processed WebDataset shards with the following structure:
shard-000000.tar
├── 0.image.jpg # Binocular (left wrist camera + right wrist camera) RGB image in np.ndarray of shape (384, 768, 3) with dtype=np.uint8
├── 0.action.npy # Relative action chunk in np.ndarray of shape (24, 20) with dtype=np.float32
├── 0.action_token.npy # Corresponding action token in np.ndarray of shape (27,) ranging from 0 to 1024 with dtype=np.int16
├── 0.meta.json # Meta data including key `sub_task_instruction_key` to index the corresponding instruction from `instructions.json`
├── 1.image.jpg
├── 1.action.npy
├── 1.action_token.npy
├── 1.meta.json
├── ...
shard-000001.tar
shard-000002.tar
...
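Below is a minimal sketch of writing shards in this layout with the webdataset package. How the action tokens are produced by the Residual VQ tokenizer is not shown here (see the repo's data-conversion utilities), iter_samples() is a hypothetical iterator over your own processed frames, and the shard size is illustrative.

import io
import json
import numpy as np
import webdataset as wds

def npy_bytes(arr: np.ndarray) -> bytes:
    # serialize an array exactly as np.save would write it to disk
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()

with wds.ShardWriter("shards/shard-%06d.tar", maxcount=1000) as sink:
    for idx, sample in enumerate(iter_samples()):  # hypothetical iterator over your frames
        sink.write({
            "__key__": str(idx),
            "image.jpg": sample["jpeg_bytes"],  # JPEG-encoded (384, 768, 3) binocular image
            "action.npy": npy_bytes(sample["action"].astype(np.float32)),            # (24, 20)
            "action_token.npy": npy_bytes(sample["action_token"].astype(np.int16)),  # (27,)
            "meta.json": json.dumps(
                {"sub_task_instruction_key": sample["instruction_key"]}
            ).encode("utf-8"),
        })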
Moreover, we provide processed example data collected with Bimanual UR5e on Hugging Face. You can download it and use it directly.
Define your dataset config following the format in configs/datasets/example.yaml:
# Define your dataset name here
name: <your_dataset_name> # e.g. bimanual/ur_example
type: single
shards_dir: <your_shards_dir> # e.g. /ssd/rdt2/bimanual_fold_cloth/shards
kwargs:
instruction_path: <your_instruction_path> # e.g. /ssd/rdt2/ur_example/instruction.json
normalizer_path: <your_normalizer_path> # e.g. /ssd/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
For the provided example data, the corresponding config is configs/datasets/example.yaml. Remember to replace <root_dir> and <path_to_normalizer> with your own download paths.
Currently, we support the following fine-tuning methods:
- DeepSpeed training
- LoRA (low-rank adaptation) training
Since RDT2-VQ is based on Qwen2.5-VL, you are free to apply other techniques (e.g., FSDP, quantization) by following Qwen2.5-VL's fine-tuning practices. We provide example fine-tuning scripts for full-parameter and LoRA fine-tuning, which you can use directly to kick off your own training.
To provide a better understanding, we give a line-by-line explanation of the full-parameter fine-tuning script (scripts/finetune_full_param.sh) with our example data:
# Define your env settings here
# e.g., nccl, network, proxy, etc.
TASK="bimanual-ur5e-example" # Define your task name here
DATASET_CONFIG_PATH="configs/datasets/example.yaml" # Define your dataset config path here
export TOKENIZER_ID="Qwen/Qwen2.5-VL-7B-Instruct"
export VAE_ID="robotics-diffusion-transformer/RVQActionTokenizer"
export MODEL_ID="robotics-diffusion-transformer/RDT2-VQ"
export OUTPUT_DIR="outputs/vqvla-sft-${TASK}" # Define your output directory here
if [ ! -d "$OUTPUT_DIR" ]; then
mkdir -p "$OUTPUT_DIR"
echo "Folder '$OUTPUT_DIR' created"
else
echo "Folder '$OUTPUT_DIR' already exists"
fi
accelerate launch main.py \
--deepspeed="scripts/zero1.json" \ # DeepSpeed config file; you can modify it to use other sharding strategies
--tokenizer_name=$TOKENIZER_ID \
--vae_name=$VAE_ID \
--pretrained_model_name_or_path=$MODEL_ID \
--output_dir=$OUTPUT_DIR \
--train_batch_size=64 \
--eval_batch_size=32 \
--max_train_steps=10000 \ # We suggest training for fewer than 5 epochs to avoid overfitting,
# you should estimate the number of steps for your data and set it accordingly
--eval_strategy="no" \
--logging_steps=25 \
--checkpoints_total_limit=20 \
--checkpointing_step=1000 \
--lr_scheduler="cosine" \
--learning_rate=1e-5 \
--mixed_precision="bf16" \
--dataloader_num_workers=16 \
--gradient_checkpointing \
--log_level="info" \
--report_to="wandb" \
--lr_warmup_steps=500 \
--dataset=$DATASET_CONFIG_PATH \
--image_corruption \ # We suggest keeping this on for better visual robustness
--use_default_collate_fn_for_eval
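To set --max_train_steps, a back-of-the-envelope estimate like the one below can help. All numbers are illustrative, and whether --train_batch_size is interpreted per device or globally depends on the training script, so adjust accordingly.

num_samples = 50_000        # frames in your converted shards (illustrative)
batch_size = 64             # matches --train_batch_size above
num_gpus = 8                # GPUs used by accelerate launch (illustrative)
target_epochs = 3           # the script above suggests staying under ~5 epochs

steps_per_epoch = num_samples // (batch_size * num_gpus)
max_train_steps = steps_per_epoch * target_epochs
print(max_train_steps)  # ~291 for the numbers above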
Although our RVQ demonstrates strong generalization across both hand-held gripper data and real-robot data, if you want to fine-tune on your own data with our Residual VQ as the action tokenizer, we strongly suggest that you first check that the statistics of your data fall within the bounds of our Residual VQ, and then test the reconstruction error on your data.
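Here is a hedged sketch of that check, reusing the checkpoints from the inference example above. The normalize / encode / decode method names are hypothetical placeholders, so consult the vqvae module and models/normalizer.py for the actual API.

import torch
from vqvae import MultiVQVAE
from models.normalizer import LinearNormalizer

device = "cuda:0"
vae = MultiVQVAE.from_pretrained("robotics-diffusion-transformer/RVQActionTokenizer").eval()
vae = vae.to(device=device, dtype=torch.float32)
normalizer = LinearNormalizer.from_pretrained("umi_normalizer_wo_downsample_indentity_rot.pt")

actions = ...  # your relative action chunks as a float32 torch.Tensor of shape (N, 24, 20)

with torch.no_grad():
    normed = normalizer.normalize(actions)              # hypothetical method name
    recon = vae.decode(vae.encode(normed.to(device)))   # hypothetical method names
    mse = ((recon.cpu() - normed) ** 2).mean(dim=(0, 1))
    print("per-dimension reconstruction MSE:", mse)
    # also check that the normalized statistics stay within the tokenizer's expected range
    print("per-dimension min/max:", normed.amin(dim=(0, 1)), normed.amax(dim=(0, 1)))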
Currently, we support fine-tuning RDT2-FM's action expert with DeepSpeed. We provide an example fine-tuning script for full-parameter action-expert fine-tuning. After specifying your own dataset config path and replacing the <repository-path> in the full-parameter action-expert script with your own repository path, you can run the script directly to kick off training.
Different models have their specific precision settings:
- Action Tokenizer (Residual VQ): Since the Residual VQ is very small, we use float32 for both training and inference.
- RDT VLM (RDT2-VQ): Uses full bfloat16 (default) following Qwen2.5-VL. You can follow the practice for Qwen2.5-VL to adjust the precision by applying techniques like mixed precision or quantization.
- RDT Action Expert (RDT2-FM): Uses full bfloat16 for both training and inference.
We will collect common issues and their solutions here. If you encounter an issue, please check here first. If you can't find a solution, please file an issue on the repo (see here for guidelines).
Issue | Resolution |
---|---|
🚧 In progress 🚧 | 🚧 In progress 🚧 |