
TesserAct: Learning 4D Embodied World Models

ICCV 2025

Haoyu Zhen*, Qiao Sun*, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan

Paper PDF | Project Page | Hugging Face Model

We propose TesserAct, the first open-source and generalized 4D World Model for robotics, which takes input images and text instructions to generate RGB, depth, and normal videos, reconstructing a 4D scene and predicting actions.



Table of Contents
  1. Installation
  2. Data Preparation
  3. Training
  4. Inference
  5. Citation
  6. Acknowledgements

News

  • [2025-06-25] TesserAct is accepted to ICCV 2025!
  • [2025-06-19] We provide an efficient RGB+Depth+Normal LoRA fine-tuning script for custom datasets.
  • [2025-06-18] We provide an RGB-only LoRA inference script that achieves the best generalization for robotics video generation.
  • [2025-06-06] We have released the training code and data generation scripts!
  • [2025-05-05] We have updated the gallery and added more results on the project website.
  • [2025-05-04] We added USAGE.MD with more details about the models and how to use them on your own data!
  • [2025-04-29] We have released the inference code and TesserAct-v0.1 model weights!

Installation

Create a conda environment and install the required packages:

conda create -n tesseract python=3.9
conda activate tesseract

git clone https://github.com/UMass-Embodied-AGI/TesserAct.git
cd TesserAct
pip install -r requirements.txt
pip install -e .
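
As an optional sanity check, you can confirm that PyTorch sees your GPU and that diffusers imports cleanly. This is only a sketch and assumes both packages are pulled in by requirements.txt:

# optional environment check (assumes torch and diffusers come from requirements.txt)
import torch
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)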

Data Preparation

Please refer to DATA.md for data generation scripts and dataset preparation.

Training

Pre-training or Full Fine-tuning

To pre-train the full TesserAct model from CogVideoX, we provide a training script based on Finetrainers. The training code supports distributed training across multiple GPUs and multiple nodes.

To pre-train our TesserAct model, run the following command:

bash train_i2v_depth_normal_sft.sh

To fine-tune our released TesserAct model, modify the model loading code in tesseract/i2v_depth_normal_sft.py:

transformer = CogVideoXTransformer3DModel.from_pretrained_modify(
    "anyeZHY/tesseract",
    subfolder="tesseract_v01e_rgbdn_sft",
    ...
)
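
For reference, pre-training starts from the CogVideoX base weights rather than the released TesserAct checkpoint. A sketch of what that loading looks like is below; the base repo id and subfolder are assumptions, so check tesseract/i2v_depth_normal_sft.py for the actual values:

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",   # assumed CogVideoX image-to-video base checkpoint
    subfolder="transformer",
    ...
)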

LoRA Fine-tuning

You can efficiently fine-tune our TesserAct model using LoRA (Low-Rank Adaptation) with your own data (~100 videos). This approach requires roughly 30 GB of GPU memory and trains in about two days on a custom dataset.

To fine-tune using LoRA, run the following command:

bash train_i2v_depth_normal_lora.sh
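
Conceptually, LoRA fine-tuning attaches small low-rank adapters to the transformer's attention projections and trains only those. Below is a minimal sketch using peft; the rank, target module names, and import path are assumptions rather than the script's actual settings (see train_i2v_depth_normal_lora.sh for those):

# minimal LoRA sketch (hypothetical hyperparameters; the real ones live in train_i2v_depth_normal_lora.sh)
from peft import LoraConfig
from tesseract.i2v_depth_normal_sft import CogVideoXTransformer3DModel  # import path is an assumption

transformer = CogVideoXTransformer3DModel.from_pretrained_modify(
    "anyeZHY/tesseract",
    subfolder="tesseract_v01e_rgbdn_sft",
)
lora_config = LoraConfig(
    r=64,                                                 # hypothetical rank
    lora_alpha=64,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # typical diffusers attention projections
)
transformer.add_adapter(lora_config)                      # only the adapter weights receive gradients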

Warning

LoRA fine-tuning is experimental and not fully tested yet.

Note

We will provide a detailed training guide in the future, covering why TesserAct generalizes well, how to set the hyperparameters, and how the training methods (SFT vs. LoRA) compare in performance.

We don't have a concrete plan for releasing the full dataset yet: depth data is typically stored as floats, which takes up a lot of space and makes uploading to Hugging Face difficult. However, we provide scripts that show how to prepare the data.
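
As an illustration of why raw float depth is heavy, one common workaround is to quantize each depth map to 16 bits before storing it. This is only a sketch with an assumed depth range and file names, not the format used by our scripts:

# quantize a float32 depth map to uint16 for compact storage (assumed range and paths)
import numpy as np

depth = np.load("frame_000_depth.npy")            # hypothetical float32 depth map, shape (H, W)
max_depth = 10.0                                  # assumed working range in meters
depth_u16 = (np.clip(depth / max_depth, 0, 1) * 65535).astype(np.uint16)
np.savez_compressed("frame_000_depth.npz", depth=depth_u16, max_depth=max_depth)

# recover an approximate float depth later
loaded = np.load("frame_000_depth.npz")
depth_restored = loaded["depth"].astype(np.float32) / 65535 * loaded["max_depth"]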

Inference

TesserAct currently includes the following models. Model names follow the format anyeZHY/tesseract/ (the Hugging Face repo name) + <model_name>_<version>_<modality>_<training_method>. In <version>, the postfix p indicates a production-ready model and e an experimental one. We will keep updating the model weights and scaling the dataset to improve performance.

anyeZHY/tesseract/tesseract_v01e_rgbdn_sft
anyeZHY/tesseract/tesseract_v01e_rgb_lora
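
If you only want one variant on disk, you can fetch just that subfolder from the Hugging Face repo. This is a sketch that assumes the subfolder names match the model names above:

# fetch a single model variant (subfolder name assumed to match the list above)
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="anyeZHY/tesseract",
    allow_patterns=["tesseract_v01e_rgbdn_sft/*"],  # only download the RGB+Depth+Normal SFT weights
)
print("weights downloaded to:", local_dir)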

Important

We recommend reading USAGE.MD before running the inference code on your own data. It explains how to prepare inputs such as the text prompt, and analyzes the model's limitations and performance, including:

  • Tasks that the model can reliably accomplish.

  • Tasks that are achievable but only with a certain success rate. In the future, this may be improved by techniques such as test-time scaling.

  • Tasks that are currently beyond the model's capabilities.

You can run the inference code with the following command (optional flag: --memory_efficient).

python inference/inference_rgbdn_sft.py \
  --weights_path anyeZHY/tesseract/tesseract_v01e_rgbdn_sft \
  --image_path asset/images/fruit_vangogh.png \
  --prompt "pick up the apple google robot"

This command generates a video of the Google robot picking up the apple in the Van Gogh painting. Try other prompts, such as pick up the pear Franka Emika Panda, or use asset/images/majo.jpg with the prompt Move the cup near bottle Franka Emika Panda.

For RGB-only generation using the LoRA model, you can use:

python inference/inference_rgb_lora.py \
  --weights_path anyeZHY/tesseract/tesseract_v01e_rgb_lora \
  --image_path asset/images/fruit_vangogh.png \
  --prompt "pick up the apple google robot"

The RGB LoRA model offers the best generalization quality for RGB video generation, making it ideal for diverse robotic manipulation tasks.

For RGB+Depth+Normal generation using the LoRA model, you can use:

python inference/inference_rgbdn_lora.py \
  --base_weights_path anyeZHY/tesseract/tesseract_v01e_rgbdn_sft \
  --lora_weights_path ./your_local_lora_weights \
  --image_path asset/images/fruit_vangogh.png \
  --prompt "pick up the apple google robot"

You can find the output videos in the results folder. Note: when we tested the model on another server, the results were exactly the same as those we uploaded to GitHub. If your results differ or you get unexpected outputs such as noisy videos, please check your environment and the versions of the packages you are using.
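
If you want to check your outputs against a reference video, a simple frame-level comparison looks something like this; the reference path is hypothetical, and exact bit-equality also depends on your package versions:

# compare a generated video against a reference copy, frame by frame
# (requires an imageio video plugin such as imageio-ffmpeg or pyav)
import numpy as np
import imageio.v3 as iio

ours = np.stack(list(iio.imiter("results/val_0_pick_up_the_apple_google_robot_0.mp4")))
ref = np.stack(list(iio.imiter("reference/val_0_pick_up_the_apple_google_robot_0.mp4")))  # hypothetical reference
print("max per-pixel difference:", np.abs(ours.astype(int) - ref.astype(int)).max())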

Warning

Because the RT1 and Bridge normal data were generated with Temporal Marigold, the normal outputs are sometimes imperfect. We are working on improving the data using NormalCrafter.

Point Cloud Rendering

You can render point clouds from RGBD videos using Blender. This requires:

  1. Download Blender 4.3+ from blender.org
  2. Install required packages following the PyBlend setup guide
  3. Run the rendering script:
blender-4.4.3/blender -b -P scripts/rendering_points.py -- \
  --combined_video ./results/val_0_pick_up_the_apple_google_robot_0.mp4 \
  --render_output ./results/rendered_results

The script will generate point cloud renders for each frame of the video, saved as PNG images in the specified output directory.
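
If you want to turn those per-frame renders back into a video, something like the following works; the output path and frame rate are assumptions:

# stitch the rendered PNG frames into an mp4 (requires imageio-ffmpeg; fps is an assumption)
from pathlib import Path
import imageio.v2 as imageio

frame_paths = sorted(Path("./results/rendered_results").glob("*.png"))
with imageio.get_writer("./results/rendered_points.mp4", fps=8) as writer:
    for path in frame_paths:
        writer.append_data(imageio.imread(path)[:, :, :3])  # drop the alpha channel if present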

Below is a list of TODOs for the inference part.

  • LoRA inference code
  • Blender rendering code (check package PyBlend!)
  • Normal Integration

Citation

If you find our work useful, please consider citing:

@article{zhen2025tesseract,
  title={TesserAct: Learning 4D Embodied World Models}, 
  author={Haoyu Zhen and Qiao Sun and Hongxin Zhang and Junyan Li and Siyuan Zhou and Yilun Du and Chuang Gan},
  year={2025},
  eprint={2504.20995},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.20995}, 
}

Acknowledgements

We would like to thank the following works for their code and models, including CogVideoX, Finetrainers, Marigold, NormalCrafter, and PyBlend.

We are extremely grateful to Pengxiao Han and Zixian Gao for assistance with the baseline code, and to Yuncong Yang, Sunli Chen, Jiaben Chen, Zeyuan Yang, Zixin Wang, Lixing Fang, and many other friends in our Embodied AGI Lab for their helpful feedback and insightful discussions.