MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao1 · Shunlin Lu 2 · Huaijin Pi3 · Ke Fan4 · Liang Pan3 · Yueer Zhou1 · Ziyong Feng5 ·
Xiaowei Zhou1 · Sida Peng1† · Jingbo Wang6

1Zhejiang University 2The Chinese University of Hong Kong, Shenzhen 3The University of Hong Kong
4Shanghai Jiao Tong University 5DeepGlint 6Shanghai AI Lab
ICCV 2025


🔥 News

  • [2025-06] MotionStreamer has been accepted to ICCV 2025! 🎉

TODO List

  • Release the processing script of 272-dim motion representation.
  • Release the processed 272-dim Motion Representation of HumanML3D dataset. Only for academic usage.
  • Release the training code and checkpoint of our TMR-based motion evaluator trained on the processed 272-dim HumanML3D dataset.
  • Release the training and evaluation code as well as checkpoint of Causal TAE.
  • Release the training code of original motion generation model and streaming generation model (MotionStreamer).
  • Release the checkpoint and demo inference code of original motion generation model.
  • Release complete code for MotionStreamer.

🏃 Motion Representation

For details on how to obtain the 272-dim motion representation, as well as other useful tools (e.g., visualization and conversion to BVH format), please refer to our GitHub repo.

Installation

🐍 Python Virtual Environment

conda env create -f environment.yaml
conda activate mgpt

🤗 Hugging Face Mirror

All of our models and data are hosted on Hugging Face. If Hugging Face is not directly accessible from your network, you can route downloads through the HF-mirror endpoint:

pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com

📥 Data Preparation

For the convenience of researchers, we provide the processed 272-dim motion representation of:

HumanML3D dataset at this link.

BABEL dataset at this link.

❗️❗️❗️ The processed data is solely for academic purposes. Make sure you read through the AMASS License.

  1. Download the processed 272-dim HumanML3D dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-HumanML3D --local-dir ./humanml3d_272
cd ./humanml3d_272
unzip texts.zip
unzip motion_data.zip

The dataset is organized as:

./humanml3d_272
  ├── mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
      ├── test.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
  2. Download the processed 272-dim BABEL dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL --local-dir ./babel_272
cd ./babel_272
unzip texts.zip
unzip motion_data.zip

The dataset is organized as:

./babel_272
  ├── t2m_babel_mean_std
      ├── Mean.npy
      ├── Std.npy
  ├── split
      ├── train.txt
      ├── val.txt
  ├── texts
      ├── 000000.txt
      ...
  ├── motion_data
      ├── 000000.npy
      ...
  3. Download the processed streaming 272-dim BABEL dataset as follows:
huggingface-cli download --repo-type dataset --resume-download lxxiao/272-dim-BABEL-stream --local-dir ./babel_272_stream
cd ./babel_272_stream
unzip train_stream.zip
unzip train_stream_text.zip
unzip val_stream.zip
unzip val_stream_text.zip

The dataset is organized as:

./babel_272_stream
  ├── train_stream
      ├── seq1.npy
      ...
  ├── train_stream_text
      ├── seq1.txt
      ...
  ├── val_stream
      ├── seq1.npy
      ...
  ├── val_stream_text
      ├── seq1.txt
      ...

NOTE: We process the original BABEL dataset to support training of streaming motion generation. For example, if a motion sequence A is annotated as (A1, A2, A3, A4) in the BABEL dataset, each subsequence has a text description: (A1_t, A2_t, A3_t, A4_t).

Then, our BABEL-stream is constructed as:

seq1: (A1, A2) --- seq1_text: (A1_t*A2_t#A1_length)

seq2: (A2, A3) --- seq2_text: (A2_t*A3_t#A2_length)

seq3: (A3, A4) --- seq3_text: (A3_t*A4_t#A3_length)

Here, * and # are separator symbols, and A1_length is the number of frames of subsequence A1.
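
To make the annotation format concrete, here is a minimal sketch of how one streaming pair could be loaded and split (assuming standard numpy and the file layout shown above; parse_stream_text is a hypothetical helper written for illustration, not part of this repo):

import numpy as np

def parse_stream_text(path):
    # Split a BABEL-stream annotation 'A1_t*A2_t#A1_length' into its parts.
    raw = open(path).read().strip()
    texts, length = raw.rsplit('#', 1)    # '#' separates the texts from A1_length
    text_a, text_b = texts.split('*', 1)  # '*' separates the two sub-descriptions
    return text_a, text_b, int(length)

# Load one streaming pair and its annotation (paths follow the tree above).
motion = np.load('babel_272_stream/train_stream/seq1.npy')  # (num_frames, 272)
text_a, text_b, len_a = parse_stream_text('babel_272_stream/train_stream_text/seq1.txt')
part_a, part_b = motion[:len_a], motion[len_a:]             # split at the A1/A2 boundary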

🚀 Training

  1. Train our TMR-based motion evaluator on the processed 272-dim HumanML3D dataset:

    bash TRAIN_evaluator_272.sh

    After training for 100 epochs, the checkpoint will be stored at: Evaluator_272/experiments/temos/EXP1/checkpoints/.

    ⬇️ We provide the evaluator checkpoint on Hugging Face; download it as follows:

    python humanml3d_272/prepare/download_evaluator_ckpt.py

    The downloaded checkpoint will be stored at: Evaluator_272/.

  2. Train the Causal TAE:

    bash TRAIN_causal_TAE.sh ${NUM_GPUS}

    e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8

    The checkpoint will be stored at: Experiments/causal_TAE_t2m_272/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/causal_TAE_t2m_272'

    ⬇️ We provide the Causal TAE checkpoint on Hugging Face; download it as follows:

    python humanml3d_272/prepare/download_Causal_TAE_t2m_272_ckpt.py
  3. Train the text-to-motion model:

    We provide scripts to train the original text-to-motion generation model with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE trained in the first stage (a generic QK-Norm sketch is given after this list).

    3.1 Get motion latents:

    python get_latent.py --resume-pth Causal_TAE/net_last.pth --latent_dir humanml3d_272/t2m_latents

    3.2 Download the Sentence-T5-XXL model from Hugging Face:

    huggingface-cli download --resume-download sentence-transformers/sentence-t5-xxl --local-dir sentencet5-xxl/

    3.3 Train text to motion generation model:

    bash TRAIN_t2m.sh ${NUM_GPUS}

    e.g., if you have 8 GPUs, run: bash TRAIN_t2m.sh 8

    The checkpoint will be stored at: Experiments/t2m_model/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/t2m_model'

    ⬇️ We provide the text-to-motion model checkpoint on Hugging Face; download it as follows:

    python humanml3d_272/prepare/download_t2m_model_ckpt.py
  4. Train the streaming motion generation model (MotionStreamer):

    We provide scripts to train the streaming motion generation model (MotionStreamer) with LLaMA blocks, the Two-Forward strategy, and QK-Norm, using the motion latents encoded by the Causal TAE. Note that a new Causal TAE must first be trained on both the HumanML3D-272 and BABEL-272 data.

    4.1 Train a Causal TAE using both HumanML3D-272 and BABEL-272 data:

    bash TRAIN_causal_TAE.sh ${NUM_GPUS} t2m_babel_272

    e.g., if you have 8 GPUs, run: bash TRAIN_causal_TAE.sh 8 t2m_babel_272

    The checkpoint will be stored at: Experiments/causal_TAE_t2m_babel_272/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/causal_TAE_t2m_babel_272'

    ⬇️ We provide the Causal TAE checkpoint trained on both HumanML3D-272 and BABEL-272 data on Hugging Face; download it as follows:

    python humanml3d_272/prepare/download_Causal_TAE_t2m_babel_272_ckpt.py

    4.2 Get motion latents of both HumanML3D-272 and the processed BABEL-272-stream dataset:

    python get_latent.py --resume-pth Causal_TAE_t2m_babel/net_last.pth --latent_dir babel_272_stream/t2m_babel_latents --dataname t2m_babel_272

    4.3 Train MotionStreamer model:

    bash TRAIN_motionstreamer.sh ${NUM_GPUS}

    e.g., if you have 8 GPUs, run: bash TRAIN_motionstreamer.sh 8

    The checkpoint will be stored at: Experiments/motionstreamer_model/

    Tensorboard visualization:

    tensorboard --logdir='Experiments/motionstreamer_model'
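
For reference, QK-Norm (mentioned in steps 3 and 4) normalizes queries and keys before the attention dot product, which keeps attention logits well scaled during training. Below is a minimal, generic PyTorch sketch of a causal self-attention layer with QK-Norm; the layer names, the LayerNorm choice, and the shapes are assumptions for illustration only, not the repository's actual implementation:

import torch.nn as nn
import torch.nn.functional as F

class QKNormCausalAttention(nn.Module):
    # Generic causal self-attention with QK-Norm (illustrative sketch only).
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Normalize queries and keys per head; LayerNorm is one common choice.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq_len, head_dim).
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # QK-Norm before the dot product
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)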

📍 Evaluation

  1. Evaluate the metrics of the processed 272-dim HumanML3D dataset:

    bash EVAL_GT.sh

    (FID, R@1, R@2, R@3, Diversity and MM-Dist (Matching Score) are reported.)

  2. Evaluate the metrics of Causal TAE:

    bash EVAL_causal_TAE.sh

    (FID and MPJPE (mm) are reported.)

  3. Evaluate the metrics of the text-to-motion model:

    bash EVAL_t2m.sh

    (FID, R@1, R@2, R@3, Diversity and MM-Dist (Matching Score) are reported; a sketch of the retrieval metrics follows this list.)
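
For reference, here is a minimal sketch of how retrieval-style metrics such as R@k and MM-Dist (Matching Score) are typically computed from paired text and motion embeddings produced by an evaluator. This is a generic illustration (the standard protocol usually evaluates R-precision over fixed-size batches, e.g. 32 samples, which this sketch omits); the repository's own evaluation code is authoritative:

import torch

def retrieval_metrics(text_emb, motion_emb, ks=(1, 2, 3)):
    # text_emb, motion_emb: (N, D) embeddings of matched text/motion pairs.
    dist = torch.cdist(text_emb, motion_emb)          # (N, N) pairwise Euclidean distances
    ranks = dist.argsort(dim=1)                       # per text, motions sorted by distance
    gt = torch.arange(len(text_emb)).unsqueeze(1)
    match_rank = (ranks == gt).float().argmax(dim=1)  # rank of the ground-truth motion
    r_at_k = {f'R@{k}': (match_rank < k).float().mean().item() for k in ks}
    mm_dist = dist.diagonal().mean().item()           # mean distance of matched pairs (MM-Dist)
    return r_at_k, mm_dist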

🎬 Demo Inference

  1. Inference of the text-to-motion model:

    [Option 1] Recover from joint positions

    python demo_t2m.py --text 'a person is walking like a mummy.' --mode pos --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth

    [Option 2] Recover from joint rotations

    python demo_t2m.py --text 'a person is walking like a mummy.' --mode rot --resume-pth Causal_TAE/net_last.pth --resume-trans Experiments/t2m_model/latest.pth

    In our 272-dim representation, Inverse Kinematics (IK) is not needed. For further conversion to BVH format, please refer to this repo (Step 6: Representation_272 to BVH conversion). The BVH motion animation can be visualized and edited in Blender.

🌹 Acknowledgement

This repository builds upon the following awesome datasets and projects:

🤝🏼 Citation

If our project is helpful for your research, please consider citing:

@article{xiao2025motionstreamer,
  title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
  author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
  journal={arXiv preprint arXiv:2503.15451},
  year={2025}
}
