
UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding

Chenkai Xu¹*, Xu Wang¹*, Zhenyi Liao¹, Yishun Li², Tianqi Hou³, Zhijie Deng¹†
¹Shanghai Jiao Tong University   ²Huawei   ³Tongji University
{132435xck,wangxu60,zhijied}@sjtu.edu.cn
*Equal contribution.   †Corresponding author.

arXiv: https://arxiv.org/abs/2502.05415   |   Hugging Face


News


What's New about UniCMs?

UniCMs is a unified consistency model that addresses the low generation efficiency of unified models across multimodal tasks. It accelerates generation by using discrete diffusion for image modeling and introducing parallel decoding into autoregressive text modeling, thereby establishing a unified denoising perspective for both modalities. UniCMs achieves this through the following key innovations:



  • Unified Denoising: UniCMs leverages parallel text decoding (Jacobi decoding) to reframe text generation as a denoising process, mirroring image generation. This creates a unified perspective in which both modalities are treated as denoising trajectories (see the sketch after this list).
  • Consistency Distillation: Inspired by acceleration techniques in diffusion models, UniCMs employs consistency distillation to significantly shorten these multimodal denoising trajectories. This results in much faster content generation.
  • Trajectory Segmentation and Curriculum Learning: To improve training convergence, UniCMs adopts a staged training approach. This involves progressively decreasing trajectory segments and incorporating curriculum learning.
  • Top-k Sampling: To enhance the quality of generated samples, particularly when using fewer sampling steps, UniCMs incorporates top-k sampling during the inference phase.
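
To make the unified-denoising view concrete, below is a minimal, self-contained sketch of Jacobi-style parallel decoding combined with top-k sampling, written against a toy PyTorch language model. This is an illustration of the general technique only, not the UniCMs code: ToyLM, top_k_sample, jacobi_decode, block_size, and k are all hypothetical placeholders.

# Toy sketch of Jacobi (parallel) decoding with top-k sampling.
# NOT the UniCMs implementation: ToyLM, block_size, and k are illustrative placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB = 100

class ToyLM(nn.Module):
    """A stand-in language model: embedding + linear head per position."""
    def __init__(self, vocab=VOCAB, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):                      # ids: (seq,)
        return self.head(self.emb(ids))          # logits: (seq, vocab)

def top_k_sample(logits, k=10):
    """Sample one token per position from the k highest-scoring candidates."""
    vals, idx = logits.topk(k, dim=-1)
    probs = torch.softmax(vals, dim=-1)
    choice = torch.multinomial(probs, 1)
    return idx.gather(-1, choice).squeeze(-1)

@torch.no_grad()
def jacobi_decode(model, prompt, block_size=8, k=10, max_iters=20):
    """Refine a whole block of draft tokens in parallel instead of one token at a time."""
    draft = torch.randint(0, VOCAB, (block_size,))        # random initial "noisy" text
    seq = torch.cat([prompt, draft])
    for _ in range(max_iters):
        logits = model(seq)                               # one parallel forward pass
        # Every draft position is re-predicted from the logits of the position before it.
        new_draft = top_k_sample(logits[len(prompt) - 1:-1], k=k)
        if torch.equal(new_draft, seq[len(prompt):]):     # fixed point: the block stopped changing
            break
        seq = torch.cat([prompt, new_draft])
    return seq[len(prompt):]

model = ToyLM()
prompt = torch.randint(0, VOCAB, (5,))
print(jacobi_decode(model, prompt))

In this view, each Jacobi iteration is one denoising step on a text trajectory; consistency distillation then trains the model so that far fewer such steps are needed to reach the end of the trajectory.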

Results

UniCMs demonstrates substantial speed improvements in both text-to-image and image-to-text generation. Importantly, we are releasing models for both 256 and 512 resolutions.

512-Resolution Model

  • Text-to-Image Generation: qualitative results of the 512-resolution UniCMs T2I model (figure).

  • Multimodal Understanding: qualitative results of the 512-resolution UniCMs MMU model (figure).


Getting Started

First, clone the repository, then create the Conda environment and install the dependencies:

git clone https://github.com/zhijie-group/UniCMs.git
cd UniCMs
conda create -n UniCMs python=3.8
conda activate UniCMs
pip3 install -r requirements.txt

Inference

Multimodal Understanding

# For 512-resolution model
sh inference_mmu_512.sh

# For 256-resolution model
sh inference_mmu_256.sh

Text-to-Image Generation

# For 512-resolution model
sh inference_t2i_512.sh

# For 256-resolution model
sh inference_t2i_256.sh


Training Pipeline

To train the 512-resolution model, run:

sh train_script/train512.sh

TODO

  • Release inference and training code.
  • Release model weights.
  • Conduct further experiments with larger models and datasets.

Contributing

We warmly welcome contributions to UniCMs! If you have suggestions for new features or improvements, please open an issue or submit a pull request. Your contributions are highly appreciated!


Citation

@misc{xu2025unicmsunifiedconsistencymodel,
      title={UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding}, 
      author={Chenkai Xu and Xu Wang and Zhenyi Liao and Yishun Li and Tianqi Hou and Zhijie Deng},
      year={2025},
      eprint={2502.05415},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.05415}, 
}

Acknowledgments

We extend our sincere gratitude to the authors of Show-o and the developers of the essential libraries and frameworks that underpin UniCMs. This includes, but is not limited to: open-muse, Phi-1.5, maskgit, taming-transformers, transformers, accelerate, and diffusers. We deeply appreciate the invaluable contributions of all the authors.
