
UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding

Chenkai Xu¹*, Xu Wang¹*, Zhenyi Liao¹, Yishun Li², Tianqi Hou³, Zhijie Deng¹†
¹Shanghai Jiao Tong University   ²Huawei   ³Tongji University
{132435xck,wangxu60,zhijied}@sjtu.edu.cn
*Equal contribution.   †Corresponding author.

arXiv: https://arxiv.org/abs/2502.05415   |   Hugging Face


News


What's New about UniCMs?

UniCMs is a unified consistency model that addresses the low generation efficiency of unified models across multimodal tasks. It accelerates generation by using discrete diffusion for image modeling and introducing parallel decoding into autoregressive text modeling, thereby establishing a unified denoising perspective for both modalities. UniCMs achieves this through the following key innovations:



  • Unified Denoising: UniCMs leverages parallel text decoding (Jacobi decoding) to reframe text generation as a denoising process, mirroring image generation. This creates a unified perspective in which both modalities are treated as denoising trajectories (see the sketch after this list).
  • Consistency Distillation: Inspired by acceleration techniques in diffusion models, UniCMs employs consistency distillation to significantly shorten these multimodal denoising trajectories. This results in much faster content generation.
  • Trajectory Segmentation and Curriculum Learning: To improve training convergence, UniCMs adopts a staged training approach. This involves progressively decreasing trajectory segments and incorporating curriculum learning.
  • Top-k Sampling: To enhance the quality of generated samples, particularly when using fewer sampling steps, UniCMs incorporates top-k sampling during the inference phase.
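
To make the unified-denoising view concrete, below is a minimal, self-contained sketch of Jacobi-style parallel decoding combined with top-k sampling, written against a toy PyTorch language model. This is an illustration of the general technique only, not the UniCMs code: ToyLM, top_k_sample, jacobi_decode, block_size, and k are all hypothetical placeholders.

# Toy sketch of Jacobi (parallel) decoding with top-k sampling.
# NOT the UniCMs implementation: ToyLM, block_size, and k are illustrative placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB = 100

class ToyLM(nn.Module):
    """A stand-in language model: embedding + linear head per position."""
    def __init__(self, vocab=VOCAB, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):                      # ids: (seq,)
        return self.head(self.emb(ids))          # logits: (seq, vocab)

def top_k_sample(logits, k=10):
    """Sample one token per position from the k highest-scoring candidates."""
    vals, idx = logits.topk(k, dim=-1)
    probs = torch.softmax(vals, dim=-1)
    choice = torch.multinomial(probs, 1)
    return idx.gather(-1, choice).squeeze(-1)

@torch.no_grad()
def jacobi_decode(model, prompt, block_size=8, k=10, max_iters=20):
    """Refine a whole block of draft tokens in parallel instead of one token at a time."""
    draft = torch.randint(0, VOCAB, (block_size,))        # random initial "noisy" text
    seq = torch.cat([prompt, draft])
    for _ in range(max_iters):
        logits = model(seq)                               # one parallel forward pass
        # Every draft position is re-predicted from the logits of the position before it.
        new_draft = top_k_sample(logits[len(prompt) - 1:-1], k=k)
        if torch.equal(new_draft, seq[len(prompt):]):     # fixed point: the block stopped changing
            break
        seq = torch.cat([prompt, new_draft])
    return seq[len(prompt):]

model = ToyLM()
prompt = torch.randint(0, VOCAB, (5,))
print(jacobi_decode(model, prompt))

In this view, each Jacobi iteration is one denoising step on a text trajectory; consistency distillation then trains the model so that far fewer such steps are needed to reach the end of the trajectory.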

Results

UniCMs demonstrates substantial speed improvements in both text-to-image and image-to-text generation. Importantly, we are releasing models for both 256 and 512 resolutions.

512-Resolution Model

  • Text-to-Image Generation: qualitative results of the 512-resolution UniCMs T2I model (figure).

  • Multimodal Understanding: qualitative results of the 512-resolution UniCMs MMU model (figure).


Getting Started

First, clone the repository, then create the Conda environment and install the dependencies:

git clone https://github.com/zhijie-group/UniCMs.git
cd UniCMs
conda create -n UniCMs python=3.8
conda activate UniCMs
pip3 install -r requirements.txt

Inference

Multimodal Understanding

# For 512-resolution model
sh inference_mmu_512.sh

# For 256-resolution model
sh inference_mmu_256.sh

Text-to-Image Generation

# For 512-resolution model
sh inference_t2i_512.sh

# For 256-resolution model
sh inference_t2i_256.sh


Training Pipeline

To train the 512-resolution model, run:

sh train_script/train512.sh

TODO

  • Release inference and training code.
  • Release model weights.
  • Conduct further experiments with larger models and datasets.

Contributing

We warmly welcome contributions to UniCMs! If you have suggestions for new features or improvements, please open an issue or submit a pull request. Your contributions are highly appreciated!


Citation

@misc{xu2025unicmsunifiedconsistencymodel,
      title={UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding}, 
      author={Chenkai Xu and Xu Wang and Zhenyi Liao and Yishun Li and Tianqi Hou and Zhijie Deng},
      year={2025},
      eprint={2502.05415},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.05415}, 
}

Acknowledgments

We extend our sincere gratitude to the authors of Show-o and the developers of the essential libraries and frameworks that underpin UniCMs. This includes, but is not limited to: open-muse, Phi-1.5, maskgit, taming-transformers, transformers, accelerate, and diffusers. We deeply appreciate the invaluable contributions of all the authors.
