Chenkai Xu1*, Xu Wang1*, Zhenyi Liao1, Yishun Li2, Tianqi Hou3, Zhijie Deng1†
1Shanghai Jiao Tong University 2Huawei 3Tongji University
{132435xck,wangxu60,zhijied}@sjtu.edu.cn
*Equal contribution. †Corresponding author.
- [2024-11-29] We release a 256-resolution version of the weights for UniCMs on Hugging Face.
- [2025-02-12] We release a 512-resolution version of the weights for UniCMs on Hugging Face.
UniCMs is a unified consistency model that addresses the low generation efficiency of unified models across multimodal tasks. It significantly accelerates generation by using discrete diffusion for image modeling and introducing parallel decoding into autoregressive text modeling, thereby establishing a unified denoising perspective for both modalities. UniCMs achieves this through the following key innovations:
- Unified Denoising: UniCMs leverages parallel text decoding (Jacobi decoding) to reframe text generation as a denoising process, mirroring image generation. This yields a unified perspective in which both modalities are treated as denoising trajectories (see the Jacobi decoding sketch after this list).
- Consistency Distillation: Inspired by acceleration techniques for diffusion models, UniCMs applies consistency distillation to substantially shorten these multimodal denoising trajectories, yielding much faster content generation (see the distillation sketch after this list).
- Trajectory Segmentation and Curriculum Learning: To improve training convergence, UniCMs adopts a staged training scheme that progressively reduces the number of trajectory segments and incorporates curriculum learning.
- Top-k Sampling: To enhance sample quality, particularly when using few sampling steps, UniCMs applies top-k sampling during inference (see the sampling sketch after this list).
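The following minimal sketch illustrates Jacobi (parallel) decoding of a text block: all draft tokens are re-predicted in parallel each iteration until the block reaches a fixed point. The Hugging Face-style model(...).logits interface and the hyperparameters are illustrative assumptions, not the UniCMs code.

import torch

@torch.no_grad()
def jacobi_decode(model, prompt_ids, block_len=16, max_iters=16, pad_id=0):
    # Start from an arbitrary draft for the whole block (here: pad tokens).
    draft = torch.full((1, block_len), pad_id, dtype=torch.long, device=prompt_ids.device)
    for _ in range(max_iters):
        seq = torch.cat([prompt_ids, draft], dim=1)
        logits = model(seq).logits                      # (1, L, vocab)
        # Next-token predictions for every position of the draft block.
        preds = logits[:, prompt_ids.size(1) - 1 : -1].argmax(dim=-1)
        if torch.equal(preds, draft):                   # fixed point: block has converged
            return preds
        draft = preds                                   # refine all positions in parallel
    return draft

Each Jacobi iteration plays the role of one denoising step of the text trajectory; consistency distillation then trains the model to traverse this trajectory in far fewer iterations.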
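Below is a minimal sketch of a consistency-distillation step over a recorded denoising trajectory: the student's prediction at a noisier point is pulled toward an EMA copy's prediction at the adjacent, less-noisy point. The function names and the KL objective are simplifying assumptions, not the released training code.

import torch
import torch.nn.functional as F

def consistency_step(student, ema_student, x_t, x_s, t, s):
    # x_t and x_s are adjacent points on the same teacher trajectory
    # (x_t is noisier than x_s); both should map to the same endpoint prediction.
    logits_t = student(x_t, t)                          # (B, L, vocab)
    with torch.no_grad():
        logits_s = ema_student(x_s, s)                  # EMA copy provides a stable target
    return F.kl_div(
        F.log_softmax(logits_t, dim=-1),
        F.softmax(logits_s, dim=-1),
        reduction="batchmean",
    )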
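The sketch below shows top-k sampling over per-position token logits, the kind of filtering applied at inference to keep sample quality high when only a few steps are used. Shapes and default values are illustrative assumptions.

import torch

def top_k_sample(logits, k=10, temperature=1.0):
    # logits: (..., vocab_size); sample only among the k most likely tokens.
    vals, idx = torch.topk(logits / temperature, k, dim=-1)
    probs = torch.softmax(vals, dim=-1)
    choice = torch.multinomial(probs.reshape(-1, k), num_samples=1)
    return idx.reshape(-1, k).gather(-1, choice).reshape(logits.shape[:-1])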
UniCMs demonstrates substantial speed improvements in both text-to-image and image-to-text generation. Importantly, we are releasing models for both 256 and 512 resolutions.
- Text-to-Image Generation: 512-resolution T2I results of UniCMs.
- Multimodal Understanding: 512-resolution MMU results of UniCMs.
First, create and activate the Conda environment, then install the dependencies:
conda create -n UniCMs python=3.8
conda activate UniCMs # Activate the environment
cd UniCMs
pip3 install -r requirements.txt
Run multimodal understanding (MMU) inference with the provided scripts:
# For the 512-resolution model
sh inference_mmu_512.sh
# For the 256-resolution model
sh inference_mmu_256.sh
Run text-to-image (T2I) inference:
# For the 512-resolution model
sh inference_t2i_512.sh
# For the 256-resolution model
sh inference_t2i_256.sh
To train the 512-resolution model, run:
sh train_script/train512.sh
- Release inference and training code.
- Release model weights.
- Conduct further experiments with larger models and datasets.
We warmly welcome contributions to UniCMs! If you have suggestions for new features or improvements, please open an issue or submit a pull request. Your contributions are highly appreciated!
@misc{xu2025unicmsunifiedconsistencymodel,
  title={UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding},
  author={Chenkai Xu and Xu Wang and Zhenyi Liao and Yishun Li and Tianqi Hou and Zhijie Deng},
  year={2025},
  eprint={2502.05415},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.05415},
}
We extend our sincere gratitude to the authors of Show-o and the developers of the essential libraries and frameworks that underpin UniCMs. This includes, but is not limited to: open-muse, Phi-1.5, maskgit, taming-transformers, transformers, accelerate, and diffusers. We deeply appreciate the invaluable contributions of all the authors.