Jinbo Xing, Menghan Xia*, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu,
Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong
(* corresponding author)
From CUHK and Tencent AI Lab.
IEEE TVCG 2024
Make-Your-Video is a customized video generation model conditioned on both text and motion structure (depth). It inherits rich visual concepts from a pre-trained image LDM and supports longer video inference.
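To illustrate how the depth-based structural guidance is typically prepared, here is a minimal sketch that extracts per-frame depth maps with the MiDaS DPT-Hybrid estimator (the same depth model downloaded in the setup below). The `generate_video` call is a hypothetical placeholder, not this repository's API; the actual entry point is `scripts/run.sh`.

```python
# Minimal sketch (not the repo's actual API): prepare per-frame depth maps
# as structural guidance for text-conditioned video generation.
import torch

# Load the MiDaS DPT-Hybrid depth estimator and its input transform from torch hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Hybrid").eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

def frames_to_depth(frames):
    """frames: iterable of HxWx3 RGB uint8 numpy arrays -> list of relative depth maps."""
    depth_maps = []
    with torch.no_grad():
        for frame in frames:
            batch = transform(frame)                    # resize + normalize for DPT
            depth_maps.append(midas(batch).squeeze(0))  # (H', W') relative depth
    return depth_maps

# `generate_video` is a hypothetical placeholder; the real pipeline is run via scripts/run.sh.
# video = generate_video(prompt="A dam discharging water",
#                        depth_guidance=frames_to_depth(source_frames))
```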
| Real-life scene | Ours | Text2Video-zero+CtrlNet | LVDMExt+Adapter |
|---|---|---|---|
| "A dam discharging water" | | | |
| "A futuristic rocket ship on a launchpad, with sleek design, glowing lights" | | | |
- [2023.11.30]: 🔥🔥 Release the main model.
- [2023.06.01]: 🔥🔥 Create this repo and launch the project webpage.
| Model | Resolution | Checkpoint |
|---|---|---|
| MakeYourVideo256 | 256x256 | Hugging Face |
It takes approximately 13 seconds and requires a peak GPU memory of 20 GB to generate a video on a single NVIDIA A100 (40GB) GPU.
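To reproduce this measurement on your own hardware, a minimal sketch (assuming a CUDA build of PyTorch) is to reset and read PyTorch's peak-memory counter around the sampling call:

```python
import torch

torch.cuda.reset_peak_memory_stats()  # clear the peak-memory counter

# ... run the sampling code here, e.g. the entry point invoked by scripts/run.sh ...

peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak GPU memory: {peak_gb:.1f} GB")
```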
```bash
conda create -n makeyourvideo python=3.8.5
conda activate makeyourvideo
pip install -r requirements.txt
```
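After installation, a quick sanity check (a sketch, not part of the repository) confirms that PyTorch was built with CUDA support and can see your GPU:

```python
import torch

print(torch.__version__)              # should match the version pinned in requirements.txt
print(torch.cuda.is_available())      # expect True on a CUDA-enabled machine
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100-SXM4-40GB"
```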
- Download the pre-trained depth estimation model from Hugging Face, and put `dpt_hybrid-midas-501f0c75.pt` in `checkpoints/depth/dpt_hybrid-midas-501f0c75.pt`.
- Download the pretrained model via Hugging Face, and put `model.ckpt` in `checkpoints/makeyourvideo_256_v1/model.ckpt` (a programmatic alternative to these downloads is sketched after this list).
- Run the following command in the terminal:

```bash
sh scripts/run.sh
```
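As a programmatic alternative to the manual downloads above, the following sketch uses `huggingface_hub`; the `repo_id` values are placeholders and must be replaced with the actual Hugging Face repositories linked in this README.

```python
from huggingface_hub import hf_hub_download

# NOTE: both repo_id values are placeholders; substitute the actual
# Hugging Face repositories linked in this README.
hf_hub_download(
    repo_id="<org>/<depth-model-repo>",   # placeholder for the depth estimator repo
    filename="dpt_hybrid-midas-501f0c75.pt",
    local_dir="checkpoints/depth",
)
hf_hub_download(
    repo_id="<org>/MakeYourVideo256",     # placeholder for the main model repo
    filename="model.ckpt",
    local_dir="checkpoints/makeyourvideo_256_v1",
)
```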
VideoCrafter1: Framework for high-quality video generation.
DynamiCrafter: Open-domain image animation method using video diffusion priors.
Play with these projects in the same conda environment!
```bibtex
@article{xing2023make,
  title={Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance},
  author={Xing, Jinbo and Xia, Menghan and Liu, Yuxin and Zhang, Yuechen and Zhang, Yong and He, Yingqing and Liu, Hanyuan and Chen, Haoxin and Cun, Xiaodong and Wang, Xintao and others},
  journal={arXiv preprint arXiv:2306.00943},
  year={2023}
}
```
This repository is developed for RESEARCH purposes only, so it may be used solely for personal/research/non-commercial purposes.
We gratefully acknowledge the Visual Geometry Group at the University of Oxford for collecting the WebVid-10M dataset, and we follow the corresponding terms of access.