🎆🎆🎆 Visit our online demo here.
- 2024.04.07: Released the code of ModaVerse, version ModaVerse-7b-v0
- Customize Diffusion Model Zoo
- Add step-by-step data preparation instructions & instruction dataset
- Training with custom data
- Instruction generation scripts
- Release updated ModaVerse versions with different settings
conda create -n modaverse python=3.9 -y
conda activate modaverse
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
git clone --recursive https://github.com/xinke-wang/ModaVerse.git
cd ModaVerse
pip install -r requirements.txt
pip install -e .
# Use ModaVerse's requirements for ImageBind to keep dependency versions consistent
rm -rf ImageBind/requirements.txt
cp requirements.txt ImageBind/requirements.txt
cd ImageBind
pip install -e .
cd ..
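Optionally, verify the environment before downloading any weights; a quick sanity check (assuming a CUDA-capable GPU is present) is:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"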
mkdir .checkpoints && cd .checkpoints
Follow these instructions to obtain Vicuna's 7b-v0 delta weights and apply them to the pretrained LLaMA model.
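For reference, the delta merge is typically done with FastChat's apply_delta script. The sketch below assumes FastChat is installed and the original LLaMA-7B weights are available locally; the paths are placeholders and flag names may differ across FastChat versions, so follow the linked instructions for the exact command:
python -m fastchat.model.apply_delta \
    --base-model-path /path/to/llama-7b \
    --target-model-path .checkpoints/7b_v0 \
    --delta-path lmsys/vicuna-7b-delta-v0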
Then, download the ModaVerse pretrained model from one of the following sources:
| Model | Foundation LLM | HuggingFace | GoogleDrive | Box |
|---|---|---|---|---|
| ModaVerse-7b-v0 | Vicuna-7b-v0 | Model | Model | Model |
| ModaVerse-chat | Coming Soon | | | |
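If you prefer the command line, the HuggingFace checkpoint can also be fetched with huggingface-cli. The repository id below is a placeholder; substitute the one behind the HuggingFace link in the table above:
huggingface-cli download <modaverse-hf-repo-id> --local-dir .checkpoints/ModaVerse-7b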
Next, download the ImageBind model manually (a sample download command is given after the layout below), or let it be downloaded automatically to .checkpoints/
when the ModaVerse code is first run. Finally, place all the weights in the .checkpoints/
folder, following the structure below:
.checkpoints/
├── 7b_v0
│ ├── config.json
│ ├── generation_config.json
│ ├── model-00001-of-00003.safetensors
│ ├── model-00002-of-00003.safetensors
│ ├── model-00003-of-00003.safetensors
│ ├── model.safetensors.index.json
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ └── tokenizer.model
├── imagebind_huge.pth
└── ModaVerse-7b
├── added_tokens.json
├── config.json
├── config.py
├── pytorch_model.pt
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.model
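To fetch the ImageBind checkpoint manually, you can use wget with the download URL published in the ImageBind repository (verify the URL against the official ImageBind README):
wget -P .checkpoints/ https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth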
A simple example of using the model is as follows:
from modaverse.api import ModaVerseAPI
ModaVerse = ModaVerseAPI()
# Only Text Instruction
text_instruction = 'Please generate an audio that a dog is barking.'
ModaVerse(text_instruction)
# With Multi-modal Input
text_instruction = 'Please generate an audio of the sound for the animal in the image.'
ModaVerse(text_instruction, ['assets/media/image/cat.jpg'])
The output is saved in the output
folder by default.
Running inference with the fully equipped generators (diffusion models for all three modalities) may require at least 40 GB of GPU memory. If you lack sufficient memory, consider setting meta_response_only=True
to receive only the meta response from the model, and then customize the parser and generators to fit your needs.
ModaVerse = ModaVerseAPI(meta_response_only=True)
To try the demo locally, run:
python demo.py
If you find ModaVerse useful in your research or applications, please consider citing:
@inproceedings{wang2024modaverse,
  title={ModaVerse: Efficiently Transforming Modalities with LLMs},
  author={Wang, Xinyu and Zhuang, Bohan and Wu, Qi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
We would like to thank the authors of the following repositories for their valuable contributions: ImageBind, MiniGPT-4, Vicuna, Stable Diffusion, AudioLDM, NextGPT, VideoFusion