Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized
This is a stripped-down, minimal-dependency repository for running VQGAN+CLIP locally or in production.
For a Google Colab notebook, see the original repository.
Clone this repository and `cd` inside it.
git clone https://github.com/kcosta42/VQGAN-CLIP-Docker.git
cd VQGAN-CLIP-Docker
You can download a pretrained VQGAN model and put it in the `./models` folder.
Dataset | Link | Config |
---|---|---|
ImageNet (f=16), 16384 | vqgan_imagenet_f16_16384.ckpt | ./configs/models/vqgan_imagenet_f16_16384.json |
ImageNet (f=16), 1024 | vqgan_imagenet_f16_1024.ckpt | ./configs/models/vqgan_imagenet_f16_1024.json |
FacesHQ (f=16) | vqgan_faceshq_f16_1024.ckpt | ./configs/models/vqgan_faceshq_f16_1024.json |
COCO-Stuff (f=16) | vqgan_coco_f16_8192.ckpt | ./configs/models/vqgan_coco_f16_8192.json |
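As an illustration only (this helper is not part of the repository), the sketch below downloads a checkpoint into `./models`. The `CKPT_URL` value is a placeholder, not a real link; substitute the download link from the table above.

```python
# Illustrative download helper (not part of this repository).
import pathlib
import urllib.request

# Placeholder URL: replace it with the actual link for the checkpoint you want.
CKPT_URL = "https://example.com/vqgan_imagenet_f16_16384.ckpt"

models_dir = pathlib.Path("./models")
models_dir.mkdir(parents=True, exist_ok=True)
dest = models_dir / CKPT_URL.split("/")[-1]

urllib.request.urlretrieve(CKPT_URL, dest)
print(f"Saved checkpoint to {dest}")
```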
To run on a GPU, make sure you have CUDA installed on your system (tested with CUDA 11.1+).
- 6 GB of VRAM is required to generate 256x256 images.
- 11 GB of VRAM is required to generate 512x512 images.
- 24 GB of VRAM is required to generate 1024x1024 images. (Untested)
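If you are unsure how much memory your card has, a quick check with plain PyTorch (not a script from this repository) prints the total VRAM of the first CUDA device so you can pick an image size from the list above.

```python
# Quick VRAM check using plain PyTorch.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB of VRAM")
else:
    print("No CUDA device detected; see the CPU instructions below.")
```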
Install the Python requirements:
python3 -m pip install -r requirements.txt
To check whether you can run this on your GPU, the following command must return `True`.
python3 -c "import torch; print(torch.cuda.is_available());"
Make sure you have `docker` and `docker-compose` installed. `nvidia-docker` is needed if you want to run this on your GPU through Docker.
A Makefile is provided for ease of use.
make build # Build the docker image
Two configuration files are provided: `./configs/local.json` and `./configs/docker.json`. They are ready to go, but you may want to edit them to meet your needs. Check the Configuration section to understand each field.
By default, the resulting generations can be found in the `./outputs` folder.
To run locally on the GPU:
python3 -m scripts.generate -c ./configs/local.json
To run on Docker with the GPU:
make generate
To run locally on the CPU:
DEVICE=cpu python3 -m scripts.generate -c ./configs/local.json
To run on Docker with the CPU:
make generate-cpu
Argument | Type | Description |
---|---|---|
`prompts` | List[str] | Text prompts |
`image_prompts` | List[FilePath] | Image prompts / target image paths |
`max_iterations` | int | Number of iterations |
`save_freq` | int | Save the image every `save_freq` iterations |
`size` | [int, int] | Image size (width height) |
`pixelart` | [int, int] | Pixelart image size (width height) (optional; remove this option to disable) |
`init_image` | FilePath | Initial image |
`init_noise` | str | Initial noise image ["gradient", "pixels", "fractal"] |
`init_weight` | float | Initial weight |
`mse_decay_rate` | int | Slowly decrease the MSE loss every `mse_decay_rate` iterations until it reaches about 0 |
`output_dir` | FilePath | Path to the output directory |
`models_dir` | FilePath | Path to the models cache directory |
`clip_model` | FilePath | CLIP model path or name |
`vqgan_checkpoint` | FilePath | VQGAN checkpoint path |
`vqgan_config` | FilePath | VQGAN config path |
`noise_prompt_seeds` | List[int] | Noise prompt seeds |
`noise_prompt_weights` | List[float] | Noise prompt weights |
`step_size` | float | Learning rate |
`cutn` | int | Number of cuts |
`cut_pow` | float | Cut power |
`seed` | int | Seed (-1 for a random seed) |
`optimizer` | str | Optimizer ["Adam", "AdamW", "Adagrad", "Adamax", "DiffGrad", "AdamP", "RAdam"] |
`nwarm_restarts` | int | Number of times the learning rate is reset (-1 to disable LR decay) |
`augments` | List[str] | Enabled augments ["Ji", "Sh", "Gn", "Pe", "Ro", "Af", "Et", "Ts", "Cr", "Er", "Re", "Hf"] |
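For reference, a minimal generation config built from the fields above could look like the sketch below. It is written as plain Python so the choices can be commented; every value shown is an illustrative example, not a default shipped in `./configs/local.json`.

```python
# Illustrative sketch: write a minimal generation config using the fields
# documented above. All values are example choices, not repository defaults.
import json

config = {
    "prompts": ["a watercolor painting of a lighthouse at dawn"],  # text prompts
    "image_prompts": [],               # optional image prompts
    "max_iterations": 500,             # number of iterations
    "save_freq": 50,                   # save an image every 50 iterations
    "size": [256, 256],                # width, height (fits in ~6 GB of VRAM)
    "init_noise": "fractal",           # "gradient", "pixels" or "fractal"
    "init_weight": 0.0,
    "mse_decay_rate": 50,
    "output_dir": "./outputs",
    "models_dir": "./models",
    "clip_model": "ViT-B/32",          # example CLIP model name
    "vqgan_checkpoint": "./models/vqgan_imagenet_f16_16384.ckpt",
    "vqgan_config": "./configs/models/vqgan_imagenet_f16_16384.json",
    "noise_prompt_seeds": [],
    "noise_prompt_weights": [],
    "step_size": 0.1,                  # learning rate
    "cutn": 32,                        # number of cuts
    "cut_pow": 1.0,
    "seed": -1,                        # -1 picks a random seed
    "optimizer": "Adam",
    "nwarm_restarts": -1,              # -1 disables LR decay
    "augments": ["Af", "Pe", "Ji", "Er"],
}

with open("./configs/my_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

The resulting file can then be passed to the generation script with `-c ./configs/my_config.json`.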
These are instructions for training a new VQGAN model. You can also fine-tune the pretrained models, but you may need to tweak the training script.
Two model configuration files are provided: `./configs/models/vqgan_custom.json` and `./configs/models/vqgan_custom_docker.json`. They are ready to go, but you may want to edit them to meet your needs. Check the Model Configuration section to understand each field.
By default, the models are saved in the `./models/checkpoints` folder.
Put your images in a folder inside the data directory (`./data` by default).
The dataset must be structured as follows:
./data/
├── class_x/
│ ├── xxx.png
│ ├── xxy.jpg
│ └── ...
│ └── xxz.ppm
└── class_y/
├── 123.bmp
├── nsdf3.tif
└── ...
└── asd932_.webp
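If your images currently sit in a single flat folder, a small helper along the lines of the hypothetical sketch below (not part of this repository) can sort them into the per-class layout shown above. The `label_for` rule is an assumption; replace it with your own labelling logic.

```python
# Hypothetical helper: copy a flat folder of images into the ./data/<class>/
# layout expected by the training script. Adjust label_for() to your own data.
import pathlib
import shutil

SRC = pathlib.Path("./raw_images")  # assumed location of your unsorted images
DST = pathlib.Path("./data")        # default data directory

EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".ppm", ".tif", ".webp"}

def label_for(image: pathlib.Path) -> str:
    # Assumption: the class name is the part of the filename before the first "_".
    return image.stem.split("_")[0]

for image in SRC.iterdir():
    if image.suffix.lower() in EXTENSIONS:
        target_dir = DST / label_for(image)
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(image, target_dir / image.name)
```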
To run locally on the GPU:
python3 -m scripts.train -c ./configs/models/vqgan_custom.json
To run on Docker with the GPU:
make train
To run locally on the CPU:
DEVICE=cpu python3 -m scripts.train -c ./configs/models/vqgan_custom.json
To run on Docker with the CPU:
make train-cpu
Argument | Type | Description |
---|---|---|
`base_learning_rate` | float | Initial learning rate |
`batch_size` | int | Batch size (adjust based on your GPU capability) |
`epochs` | int | Maximum number of epochs |
`output_dir` | FilePath | Path to the directory where training images are saved |
`models_dir` | FilePath | Path to the directory where the model is saved |
`data_dir` | FilePath | Path to the data directory |
`seed` | int | Seed (-1 for a random seed) |
`resume_checkpoint` | FilePath | Path to a pretrained model |
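As with the generation config, here is a hedged sketch of what the training fields above could look like. All values are illustrative examples, not the defaults of `./configs/models/vqgan_custom.json`.

```python
# Illustrative sketch: write an example training config with the fields
# documented above. All values are example choices, not repository defaults.
import json

train_config = {
    "base_learning_rate": 4.5e-6,       # initial learning rate
    "batch_size": 8,                    # lower this if you run out of VRAM
    "epochs": 100,                      # maximum number of epochs
    "output_dir": "./outputs",          # where training images are saved
    "models_dir": "./models/checkpoints",
    "data_dir": "./data",
    "seed": 42,                         # -1 for a random seed
    "resume_checkpoint": "",            # optionally point to a pretrained model
}

with open("./configs/models/my_vqgan.json", "w") as f:
    json.dump(train_config, f, indent=2)
```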
- Let the Generator train without the Discriminator for a few epochs (~3-5 epochs for ImageNet), then enable the Discriminator.
  The variable `lossconfig.params.disc_start` corresponds to the number of global steps (i.e. batch iterations) before the Discriminator is enabled; one way to estimate it is sketched after this list.
- Once enabled, the Discriminator loss will stagnate around ~1.0; this is normal behaviour. The loss will decrease in later epochs (this can take a very long time).
- If you enable the Discriminator too soon, the Generator will take much longer to train.
- There is no strict rule for the number of epochs. If your dataset is large enough, there is little risk of overfitting, so the more you train, the better.
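To pick a value for `lossconfig.params.disc_start`, you can convert a number of warm-up epochs into global steps, as in this small sketch (the function name and defaults are assumptions, not code from this repository).

```python
# Hypothetical helper: estimate lossconfig.params.disc_start, i.e. the number of
# global steps (batch iterations) after which the Discriminator is enabled.
import math

def disc_start_steps(num_images: int, batch_size: int, warmup_epochs: int = 4) -> int:
    steps_per_epoch = math.ceil(num_images / batch_size)
    return warmup_epochs * steps_per_epoch

# Example: a 10,000-image dataset, batch size 8, Discriminator enabled after 4 epochs.
print(disc_start_steps(10_000, 8))  # 5000
```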
@misc{unpublished2021clip,
title = {CLIP: Connecting Text and Images},
author = {Alec Radford and Ilya Sutskever and Jong Wook Kim and Gretchen Krueger and Sandhini Agarwal},
year = {2021}
}
@misc{esser2020taming,
title={Taming Transformers for High-Resolution Image Synthesis},
author={Patrick Esser and Robin Rombach and Björn Ommer},
year={2020},
eprint={2012.09841},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{ramesh2021zeroshot,
title = {Zero-Shot Text-to-Image Generation},
author = {Aditya Ramesh and Mikhail Pavlov and Gabriel Goh and Scott Gray and Chelsea Voss and Alec Radford and Mark Chen and Ilya Sutskever},
year = {2021},
eprint = {2102.12092},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}