This project implements a text-to-image generator using two pre-trained models, VQGAN and CLIP, to produce high-quality images from text prompts.
This project utilizes two state-of-the-art models, plus a custom one:
- CLIP (OpenAI): Encodes text prompts and images into the same latent space for comparison (loss calculation).
- VQGAN (CompVis): Generates high-quality images using a combination of GANs and transformers.
- Parameters Class: Custom module used to adjust and optimize parameters during training.
The multimodal generator uses CLIP's encoding capabilities to bring the text inputs and the images generated by VQGAN into the same latent space, which allows a loss to be calculated and the VQGAN output to be optimized against the prompt. The model was built on Google Colab to take advantage of its GPU availability.
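As a rough illustration of the idea, the snippet below encodes a text prompt and an image with the openai/CLIP package and measures how close they are in the shared latent space. The model name, file name, and cosine-distance loss are illustrative assumptions, not necessarily the exact choices made in this repo.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained CLIP model (ViT-B/32 is just an example choice).
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode the text prompt and an image into the same latent space.
text_tokens = clip.tokenize(["a watercolor painting of a forest"]).to(device)
image_input = preprocess(Image.open("generated.png")).unsqueeze(0).to(device)

with torch.no_grad():
    text_emb = model.encode_text(text_tokens)
    image_emb = model.encode_image(image_input)

# Cosine distance between the two embeddings can serve as the loss:
# the closer the image is to the prompt, the lower the value.
loss = 1 - torch.cosine_similarity(image_emb, text_emb).mean()
print(loss.item())
```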
Instructions on how to install the project locally.
# Clone this repository
$ git clone https://github.com/eduardotakemura/text-to-image-gen.git
# Go into the repository
$ cd text-to-image-gen
# Install dependencies
$ pip install -r requirements.txt
This architecture relies on on-the-fly training, which can be considered a limitation: for each input, the model learns parameters that represent it with a lower loss, so a new training run is needed for every new input. The process can be summarized as follows (a condensed code sketch appears after this list):
- Input three text-prompt components: include = what the model should include in the output image, exclude = what it should exclude from it, and extras = an additional prompt the model will take into account;
- The text prompts are encoded by CLIP, as are the images generated by VQGAN;
- For better generalization, the images are augmented and cropped before being passed to CLIP, which improves its understanding since we work with single images;
- Using the encodings, the Parameters class calculates the losses and optimizes its own parameters, repeating for a number of iterations;
- Neither CLIP nor VQGAN is trained.
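Below is a condensed sketch of that per-prompt optimization loop. The `Parameters` module, the `vqgan.decode(...)` call, and all hyperparameters are illustrative stand-ins rather than the repo's actual code; the key point is that only the latent parameters receive gradient updates while CLIP and VQGAN remain frozen.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

class Parameters(nn.Module):
    """Learnable latent codes optimized for a single prompt
    (illustrative stand-in for the project's Parameters class)."""
    def __init__(self, shape):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(shape))

    def forward(self):
        return self.latents

def generate(vqgan, clip_model, text_emb, shape=(1, 256, 16, 16),
             iterations=300, n_crops=32, lr=0.1, device="cuda"):
    """Optimize latent codes so the decoded image matches the prompt.
    Assumes `vqgan` exposes a `decode(z)` method and `clip_model`
    the usual `encode_image` method; both stay frozen."""
    params = Parameters(shape).to(device)
    optimizer = torch.optim.AdamW(params.parameters(), lr=lr)

    # Random crops/augmentations applied before CLIP, as described above.
    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.5, 1.0)),
        T.RandomHorizontalFlip(),
    ])

    for _ in range(iterations):
        optimizer.zero_grad()
        image = vqgan.decode(params())                 # VQGAN output
        crops = torch.cat([augment(image) for _ in range(n_crops)])
        image_emb = clip_model.encode_image(crops)     # CLIP image embeddings
        # Loss: cosine distance between the image crops and the text embedding.
        loss = 1 - torch.cosine_similarity(image_emb, text_emb).mean()
        loss.backward()
        optimizer.step()                               # only `params` is updated

    return vqgan.decode(params()).detach()
```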
This project relies mainly on the following Python libraries:
- CLIP: For encoding and processing text and images;
- taming-transformers: Provides the pre-trained VQGAN (GAN + transformer) model;
- PyTorch: For building and training the models;
- torchvision: For image transformations and augmentations;
- NumPy: For numerical operations and handling tensors;
- Pillow: For image handling and manipulation;
- imageio: For reading and writing image data.
- This project was inspired by the course "Generative AI, from GANs to CLIP, with Python and Pytorch" by Javier Ideami; credit is due to the author;
- All model credits go to their respective authors: