This project implements a text-to-image generator using two pre-trained models, VQGAN and CLIP, to produce high-quality images from text prompts.
This project utilizes two state-of-the-art models, plus a custom one:
- CLIP (OpenAI): Encodes text prompts and images into the same latent space for comparison (loss calculation).
- VQGAN (CompVis): Generates high-quality images using a combination of GANs and transformers.
- Parameters Class: Custom module used to adjust and optimize parameters during training.
The multimodal generator uses CLIP's encoding capabilities to bring the text inputs and the images generated by VQGAN into the same latent space, which allows a loss to be calculated and the VQGAN output to be optimized against the prompt. The model was built on Google Colab to take advantage of its GPU availability.
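As a rough illustration of the idea, the snippet below encodes a text prompt and an image with the openai/CLIP package and measures how close they are in the shared latent space. The model name, file name, and cosine-distance loss are illustrative assumptions, not necessarily the exact choices made in this repo.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained CLIP model (ViT-B/32 is just an example choice).
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode the text prompt and an image into the same latent space.
text_tokens = clip.tokenize(["a watercolor painting of a forest"]).to(device)
image_input = preprocess(Image.open("generated.png")).unsqueeze(0).to(device)

with torch.no_grad():
    text_emb = model.encode_text(text_tokens)
    image_emb = model.encode_image(image_input)

# Cosine distance between the two embeddings can serve as the loss:
# the closer the image is to the prompt, the lower the value.
loss = 1 - torch.cosine_similarity(image_emb, text_emb).mean()
print(loss.item())
```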
Instructions on how to install the project locally.
# Clone this repository
$ git clone https://github.com/eduardotakemura/text-to-image-gen.git
# Go into the repository
$ cd text-to-image-gen
# Install dependencies
$ pip install -r requirements.txt
This architecture relies on on-the-fly training, which can be considered a limitation: for each input, the model learns parameters that represent it with a lower loss, so a new training run is needed for every new input. The process can be summarized as follows (a condensed code sketch appears after this list):
- Input three text-prompt components: include = what the model should include in the output image, exclude = what it should exclude from it, and extras = an additional prompt the model will take into account;
- The text prompts are encoded by CLIP, as are the images generated by VQGAN;
- For better generalization, the images are augmented and cropped before being passed to CLIP, which improves its understanding since we work with single images;
- Using the encodings, the Parameters class calculates the losses and optimizes its own parameters, repeating for a number of iterations;
- Neither CLIP nor VQGAN is trained.
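Below is a condensed sketch of that per-prompt optimization loop. The `Parameters` module, the `vqgan.decode(...)` call, and all hyperparameters are illustrative stand-ins rather than the repo's actual code; the key point is that only the latent parameters receive gradient updates while CLIP and VQGAN remain frozen.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

class Parameters(nn.Module):
    """Learnable latent codes optimized for a single prompt
    (illustrative stand-in for the project's Parameters class)."""
    def __init__(self, shape):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(shape))

    def forward(self):
        return self.latents

def generate(vqgan, clip_model, text_emb, shape=(1, 256, 16, 16),
             iterations=300, n_crops=32, lr=0.1, device="cuda"):
    """Optimize latent codes so the decoded image matches the prompt.
    Assumes `vqgan` exposes a `decode(z)` method and `clip_model`
    the usual `encode_image` method; both stay frozen."""
    params = Parameters(shape).to(device)
    optimizer = torch.optim.AdamW(params.parameters(), lr=lr)

    # Random crops/augmentations applied before CLIP, as described above.
    augment = T.Compose([
        T.RandomResizedCrop(224, scale=(0.5, 1.0)),
        T.RandomHorizontalFlip(),
    ])

    for _ in range(iterations):
        optimizer.zero_grad()
        image = vqgan.decode(params())                 # VQGAN output
        crops = torch.cat([augment(image) for _ in range(n_crops)])
        image_emb = clip_model.encode_image(crops)     # CLIP image embeddings
        # Loss: cosine distance between the image crops and the text embedding.
        loss = 1 - torch.cosine_similarity(image_emb, text_emb).mean()
        loss.backward()
        optimizer.step()                               # only `params` is updated

    return vqgan.decode(params()).detach()
```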
This project relies mainly on the following Python libraries:
- CLIP: For encoding and processing text and images;
- taming-transformers: Provides the pre-trained VQGAN (GAN + transformer) model;
- PyTorch: For building and training the models;
- torchvision: For image transformations and augmentations;
- NumPy: For numerical operations and handling tensors;
- Pillow: For image handling and manipulation;
- imageio: For reading and writing image data.
- This project was inspired by the course "Generative AI, from GANs to CLIP, with Python and Pytorch" by Javier Ideami; credit is due to the author;
- All model credits go to their respective authors: