indropal / GenerativeDeepLearningwithMultimodality

Convert text prompts into images with CLIP & VQ-GAN model architectures.


Generative Deep Learning with Multiple Modalities

This project explores Generative Deep Learning techniques to inter-relate data from multiple modalities, i.e. text, image, speech etc. It makes use of State-of-The-Art (SOTA) generative Deep Learning architectures arranged in a complementary fashion within a feedback loop.

This project makes use of the following SOTA Deep-Learning architectures to generatively convert text-prompts / sentences into images:

  • OpenAI's Contrastive Language–Image Pre-training (CLIP) — to relate learnt visual representations with natural language (a minimal usage sketch follows this list).
  • Vector-Quantized Generative Adversarial Networks (VQ-GAN) — to synthesize high-quality images.
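
As a point of reference, the snippet below is a minimal, standalone sketch (not code from this repository) of how CLIP can score an image against candidate text prompts using cosine similarity; the image path `example.jpg` and the prompt strings are placeholders.

```python
# Minimal CLIP usage sketch: score one image against candidate text prompts.
# Standalone illustration only -- not code from this repository.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# `example.jpg` is a placeholder path for any image you want to score.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a garden of words", "a forest with purple trees"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)

# Cosine similarity = dot product of L2-normalised embeddings.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(0)
print(similarity.tolist())  # one score per prompt, higher = closer match
```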


Example of interpolation between generated images with a 'Sfumato' art effect, produced by the Generative DL architecture proposed in this project


A brief explanation of the approach taken to build the Generative Deep Learning architecture

We use CLIP to tokenize and encode the provided text prompt, relating it to CLIP's learnt visual representations. Image generation begins with an initialised noisy image of pre-set dimensions (the user-defined default is 400 pixels × 400 pixels) which contains no visual information. This image is augmented and cropped, and the stack of augmented crops (which provides visual context / information to CLIP) is encoded by CLIP as well. The respective encodings of the image (which starts out as noise) and the text prompt are compared with Cosine Similarity, which quantifies how close the two encodings are in representative context and serves as the performance indicator / loss of the entire architecture. The objective is to make the encodings of text and image match, or come as close to each other as possible. These encodings are then mapped to the learnt latent space of the VQ-GAN to generate the required image. Thus, the overall architecture keeps CLIP and VQ-GAN in a feedback loop in which the generated image (initialised as noise) is refined over successive iterations, with the encodings of the generated image and the text prompt drawing closer to each other and thereby minimising the evaluated loss.
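
A condensed sketch of this feedback loop is shown below. It is illustrative rather than a drop-in version of the project's notebook: `load_vqgan_decoder()` is a hypothetical helper standing in for a pre-trained VQ-GAN decoder, and the latent shape, crop count, learning rate and iteration count are assumed values.

```python
# Sketch of the CLIP + VQ-GAN feedback loop described above (illustrative only).
# `load_vqgan_decoder()` is a hypothetical helper returning a pre-trained VQ-GAN
# decoder that maps a latent tensor to an RGB image in [0, 1]; the latent shape,
# crop count, learning rate and iteration count below are assumed values.
import torch
import torch.nn.functional as F
import torchvision.transforms as T
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Encode the text prompt once; it stays fixed during optimisation.
prompt = "A Garden of Words."
with torch.no_grad():
    tokens = clip.tokenize([prompt]).to(device)
    text_emb = F.normalize(clip_model.encode_text(tokens), dim=-1)

vqgan_decode = load_vqgan_decoder()  # hypothetical: latent -> (1, 3, 400, 400) image
z = torch.randn(1, 256, 25, 25, device=device, requires_grad=True)  # noisy latent
optimizer = torch.optim.Adam([z], lr=0.1)

# CLIP expects 224x224 inputs, so each step stacks several augmented crops
# of the current image to give CLIP richer visual context.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),
    T.RandomHorizontalFlip(),
    T.Normalize((0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711)),  # CLIP's normalisation
])

for step in range(300):
    image = vqgan_decode(z).clamp(0, 1)
    crops = torch.cat([augment(image) for _ in range(16)])
    img_emb = F.normalize(clip_model.encode_image(crops), dim=-1)

    # Loss = 1 - cosine similarity between each crop and the text encoding;
    # minimising it pulls the generated image towards the prompt.
    loss = (1 - img_emb @ text_emb.T).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch only the latent `z` is optimised while the CLIP and VQ-GAN weights stay frozen, which mirrors the feedback-loop behaviour described above.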

Snapshots of images generated from various sentence prompts

Here are some of the images generated by the generative architecture presented in this project:



"A Garden of Words."



"A group of happy children."



"Sunny day with blue sky."



"Forest with purple trees."

About


License: MIT License


Languages

Language: Jupyter Notebook 100.0%