generative-model clip text-encoding text-embedding text-to-image image-embedding dall-e

DALLE_clip_score

Simple script to compute CLIP scores based on a trained DALL-e model, using OpenAI's CLIP https://github.com/openai/CLIP. CLIP scores measures the compatibility between an image and a caption. The raw value is using cosine similarity, so it is between -1 and 1. In CLIP, the value is scaled by 100 by default, giving a number between -100 and 100, where 100 means maximum compatibility between an image and text. As mentioned in https://arxiv.org/abs/2104.14806, it is rare that the score is negative, but we clamp it to have a number between 0 and 100 anyways. Typical values are around 20-30.

How to install ?

Install CLIP from https://github.com/openai/CLIP
Install DALL-E lucidrains implementation https://github.com/lucidrains/DALLE-pytorch
python setup.py install

How to use ?

Here is an example:

clip_score --dalle_path dalle.pt --image_text_folder CUB_200_2011 --taming --num_generate 1 --dump

here:

dalle_path is the path of the model trained with DALL-E using https://github.com/lucidrains/DALLE-pytorch
image_text_folder is the folder of the dataset following https://github.com/lucidrains/DALLE-pytorch/loader.py format
taming: specify that we use taming transformers as an image encoder
num_generate: number of images to generate per caption
dump: save all the generated images in the folder outputs (by default) and their respective metrics

Example output:

CLIP_score_real 30.1826171875
CLIP_score 26.7392578125
CLIP_score_top1 26.7392578125
CLIP_score_relative 0.8892822265625
CLIP_score_relative_top1 0.8892822265625
CLIP_atleast 0.7466491460800171

Note that all the metrics will also be saved on clip_score.json by default.

CLIP_score_real: average CLIP score for real images
CLIP_score: average CLIP score for all generated images.
CLIP_score_top1: for each caption, retain the generated image with best CLIP score, then compute the average CLIP score like in CLIP_score.
CLIP_score_relative: similar to https://arxiv.org/abs/2104.14806, we compute CLIP score of the generated image divided by the CLIP score of the real image, then average. In general, between 0 and 1, although it can be bigger than 1. Bigger than 1 means the CLIP score of the generated image is higher.
CLIP_score_relative_top1: same as CLIP_score_relative but using the top CLIP score like in CLIP_score_top1.
CLIP_atleast: for each caption, it is 1 if CLIP score can reach at least --clip_thresh (by default 25), 0 if not, then we average over all captions. This score gives a number between 0 and 1.

For all scores, the higher, the better.

About

Simple script to compute CLIP-based scores given a DALL-e trained model.

generative-model clip text-encoding text-embedding text-to-image image-embedding dall-e

Languages

Language:Python 100.0%