VQAScore for Text-to-Image Evaluation [Project Page]

TODO: Pick better teaser images because VQAScore still fails on the current Winoground sample


Quick start

Install the package via:

git clone https://github.com/linzhiqiu/t2i_metrics
cd t2i_metrics

conda create -n t2i python=3.10 -y
conda activate t2i
conda install pip -y

pip install torch torchvision torchaudio
pip install -e .

Or simply run pip install t2i_metrics (not yet available).

The following Python code is all you need to evaluate the similarity between an image and a text (higher scores mean they are semantically closer).

import t2i_metrics
clip_flant5_score = t2i_metrics.VQAScore(model='clip-flant5-xxl') # our best scoring model

# For a single (image, text) pair
image = "images/test0.jpg" # an image path in string format
text = "a young person kisses an old person"
score = clip_flant5_score(images=[image], texts=[text])

# Alternatively, to calculate the pairwise similarity scores
# between M images and N texts, run the following to get an M x N score tensor.
images = ["images/test0.jpg", "images/test1.jpg"]
texts = ["an old person kisses a young person", "a young person kisses an old person"]
scores = clip_flant5_score(images=images, texts=texts) # scores[i][j] is the score between image i and text j
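
As a small follow-up sketch (assuming the returned scores behave like a torch tensor of shape M x N, as in the example above), you can then pick the closest text for each image:

# Minimal sketch: find the best-matching text for each image.
# Assumes `scores` is a torch tensor of shape (M, N).
best_text_idx = scores.argmax(dim=1)  # index of the highest-scoring text per image
for i, j in enumerate(best_text_idx.tolist()):
    print(f"{images[i]} best matches: {texts[j]}")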

Notes on GPU and cache

  • GPU usage: The above scripts will by default use the first CUDA device on your machine. We recommend a 40GB GPU for the largest VQA models such as clip-flant5-xxl and llava-v1.5-13b. If you have limited GPU memory, consider smaller models such as clip-flant5-xl and llava-v1.5-7b (see the sketch after this list for pinning a specific GPU).
  • Cache directory: You can change the cache folder (default is ./hf_cache/) by updating HF_CACHE_DIR in t2i_metrics/constants.py.
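
For example, here is a minimal sketch for pinning the scorer to a single GPU using the standard CUDA_VISIBLE_DEVICES environment variable; set it before CUDA is initialized, i.e., before constructing the model:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only the second GPU to PyTorch

import t2i_metrics
# A smaller model that fits more easily into limited GPU memory
clip_flant5_score = t2i_metrics.VQAScore(model='clip-flant5-xl')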

Advanced Usage

Batch processing for more image-text pairs

If you have a large dataset of M images x N texts, then you can optionally speed up inference using the following batch processing script.

import t2i_metrics
clip_flant5_score = t2i_metrics.VQAScore(model='clip-flant5-xxl')

# The number of images and texts per dictionary must be consistent.
# E.g., the below example shows how to evaluate 4 generated images per text
dataset = [
  {'images': ["images/sdxl_0.jpg", "images/dalle3_0.jpg", "images/deepfloyd_0.jpg", "images/imagen2_0.jpg"], 'texts': ["an old person kisses a young person"]},
  {'images': ["images/sdxl_1.jpg", "images/dalle3_1.jpg", "images/deepfloyd_1.jpg", "images/imagen2_1.jpg"], 'texts': ["a young person kissing an old person"]},
  #...
]
scores = clip_flant5_score.batch_forward(dataset=dataset, batch_size=16) # (n_sample, 4, 1) tensor
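
As a follow-up sketch (assuming scores is a torch tensor of shape (n_sample, 4, 1), as noted above), you could then pick the highest-scoring generated image for each text:

per_text_scores = scores.squeeze(-1)            # (n_sample, 4)
best_image_idx = per_text_scores.argmax(dim=1)  # (n_sample,)
for sample, idx in zip(dataset, best_image_idx.tolist()):
    print(f"Best image for '{sample['texts'][0]}': {sample['images'][idx]}")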

Specifying your own question and answer template for VQAScore

For VQAScore, the question and answer templates can affect the final performance. We provide a simple default template for each model. For example, CLIP-FlanT5 and LLaVA-1.5 use the template below, which can be found at t2i_metrics/models/vqascore_models/clip_t5_model.py (we omit the prepended system message for simplicity):

# {} will be replaced by the caption
default_question_template = "Is the image showing '{}'? Please answer yes or no."
default_answer_template = "Yes"

You can specify your own template by passing question_template and answer_template to the forward() or batch_forward() function:

# An alternative template for VQAScore
question_template = "Does the image show '{}'? Please answer yes or no."
answer_template = "Yes"

scores = clip_flant5_score(images=images,
                           texts=texts,
                           question_template=question_template,
                           answer_template=answer_template)

You can also compute P(caption | image) (VisualGPTScore) instead of P(answer | image, question):

vgpt_question_template = "" # no question
vgpt_answer_template = "{}" # simply calculate P(caption | image)

scores = clip_flant5_score(images=images,
                           texts=texts,
                           question_template=vgpt_question_template,
                           answer_template=vgpt_answer_template)

Check all supported models

We currently support CLIP-FlanT5, LLaVA-1.5, and InstructBLIP for VQAScore. We also support CLIPScore using CLIP, and ITMScore using BLIPv2:

llava_score = t2i_metrics.VQAScore(model='llava-v1.5-13b') # LLaVA-1.5 is the second best
clip_score = t2i_metrics.CLIPScore(model='openai:ViT-L-14-336')
blip_itm_score = t2i_metrics.ITMScore(model='blip2-itm') 

You can check all supported models by running the below commands:

print("VQAScore models:")
print(t2i_metrics.list_all_vqascore_models())
print()
print("ITMScore models:")
print(t2i_metrics.list_all_itmscore_models())
print()
print("CLIPScore models:")
print(t2i_metrics.list_all_clipscore_models())

Evaluating on Winoground/EqBen/TIFA

You can easily test on these vision-language benchmarks by running:

python eval.py --model clip-flant5-xxl
python eval.py --model llava-v1.5-13b
python eval.py --model blip2-itm
python eval.py --model openai:ViT-L-14

# You can optionally specify question/answer template, for example:
python eval.py --model clip-flant5-xxl --question "Question: Is the image showing '{}'?" --answer "Yes"

Implement New Scoring Metrics

You can easily implement your own scoring metric. For example, if you have a stronger VQA model, you can include it in t2i_metrics/models/vqascore_models. Please check out our implementations of LLaVA-1.5 and InstructBLIP as a starting point.
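
For illustration only, a rough skeleton of what such a model might look like is shown below; the class name, base class, and method signatures here are assumptions, so mirror the actual interface of the existing LLaVA-1.5 and InstructBLIP files rather than this sketch:

# Hypothetical skeleton for a new VQA-based scoring model (names are illustrative).
# Copy the structure of the existing files in t2i_metrics/models/vqascore_models
# and register your model the same way those implementations do.
import torch

class MyVQAScoreModel:
    def __init__(self, model_name='my-vqa-model', device='cuda', cache_dir='./hf_cache/'):
        self.device = device
        self.cache_dir = cache_dir
        self.load_model()  # load your VQA backbone and tokenizer here

    def load_model(self):
        raise NotImplementedError  # e.g., download weights into self.cache_dir

    @torch.no_grad()
    def forward(self, images, texts,
                question_template="Is the image showing '{}'? Please answer yes or no.",
                answer_template="Yes"):
        # Return a tensor of P(answer | image, question) for each (image, text) pair.
        raise NotImplementedError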

Acknowledgements

This repository is inspired by the Perceptual Metric (LPIPS) repository by Richard Zhang for automatic evaluation of image-to-image similarity.
