CLIP-based Image-Text Matching

This project employs CLIP (Paper: https://arxiv.org/pdf/2103.00020.pdf) as a backbone to perform image-text retrieval.

Used as-is, the publicly available CLIP weights do not yield good retrieval results on Flickr30K and MSCOCO. The public model achieves:

	Image-to-Text			Text-to-Image
Dataset	R@1	R@5	R@10	R@1	R@5	R@10
MSCOCO-1K	26.1	64.6	81.2	48.0	77.5	88.2
Flickr30k	36.0	71.9	83.4	55.8	80.7	88.3
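
For reference, R@K in the table above is the percentage of queries for which at least one correct match appears among the top K retrieved items. The following is a minimal sketch of that metric in plain Python, not the evaluation code of this repository; the function name recall_at_k and the toy similarity matrix are illustrative assumptions.

```python
def recall_at_k(similarity, ground_truth, ks=(1, 5, 10)):
    """Compute Recall@K in percent.

    similarity[i][j] is the score between query i and candidate j.
    ground_truth[i] is a non-empty set of candidate indices correct for query i.
    """
    hits = {k: 0 for k in ks}
    for scores, relevant in zip(similarity, ground_truth):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # 0-based rank of the best-ranked correct candidate.
        best_rank = min(ranked.index(j) for j in relevant)
        for k in ks:
            if best_rank < k:
                hits[k] += 1
    num_queries = len(similarity)
    return {k: 100.0 * hits[k] / num_queries for k in ks}


# Toy example (made-up scores): 3 queries, 3 candidates, one correct match each.
toy_similarity = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8],  # correct candidate 1 is ranked second
    [0.5, 0.6, 0.1],  # correct candidate 2 is ranked third
]
toy_ground_truth = [{0}, {1}, {2}]
toy_recall = recall_at_k(toy_similarity, toy_ground_truth, ks=(1, 2, 3))
print(toy_recall)
```

For image-to-text retrieval each image usually has several correct captions, which is why ground_truth holds a set per query rather than a single index.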

This project trains a non-linear probe on top of CLIP features as a fine-tuning step to improve the learned representations. The added probe performs significantly better once fine-tuned on these datasets.
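
As a rough illustration of what a non-linear probe on top of CLIP features can look like, here is a minimal PyTorch sketch. It is not the architecture used in train.py: the class name NonLinearProbe, the hidden size, the 512-dimensional input (the embedding size of the ViT-B/32 CLIP model), and the random tensors standing in for precomputed CLIP features are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLinearProbe(nn.Module):
    """Hypothetical two-layer MLP applied to precomputed CLIP features."""

    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, features):
        # L2-normalize so that a dot product equals cosine similarity.
        return F.normalize(self.net(features), dim=-1)


torch.manual_seed(0)
# Random stand-ins for CLIP image and text embeddings (batch of 4).
image_features = torch.randn(4, 512)
text_features = torch.randn(4, 512)

image_probe = NonLinearProbe()
text_probe = NonLinearProbe()

# Entry [i, j] is the cosine similarity between image i and caption j.
similarity = image_probe(image_features) @ text_probe(text_features).t()
print(similarity.shape)
```

In a setup like this, a ranking or contrastive loss over the similarity matrix would drive training; which loss this project uses is not stated here.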

Install

Please follow the installation requirements from the official CLIP repository:

https://github.com/openai/CLIP

Generate Data

The model requires two txt files, generated beforehand, that list the images and the captions it will use.
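
The exact layout produced by generate_data.py is not described here. As a sketch of one common convention, the snippet below writes two line-aligned files, where line i of the image file names the image that the caption on line i of the caption file describes. The function name write_pairs, the file names, and the sample pairs are hypothetical.

```python
import os
import tempfile


def write_pairs(pairs, out_dir, split="train"):
    """Write (image_name, caption) pairs to two line-aligned txt files."""
    image_file = os.path.join(out_dir, f"{split}_images.txt")
    caption_file = os.path.join(out_dir, f"{split}_caps.txt")
    with open(image_file, "w", encoding="utf-8") as img_f, \
            open(caption_file, "w", encoding="utf-8") as cap_f:
        for image_name, caption in pairs:
            img_f.write(image_name + "\n")
            # Collapse whitespace so every caption stays on a single line.
            cap_f.write(" ".join(caption.split()) + "\n")
    return image_file, caption_file


# Made-up sample: an image with two captions is simply listed twice.
sample_pairs = [
    ("0001.jpg", "A dog runs across a field."),
    ("0001.jpg", "A brown dog playing outside."),
    ("0002.jpg", "Two people ride bicycles."),
]
with tempfile.TemporaryDirectory() as tmp_dir:
    image_path, caption_path = write_pairs(sample_pairs, tmp_dir)
    with open(image_path, encoding="utf-8") as f:
        print(f.read())
```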

Run:

$ python generate_data.py

Train

Set data_path in the dataloader to the location of your data.

To train on Flickr30K, run:

$ python train.py --data_name f30k --logger_name runs/clip_ft_f30k

To train on MSCOCO, run:

$ python train.py --data_name coco --logger_name runs/clip_ft_coco

License

Apache License 2.0

