uta-smile / TCL

Code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

Cannot reproduce zero-shot retrieval performance

yangbang18 opened this issue

Hi, I have downloaded the pre-trained checkpoint TCL_4m.pth you provided and prepared Flickr30k.

I ran the following command:

python -m torch.distributed.launch \
--nproc_per_node=4 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr_zeroshot \
--checkpoint ./data/TCL_4M.pth \
--evaluate

Here are the results I get:

{"val_txt_r1": 87.96844181459566, "val_txt_r5": 98.12623274161736, "val_txt_r10": 99.40828402366864, "val_txt_r_mean": 95.16765285996057, "val_img_r1": 72.07100591715977, "val_img_r5": 90.55226824457594, "val_img_r10": 94.5759368836292, "val_img_r_mean": 85.73307034845497, "val_r_mean": 90.45036160420777, "test_txt_r1": 89.4, "test_txt_r5": 98.6, "test_txt_r10": 99.6, "test_txt_r_mean": 95.86666666666667, "test_img_r1": 73.36, "test_img_r5": 92.16, "test_img_r10": 95.52, "test_img_r_mean": 87.01333333333332, "test_r_mean": 91.44, "epoch": 0}

According to Table 2 in your paper, zero-shot R@1 performance on the Flickr30K test set is 93.0 (text retrieval) and 79.6 (image retrieval). But what I get is test_txt_r1 = 89.4 and test_img_r1 = 73.36.

Did I do something wrong?

I have found the reason for the inconsistent performance: "the zero-shot result on flickr is evaluated using the model finetuned on COCO". Sorry for my carelessness.
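For anyone hitting the same discrepancy: the zero-shot Flickr30K numbers in Table 2 come from evaluating the COCO-finetuned weights on Flickr30K rather than the pre-trained TCL_4M.pth. A sketch of the corresponding command, assuming the COCO-finetuned weights are saved under a hypothetical path such as ./data/TCL_coco_finetuned.pth (only the checkpoint and output paths differ from the command above):

python -m torch.distributed.launch \
--nproc_per_node=4 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/coco_finetuned_Retrieval_flickr_zeroshot \
--checkpoint ./data/TCL_coco_finetuned.pth \
--evaluate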