Cannot reproduce zero-shot retrieval performance
yangbang18 opened this issue · comments
Hi, I have downloaded the pre-trained checkpoint TCL_4m.pth you provided and prepared Flickr30k.
I ran the following command:
python -m torch.distributed.launch \
--nproc_per_node=4 \
--use_env Retrieval.py \
--config ./configs/Retrieval_flickr.yaml \
--output_dir output/pretrain_e30_Retrieval_flickr_zeroshot \
--checkpoint ./data/TCL_4M.pth \
--evaluate
Here are the results I get:
{"val_txt_r1": 87.96844181459566, "val_txt_r5": 98.12623274161736, "val_txt_r10": 99.40828402366864, "val_txt_r_mean": 95.16765285996057, "val_img_r1": 72.07100591715977, "val_img_r5": 90.55226824457594, "val_img_r10": 94.5759368836292, "val_img_r_mean": 85.73307034845497, "val_r_mean": 90.45036160420777, "test_txt_r1": 89.4, "test_txt_r5": 98.6, "test_txt_r10": 99.6, "test_txt_r_mean": 95.86666666666667, "test_img_r1": 73.36, "test_img_r5": 92.16, "test_img_r10": 95.52, "test_img_r_mean": 87.01333333333332, "test_r_mean": 91.44, "epoch": 0}
According to Table 2 in your paper, zero-shot R@1 performance on the Flickr30K test set is 93.0 (text retrieval) and 79.6 (image retrieval). But what I get is test_txt_r1 = 89.4 and test_img_r1 = 73.36.
Am I doing something wrong?
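To make the gap concrete, here is a minimal sketch that compares the two test R@1 values from the log above with the numbers quoted from Table 2. The helper `recall_gaps` and the variable names are hypothetical and not part of the TCL repository; only the metric keys and values come from the output above.

```python
import json

# Subset of the metrics logged by the evaluation run above.
logged = json.loads('{"test_txt_r1": 89.4, "test_img_r1": 73.36}')

# Zero-shot R@1 values on Flickr30K quoted from Table 2 of the paper.
reported = {"test_txt_r1": 93.0, "test_img_r1": 79.6}


def recall_gaps(reported, logged):
    """Return (reported - logged) per metric, rounded to 2 decimals."""
    return {key: round(reported[key] - logged[key], 2) for key in reported}


gaps = recall_gaps(reported, logged)
for name, gap in gaps.items():
    print(f"{name}: reported {reported[name]}, got {logged[name]}, gap {gap}")
```

The shortfall is about 3.6 points for text retrieval and about 6.2 points for image retrieval.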
I have found the reason for the inconsistent performance: "the zero-shot result on flickr is evaluated using the model finetuned on COCO". Sorry for my carelessness.