yalesong / pvse

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval (CVPR 2019)

Some problems about your paper

pwy-cmd opened this issue · comments

I have read your article and am very interested in your paper. However, I ran into a problem when repeating your experiment: when I evaluate the pretrained model on the COCO dataset, the results are not consistent with those reported in the paper. The only difference is that I use torch==1.2.0 and torchvision==0.4.0. Here are my results:

H:***\Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval\pvse-master>python eval.py --data_name coco --num_embeds 2 --img_attention --txt_attention --ckpt ./ckpt/coco_pvse.pth

Loading dataset
loading annotations into memory...
Done (t=0.41s)
creating index...
index created!
Computing results... (eval_on_gpu=CPU)
Images: 5000, Sentences: 25000
Image to text: 0.00, 0.30, 0.60, 773.00 (0.03), 1016.47 (0.04)
Text to image: 0.14, 0.52, 1.00, 500.00 (0.02), 500.83 (0.02)
rsum: 2.56 ar: 0.30 ari: 0.55
Image to text: 0.00, 0.60, 1.40, 781.00 (0.03), 1017.42 (0.04)
Text to image: 0.10, 0.48, 0.96, 501.00 (0.02), 500.82 (0.02)
rsum: 3.54 ar: 0.67 ari: 0.51
Image to text: 0.00, 0.20, 0.90, 776.00 (0.03), 987.25 (0.04)
Text to image: 0.14, 0.46, 0.98, 500.00 (0.02), 500.38 (0.02)
rsum: 2.68 ar: 0.37 ari: 0.53
Image to text: 0.10, 0.30, 0.50, 778.00 (0.03), 989.40 (0.04)
Text to image: 0.06, 0.50, 1.04, 501.00 (0.02), 500.60 (0.02)
rsum: 2.50 ar: 0.30 ari: 0.53
Image to text: 0.10, 0.50, 0.70, 778.00 (0.03), 985.18 (0.04)
Text to image: 0.10, 0.50, 1.00, 500.00 (0.02), 500.52 (0.02)
rsum: 2.90 ar: 0.43 ari: 0.53

Mean metrics from 5-fold evaluation:
rsum: 17.02
Average i2t Recall: 0.41
Image to text: 0.04 0.38 0.82 777.20 (0.03) 999.14 (0.04)
Average t2i Recall: 0.53
Text to image: 0.11 0.49 1.00 500.40 (0.02) 500.63 (0.02)
rsum: 0.58
Average i2t Recall: 0.08
Image to text: 0.04 0.06 0.14 3891.00 (0.16) 4999.44 (0.20)
Average t2i Recall: 0.11
Text to image: 0.02 0.10 0.22 2502.00 (0.10) 2501.31 (0.10)

Thank you for reading, and I look forward to your reply.

I can't seem to reproduce your issue. Even with torch 1.2.0 and torchvision 0.4.0, I get the expected results:

python3 eval.py --data_name coco --num_embeds 2 --img_attention --txt_attention --ckpt ./ckpt/coco_pvse.pth

Loading dataset
loading annotations into memory...
Done (t=0.21s)
creating index...
index created!
Computing results... (eval_on_gpu=False)
Images: 5000, Sentences: 25000
Image to text: 70.80, 92.70, 97.10, 1.00 (0.00), 2.20 (0.00)
Text to image: 56.48, 87.90, 94.30, 1.00 (0.00), 4.62 (0.00)
rsum: 499.28 ar: 86.87 ari: 79.56
Image to text: 69.50, 90.10, 96.20, 1.00 (0.00), 2.67 (0.00)
Text to image: 55.34, 86.02, 92.62, 1.00 (0.00), 4.87 (0.00)
rsum: 489.78 ar: 85.27 ari: 77.99
Image to text: 69.30, 91.50, 96.80, 1.00 (0.00), 2.76 (0.00)
Text to image: 54.60, 86.46, 94.18, 1.00 (0.00), 5.07 (0.00)
rsum: 492.84 ar: 85.87 ari: 78.41
Image to text: 67.20, 91.30, 96.50, 1.00 (0.00), 2.79 (0.00)
Text to image: 53.02, 85.24, 93.50, 1.00 (0.00), 4.28 (0.00)
rsum: 486.76 ar: 85.00 ari: 77.25
Image to text: 69.40, 92.50, 96.60, 1.00 (0.00), 5.06 (0.00)
Text to image: 56.60, 86.90, 94.06, 1.00 (0.00), 4.61 (0.00)
rsum: 496.06 ar: 86.17 ari: 79.19
-----------------------------------
Mean metrics from 5-fold evaluation:
rsum: 2957.66
Average i2t Recall: 85.83
Image to text: 69.24 91.62 96.64 1.00 (0.00) 3.09 (0.00)
Average t2i Recall: 78.48
Text to image: 55.21 86.50 93.73 1.00 (0.00) 4.69 (0.00)
rsum: 374.28
Average i2t Recall: 67.97
Image to text: 45.18 74.28 84.46 2.00 (0.00) 11.47 (0.00)
Average t2i Recall: 56.79
Text to image: 32.42 62.97 74.96 3.00 (0.00) 19.17 (0.00)
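For anyone comparing against the numbers above: each "Image to text" / "Text to image" line reports R@1, R@5, R@10, median rank, and mean rank, and rsum is the sum of the six recall values across both directions. A minimal sketch of how recall@K can be computed from a similarity matrix (an illustrative simplification assuming one ground-truth item per query on the diagonal; the actual eval.py handles 5 captions per image):

```python
import numpy as np

def recall_at_k(sims, k):
    """Percentage of queries whose ground-truth item (index i for
    row i) appears among the top-k ranked results. sims[i, j] is the
    score of query i against gallery item j."""
    ranks = np.argsort(-sims, axis=1)  # best match first
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return 100.0 * hits.mean()

# Toy 4x4 similarity matrix: the diagonal holds the true matches.
sims = np.array([[0.9, 0.1, 0.2, 0.0],
                 [0.3, 0.8, 0.1, 0.2],
                 [0.2, 0.7, 0.6, 0.1],   # query 2 ranks its match 2nd
                 [0.1, 0.0, 0.2, 0.5]])

print(recall_at_k(sims, 1))  # 75.0
print(recall_at_k(sims, 2))  # 100.0
```

With healthy embeddings, R@1 on COCO should land in the 50-70 range as in the numbers above; values below 1.0 (as in the first report) indicate near-random ranking, which usually points to a checkpoint or preprocessing problem rather than a metric bug.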

Can you try it on another machine? I am closing this issue for now since I can't reproduce it, but feel free to reopen it if you make any further progress on the issue.