BruceW91 / CVSE

The official source code for the paper Consensus-Aware Visual-Semantic Embedding for Image-Text Matching (ECCV 2020)


Ablation Study (Beta=1) Results not replicable

Nishant3815 opened this issue · comments

Hi,

Thanks a ton for making this work open source. I have been following this work and trying to replicate its results for one of my experiments and to validate your consensus hypothesis. In Table 3, with Beta=1, R@1 for t2i falls from 59.9 to 54.8, and for i2t it falls from 74.8 to 72.2.

I tried running this experiment on both the Flickr and MS-COCO datasets without any changes to the code, and the results I get don't match the paper. Moreover, they suggest the consensus module has no effect, since the performance is more or less the same with and without it.

Below are the results:

  • For the COCO dataset:
  • Results at Beta=1:
    [results screenshot]
  • Results at Beta=0.9:
    [results screenshot]
  • Results on the COCO 1k test set:
    [results screenshot]
  • For the Flickr dataset:
    [results screenshot]

During your runs, did you observe a similar issue? Currently my results contradict the paper's core hypothesis, so it would be really helpful if you could guide me through this.
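For reference, the R@1 numbers discussed above can be computed from a similarity matrix roughly as below. This is only a minimal sketch that assumes one ground-truth caption per image, placed on the diagonal; the actual COCO/Flickr protocol uses 5 captions per image, so the real evaluation code differs.

```python
import numpy as np

def recall_at_1(sim):
    """R@1 for image-to-text retrieval: the fraction of rows whose
    highest-scoring column is the matching index (here, the diagonal)."""
    preds = sim.argmax(axis=1)
    return float((preds == np.arange(sim.shape[0])).mean())

# Toy 3x3 image-by-caption similarity matrix.
sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.1, 0.6],
                [0.2, 0.7, 0.4]])

print(recall_at_1(sim))    # i2t R@1; only row 0 is retrieved correctly, so 1/3
print(recall_at_1(sim.T))  # t2i R@1 is the same metric on the transpose
```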

Hello, I can't download the data from the link provided by the authors. Were you able to download it successfully?



In my experiments, I transferred the code to another machine with a somewhat different running environment, and the results were always stable under different versions of numpy and pytorch. You may need to tune the hyper-parameter Beta, and check the construction of the variable tmp in the function evaluation.label_complete(), in your environment.
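For anyone tuning Beta: its role in the ablation appears to be weighting the instance-level branch against the consensus branch, so that Beta=1 disables the consensus module entirely. The sketch below shows that interpretation as a simple convex combination of two similarity matrices; the function and variable names are hypothetical and not taken from the CVSE code.

```python
import numpy as np

def fused_similarity(sim_instance, sim_consensus, beta):
    """Convex combination of instance-level and consensus-level
    similarity matrices; with beta=1 the consensus branch contributes
    nothing, which is what the Table 3 ablation isolates."""
    return beta * sim_instance + (1.0 - beta) * sim_consensus

s_inst = np.array([[0.9, 0.1],
                   [0.2, 0.8]])
s_cons = np.array([[0.5, 0.5],
                   [0.5, 0.5]])

# Sanity check: at beta=1 the fused scores equal the instance-level scores,
# so any difference in retrieval results must come from the consensus branch.
assert np.allclose(fused_similarity(s_inst, s_cons, 1.0), s_inst)
```

If your Beta=1 and Beta=0.9 runs produce near-identical retrieval numbers under this kind of fusion, it would suggest the consensus similarities are either degenerate (near-uniform) or not being loaded correctly, which is worth checking before concluding the module has no effect.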