serizba / salad

Optimal Transport Aggregation for Visual Place Recognition

Discrepancy Between Reported Results and Reproduction Attempts for DinoV2+SALAD

Ahmedest61 opened this issue

Hello,

I am reaching out for assistance regarding the reproducibility of the DinoV2+SALAD model as detailed in your recent publication. I have followed the training and evaluation pipeline provided in the repository and used the conda environment you provided. However, my results do not align with those reported in the paper, specifically in Table 3 for the DinoV2+SALAD model.

Steps to Reproduce:

  1. Repository cloned from the provided link.
  2. Environment set up using the provided environment.yml.
  3. Followed the training pipeline instructions in the README, with no modifications to the default parameters.
  4. Ran the main script to perform the training.
  5. Ran the evaluation script to obtain the results.

Expected Results:
The reported results in your paper for DinoV2+SALAD:

  • Recall@1/5/10 of 92.2/96.4/97.0 on MSLS Val
  • Recall@1/5/10 of 76.0/89.2/92.0 on Nordland
  • Recall@1/5/10 of 92.1/96.2/96.5 on SPED
  • Recall@1/5/10 of 95.1/98.5/99.1 on Pitts30k_test

Actual Results:
The results I obtained were as follows:

  • Recall@1/5/10 of 91.22/96.35/96.7 on MSLS Val
  • Recall@1/5/10 of 71.41/85.65/88.91 on Nordland
  • Recall@1/5/10 of 90.94/95.72/96.38 on SPED
  • Recall@1/5/10 of 92.19/96.26/97.40 on Pitts30k_test

Considering the differences in outcomes, I would like to ask whether any additional configurations or parameters, not documented in the repository, were used to achieve the results reported in the paper.

Your assistance in resolving these reproducibility concerns would be invaluable, not only for my understanding but also for the benefit of the community at large. I am looking forward to your response and any guidance you can provide.

Thank you for your time and consideration.
hparams.txt
metrics.csv
results.txt

Hi @Ahmedest61

  • First of all, have you checked whether the provided weights obtain the reported results (or very similar ones)? Those weights were obtained with this code.
  • Secondly, keep in mind that, given the aggressive learning rate we use for faster convergence, different training runs may end up with slightly different results.
  • Finally, a few things I recommend to further improve the training metrics: make sure to evaluate at a large image resolution (like 322x322), try to train at full precision, and try to evaluate at full precision (see the sketch below).
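To make the last two suggestions concrete, here is a minimal sketch (not the repository's actual evaluation code; the model and dataloader are placeholders) of extracting descriptors at 322x322 and in full fp32 precision:

```python
# Minimal sketch: evaluate at a larger resolution and in full fp32 precision.
# Assumptions: `model` is a trained DinoV2+SALAD checkpoint loaded elsewhere,
# and `dataloader` yields (images, labels) batches built with `eval_transform`.
import torch
import torchvision.transforms as T

eval_transform = T.Compose([
    T.Resize((322, 322)),  # 322 = 23 * 14, a multiple of DINOv2's 14-pixel patch size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_descriptors(model, dataloader, device="cuda"):
    model = model.eval().float().to(device)  # keep the weights in fp32
    feats = []
    for images, _ in dataloader:
        # No autocast here: run the forward pass in full precision.
        feats.append(model(images.float().to(device)).cpu())
    return torch.cat(feats)
```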

Hope this helps!

Hey @serizba,

Thanks for your prompt and informative response. Your insights are greatly appreciated.

I have indeed utilized the provided checkpoint weights and can confirm that they produce results closely aligning with those reported. This step was instrumental in verifying the baseline performance of the system.

However, when training SALAD from scratch with the default configuration, I observed a noticeable drop in performance: an absolute reduction of roughly 5% and 2% in Recall@1 on the Nordland and SPED datasets, respectively. This variance suggests a divergence from the outcomes expected from the initial benchmarks.

Following your advice, I also experimented with training and evaluating at a higher image resolution (322x322) and with full-precision settings. Despite these adjustments, the results mirrored the earlier findings, with lower recall persisting on both the SPED and Nordland datasets.
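For concreteness, forcing full-precision training in a PyTorch Lightning setup (an assumption based on the attached hparams.txt and metrics.csv, not necessarily the repository's exact entry point) comes down to the trainer's precision flag:

```python
# Minimal sketch, assuming the training loop is driven by a PyTorch Lightning
# Trainer (an assumption; model/datamodule come from the repository's code).
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=32,   # full fp32 training instead of 16-bit mixed precision
)
# trainer.fit(model, datamodule)  # placeholders for the repository's objects
```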

These observations make me wonder whether there are additional settings, beyond the aggressive learning rate and the resolution adjustments, that could bridge the gap between the expected and actual performance.

Hello @serizba,

The performance of the provided checkpoint is very impressive, which confirms the results of the paper.
However, I encountered a similar problem to @Ahmedest61 when replicating the training, obtaining almost the same experimental results.
Recall@1 (%)          | SPED  | Nordland
The pre-trained model | 92.09 | 76.49
The replicated model  | 90.94 | 70.07
DINO-NetVLAD (8192)   | 90.60 | 70.10

I am more concerned about the two issues that arise from this:

  1. The experimental results of our reproduced SALAD are similar to those of the NetVLAD method in the ablation experiment, making it impossible to determine whether SALAD or NetVLAD is better.
  2. If the aggressive learning rate setting can lead to differences between runs, then isn't the conclusion about how many network layers to freeze also questionable?

Looking forward to your answer.

Hi @BinuxLiu

Indeed, there is a bit of noise in the training, which may produce slightly different results for different runs. This is especially noticeable on Nordland and also happens with other models such as MixVPR (as confirmed by the authors).
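One way to make such comparisons more meaningful is to seed each run and report the spread over several runs rather than single numbers. A minimal sketch, assuming a PyTorch Lightning based training script (the recall values below are placeholders, not real results):

```python
# Minimal sketch for quantifying run-to-run noise. Assumptions: the training
# script is PyTorch Lightning based and its seed can be set; the recall values
# below are placeholders to be replaced by the numbers each seeded run produces.
import numpy as np
import pytorch_lightning as pl

# Before each training run, seed Python, NumPy, torch and the dataloader workers:
pl.seed_everything(0, workers=True)

# After N seeded runs, summarize the metric instead of comparing single runs:
nordland_r1 = np.array([0.0, 0.0, 0.0])  # placeholder Recall@1 values, one per run
print(f"Nordland R@1: {nordland_r1.mean():.2f} ± {nordland_r1.std():.2f} "
      f"over {len(nordland_r1)} runs")
```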

Regarding your points:

  1. One of the advantages of SALAD is that it easily allows for significantly smaller descriptors. We will soon update the camera-ready version of the paper with results for smaller descriptors (512, 2048). As shown in the ablations table, NetVLAD quickly loses performance when its dimensionality is reduced (see the sketch after this list).
  2. We base our conclusions on our own empirical observations, although multiple runs of the methods may yield subtle differences.
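For context, NetVLAD-style descriptors are usually compressed with PCA whitening, which is the kind of dimensionality reduction referred to above. A generic sketch (not the repository's code, and not how SALAD itself produces its smaller descriptors):

```python
# Generic PCA-whitening reduction of global descriptors before retrieval.
# Assumptions: db_desc and q_desc are (N, D) numpy arrays of database and
# query descriptors produced elsewhere; `dim` is the target dimensionality.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

def reduce_descriptors(db_desc: np.ndarray, q_desc: np.ndarray, dim: int = 512):
    """Fit PCA whitening on database descriptors and project both sets to `dim`."""
    pca = PCA(n_components=dim, whiten=True).fit(db_desc)
    db_red = normalize(pca.transform(db_desc))  # L2-normalize after projection
    q_red = normalize(pca.transform(q_desc))
    return db_red, q_red

# Retrieval then proceeds as usual (nearest neighbours on the reduced vectors),
# and Recall@1/5/10 can be recomputed at each target dimensionality.
```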

Best

Hi @serizba

Thanks for your prompt and informative response.