XueFuzhao / GDPNet


Unable to reproduce the experimental results

longlongman opened this issue

I trained the model on the DialogRE dataset and I cannot reproduce the experimental results. Here are my experimental results.
[screenshot: experimental results table]
The last two rows are my results. My PyTorch and CUDA versions match the requirements. I used a Tesla V100 and a TITAN RTX to try to reproduce the results. I understand that the numbers will not be exactly the same on different hardware, but my results are surprisingly worse, so there may be something wrong beyond normal experimental variance. By the way, "RTX 1070" is not a real GPU model. Do you mean an RTX 2070 or a GTX 1070?

Thank you for your interest. I will fix the README soon.
Did you change --gradient_accumulation_steps from 6?
This value influences the result of the SoftDTW loss.
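
For reference, gradient accumulation in a PyTorch training loop usually follows the pattern below. This is a minimal, self-contained sketch, not GDPNet's actual code; the model, data, and loss are placeholders. It shows why changing the value changes the effective batch size and how each loss term (including a SoftDTW-style term) is scaled before `backward()`.

```python
import torch
import torch.nn as nn

gradient_accumulation_steps = 6                     # value passed by run_GDPNet.sh
model = nn.Linear(8, 1)                             # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(12)]  # dummy micro-batches

optimizer.zero_grad()
for step, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y)       # stand-in for the CE + SoftDTW objective
    (loss / gradient_accumulation_steps).backward()  # scale so accumulated grads match one big batch
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                             # one parameter update every N micro-batches
        optimizer.zero_grad()
```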

Thank you for replying. I did not change anything in this project. Here is a screenshot of the run_GDPNet.sh file I used.
[screenshot: run_GDPNet.sh]
You can see that I did not change the gradient_accumulation_steps option (red underline). But I am not sure whether I used the right version of the BERT model (green underline). Could you tell me which specific BERT you used in your experiments?

Hi, uncased BERT should be used. I uploaded the correct BERT version just now; see README.md. If you have any further questions, please feel free to discuss. Thank you again for the suggestions! :)
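
If it helps, here is a quick way to check whether a local BERT checkpoint is cased or uncased. This is a minimal sketch under assumed file names (`vocab.txt` and `bert_config.json` as in a standard BERT release; the directory name is a placeholder), not the exact layout this repo ships.

```python
import json
from pathlib import Path

ckpt_dir = Path("bert-base-uncased")                # placeholder local checkpoint directory

vocab_size = sum(1 for _ in open(ckpt_dir / "vocab.txt", encoding="utf-8"))
config = json.loads((ckpt_dir / "bert_config.json").read_text())

# BERT-base vocabulary sizes: 30522 for uncased, 28996 for cased.
variant = {30522: "uncased", 28996: "cased"}.get(vocab_size, "unknown")
print(f"vocab.txt entries: {vocab_size} -> {variant}")
print("config vocab_size matches:", config.get("vocab_size") == vocab_size)
```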

Sorry to let you know that I am still unable to reproduce the results, even with uncased BERT. Here are my results. With the uncased BERT, the results are worse than with the cased BERT 😂. I am confident that I used the right version of BERT, because I get comparable results with the uncased BERT when I reproduce the BERTs baseline results. I also double-checked the GDPNet results.
[screenshot: updated results with uncased BERT]

Sorry to hear that. There must be something wrong with the GDPNet setup. We use BERTs as the backbone to train GDPNet and adopt a residual connection, so GDPNet should not be much worse than BERTs. To help you reproduce the result, I found a V100 device and downloaded my code. I used the BERT-base-uncased (PyTorch format) checkpoint recently uploaded in the README.md as the backbone. I suspect that converting BERT from TensorFlow format to PyTorch format with different TensorFlow versions may influence the result. You can find my BERT backbone in the README.md.
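
For reference, the standard Hugging Face conversion from a TensorFlow BERT checkpoint to PyTorch looks roughly like the sketch below. It assumes the `transformers` package and TensorFlow are installed; the checkpoint paths are placeholders from Google's original BERT release, and this repo may ship its own conversion script instead.

```python
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholder paths: point these at the downloaded TF checkpoint files.
tf_checkpoint_path = "uncased_L-12_H-768_A-12/bert_model.ckpt"
bert_config_file = "uncased_L-12_H-768_A-12/bert_config.json"
pytorch_dump_path = "pytorch_model.bin"

config = BertConfig.from_json_file(bert_config_file)        # read the original BERT config
model = BertForPreTraining(config)                           # build an empty PyTorch model
load_tf_weights_in_bert(model, config, tf_checkpoint_path)   # copy TF variables into it
torch.save(model.state_dict(), pytorch_dump_path)            # save as a PyTorch state dict
```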

  1. Could you please first evaluate GDPNet using the model we released? What result do you get when you evaluate our model directly? This checks the forward pass and prediction (see the sketch after this list).
  2. Or further fine-tune our released model on your device? That way we can check whether the problem comes from initialization or backpropagation. If the fine-tuned model gets much worse, there is likely something wrong with the optimization.
  3. I recently trained our model on three different clusters and got similar results. The environment of my V100 cluster is:
    PyTorch == 1.6.0
    CUDA == 10.1
    You can find the model in README.md. I do not think the bad result comes from different hardware.
    I believe these tips will help. If you still cannot reproduce the result, please send the required files (code and your BERT backbone) to my email (xuefuzhao@outlook.com). I will do my best to run the code on my device again and send you the log and model.
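
For point 1, a quick sanity check before running the full evaluation is to inspect the released checkpoint and confirm its BERT backbone looks like bert-base-uncased. This is a minimal sketch; the filename and key layout are assumptions about a typical PyTorch checkpoint, not necessarily what was released.

```python
import torch

state = torch.load("GDPNet_best.pt", map_location="cpu")      # placeholder filename
if isinstance(state, dict) and "model_state_dict" in state:   # some checkpoints wrap the weights
    state = state["model_state_dict"]

print("number of tensors:", len(state))
emb_keys = [k for k in state if k.endswith("word_embeddings.weight")]
if emb_keys:
    vocab_size = state[emb_keys[0]].shape[0]
    # 30522 -> bert-base-uncased backbone, 28996 -> bert-base-cased backbone
    print("backbone vocabulary size:", vocab_size)
```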