damitkwr / ESRNN-GPU

PyTorch GPU implementation of the ES-RNN model for time series forecasting

Poor Performance on GCP V100s

xanderdunn opened this issue

I'm running the code nearly unchanged on a Google Compute Engine instance with 2x NVIDIA V100 GPUs, 60 GB RAM, and 16 CPUs. config.py is unchanged.

With 15 epochs on the Quarterly data, total training time is 16.01 minutes, almost double the 8.94 minutes reported in the paper. However, the validation results at the end of epoch 15 are nearly identical to the paper's reported results:

{'Demographic': 10.814908027648926, 'Finance': 10.71678638458252, 'Industry': 7.436440944671631, 'Macro': 9.547700881958008, 'Micro': 11.63847827911377, 'Other': 7.911505699157715, 'Overall': 10.091866493225098, 'loss': 7.8162946701049805}

When I removed the model saving step, training time decreased to 15.76 minutes.
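By "model saving step" I mean the per-epoch checkpoint write. A minimal sketch of what skipping it looks like; the flag and path below are placeholders, not the repo's actual identifiers:

```python
import torch

# Placeholder names -- the repo's trainer has its own flag and checkpoint path.
SAVE_CHECKPOINTS = False

def end_of_epoch(model, epoch):
    # Skipping this per-epoch state_dict write is the ~0.25 minute difference above.
    if SAVE_CHECKPOINTS:
        torch.save(model.state_dict(), 'checkpoints/esrnn_epoch_{}.pt'.format(epoch))
```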

I downloaded the dataset from the provided link and made no changes.

I'm using updated package versions, although I wouldn't expect that to halve performance (a quick runtime check is sketched after this list):

  • pytorch 1.2
  • tensorflow 1.14.0
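
A quick runtime sanity check (plain PyTorch, nothing repo-specific) to confirm the build and the GPU actually being used:

```python
import torch

print(torch.__version__)              # 1.2.0 here
print(torch.cuda.is_available())      # True on this instance
print(torch.cuda.get_device_name(0))  # should report the V100
```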

What hardware configuration did the authors use for testing? I'm using dual V100s, the highest-end GPUs available on GCP, so I'd expect to match or outperform the reported benchmarks. Do you have any thoughts on why training is considerably slower in my setup?

Some example training output:
[Screenshot: example training output]

damitkwr commented

Wow, you are definitely using a bigger GPU than we used, which was a 1080 Ti. Increase the batch size in the config until you run out of GPU memory and you will see a big improvement. Also, there is no need for a multi-GPU setup; the code is not written to use multiple GPUs.
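
Roughly something like this when setting up a run (sketching from memory, so double-check config.py / main.py for the exact names):

```python
# Rough sketch -- the config key and import path below are from memory.
import os

# Pin the process to a single GPU; the code is not written for multi-GPU training.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from es_rnn.config import get_config  # import path assumed from the repo layout

config = get_config('Quarterly')
config['batch_size'] = 4096  # raise this until you run out of GPU memory
```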

@damitkwr Thanks a lot, that helped. With batch size 4096, training on the Quarterly data completed in 6.46 minutes. Here is an interesting comparison of loss convergence vs. batch size, where orange and gray are batch size 1024 and blue and red are batch size 4096:
[Screenshot: loss convergence curves, batch size 1024 vs. 4096]