damitkwr / ESRNN-GPU

PyTorch GPU implementation of the ES-RNN model for time series forecasting

Poor Performance on GCP V100s

xanderdunn opened this issue

I'm running the code nearly unchanged on a Google Compute Engine instance with 2x NVIDIA V100 GPUs, 60 GB RAM, and 16 CPUs. config.py is unchanged.

With 15 epochs on the Quarterly data, total training time is 16.01 minutes, almost double the 8.94 minutes reported in the paper. However, the validation results at the end of epoch 15 are nearly identical to the paper's reported results:

{'Demographic': 10.814908027648926, 'Finance': 10.71678638458252, 'Industry': 7.436440944671631, 'Macro': 9.547700881958008, 'Micro': 11.63847827911377, 'Other': 7.911505699157715, 'Overall': 10.091866493225098, 'loss': 7.8162946701049805}

When I removed the model saving step, training time decreased to 15.76 minutes.
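By "model saving step" I mean the per-epoch checkpoint write. A minimal sketch of what skipping it looks like; the flag and path below are placeholders, not the repo's actual identifiers:

```python
import torch

# Placeholder names -- the repo's trainer has its own flag and checkpoint path.
SAVE_CHECKPOINTS = False

def end_of_epoch(model, epoch):
    # Skipping this per-epoch state_dict write is the ~0.25 minute difference above.
    if SAVE_CHECKPOINTS:
        torch.save(model.state_dict(), 'checkpoints/esrnn_epoch_{}.pt'.format(epoch))
```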

I downloaded the dataset from the provided link and made no changes.

I'm using updated package versions, although I wouldn't expect that to halve performance (a quick runtime check is sketched after this list):

  • pytorch 1.2
  • tensorflow 1.14.0
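
A quick runtime sanity check (plain PyTorch, nothing repo-specific) to confirm the build and the GPU actually being used:

```python
import torch

print(torch.__version__)              # 1.2.0 here
print(torch.cuda.is_available())      # True on this instance
print(torch.cuda.get_device_name(0))  # should report the V100
```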

What hardware configuration did the authors use for testing? I'm using dual V100s, the highest-end GPUs available on GCP, so I'd expect to match or outperform the reported benchmarks. Do you have any thoughts on why training is considerably slower in my setup?

Some example training output:
[Screenshot: example training output]

damitkwr commented

Wow, you are definitely using a bigger GPU than we used, which was a 1080 Ti. Increase the batch size in the config until you run out of GPU memory and you will see a big improvement. Also, there is no need for a multi-GPU setup; the code is not written to use multiple GPUs.
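
Roughly something like this when setting up a run (sketching from memory, so double-check config.py / main.py for the exact names):

```python
# Rough sketch -- the config key and import path below are from memory.
import os

# Pin the process to a single GPU; the code is not written for multi-GPU training.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from es_rnn.config import get_config  # import path assumed from the repo layout

config = get_config('Quarterly')
config['batch_size'] = 4096  # raise this until you run out of GPU memory
```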

@damitkwr Thanks a lot, that helped. With batch size 4096, training on the Quarterly data completed in 6.46 minutes. Here is an interesting comparison of loss convergence vs. batch size, where orange and gray are batch size 1024 and blue and red are batch size 4096:
[Screenshot: loss convergence curves, batch size 1024 vs. 4096]