hotpotqa/hotpot

Very slow training?

valsworthen opened this issue

Hello,

I am trying to run your code on a Tesla P100 and it takes more than an hour to compute 1000 steps of the first epoch. I noticed that the 16 GB of GPU memory are completely used, but the "GPU-Util" reported by nvidia-smi sits at only 20%, which suggests a serious optimization problem. Is that normal, or am I missing something?
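For reference, one quick check is whether utilization stays low for the whole run or only dips while batches are being loaded. A minimal sketch that polls nvidia-smi's CSV query mode (assuming nvidia-smi is on your PATH; this script is not part of the hotpot repo):

```python
# Sketch (not from the repo): periodically log GPU utilization and memory
# so you can see whether low utilization is constant or intermittent.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader",
]

def log_gpu(interval_s=5.0, duration_s=120.0):
    end = time.time() + duration_s
    while time.time() < end:
        print(subprocess.check_output(QUERY, text=True).strip())
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu()
```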

Thanks.

I also use a Tesla P100 with 16 GB of memory, but my GPU utilization is 66%.

I am using a GTX 1080 Ti; GPU utilization is 86%, about 780 ms/batch.

@valsworthen given the other two reports here, I would suggest looking into issues other than the GPU itself: PyTorch version, I/O delays (especially file system delays, which can vary a lot depending on your specific setup), among others :)
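One way to narrow this down is to time data loading and GPU compute separately in the training loop. A rough sketch of that kind of check, assuming a standard PyTorch loop; train_loader, model, criterion, and optimizer are placeholder names, not the actual hotpot code:

```python
# Sketch with placeholder names: accumulates time spent waiting on the
# DataLoader vs. time spent on GPU compute. A large data_time relative to
# compute_time points to an I/O or preprocessing bottleneck.
import time
import torch

def profile_epoch(train_loader, model, criterion, optimizer,
                  device="cuda", max_batches=200):
    data_time, compute_time = 0.0, 0.0
    model.train()
    end = time.time()
    for i, (batch, target) in enumerate(train_loader):
        data_time += time.time() - end        # waiting on the loader / disk

        t0 = time.time()
        batch, target = batch.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(batch), target)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()              # wait for kernels so timing is honest
        compute_time += time.time() - t0

        end = time.time()
        if i + 1 >= max_batches:
            break
    print(f"data: {data_time:.1f}s  compute: {compute_time:.1f}s "
          f"over {i + 1} batches")
```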

May I ask how long it takes in total to train the baseline? I'd like an estimate before I start. Thank you!

Training speed can vary greatly depending on your specific hardware/infrastructure, but here's another data point: on a Titan Xp, I get an average GPU utilization of 60-70%, and training speed is about 1500 ms/batch (minibatch size of 40).

This means roughly 1-2 hours per checkpoint, and depending on your training schedule, the number of checkpoints can vary. For instance, with the default hyperparameters, training will take at least a few checkpoints to stop (at least 3-4 hrs).
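As a back-of-the-envelope check, wall-clock time per epoch can be estimated directly from your measured batch time. The numbers below are the figures quoted above plus an approximate HotpotQA training-set size, so treat the result as a rough estimate only:

```python
# Rough estimate of training time from measured batch speed.
# All numbers are assumptions: plug in what you observe on your hardware.
sec_per_batch = 1.5            # ~1500 ms/batch (Titan Xp figure quoted above)
batch_size = 40
train_examples = 90_000        # approximate HotpotQA training set size

batches_per_epoch = train_examples // batch_size
hours_per_epoch = batches_per_epoch * sec_per_batch / 3600
print(f"~{batches_per_epoch} batches/epoch, ~{hours_per_epoch:.1f} h/epoch")
# How this maps to checkpoints depends on the checkpoint interval you configure.
```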