hotpotqa/hotpot

Very slow training?

valsworthen opened this issue

Hello,

I am trying to run your code on a Tesla P100 and it takes more than an hour to compute 1000 steps of the first epoch. I noticed that the 16 GB of GPU memory are completely used, but the "GPU-Util" reported by nvidia-smi sits at only 20%, which suggests a serious optimization problem. Is that normal, or am I missing something?
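For reference, one quick check is whether utilization stays low for the whole run or only dips while batches are being loaded. A minimal sketch that polls nvidia-smi's CSV query mode (assuming nvidia-smi is on your PATH; this script is not part of the hotpot repo):

```python
# Sketch (not from the repo): periodically log GPU utilization and memory
# so you can see whether low utilization is constant or intermittent.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader",
]

def log_gpu(interval_s=5.0, duration_s=120.0):
    end = time.time() + duration_s
    while time.time() < end:
        print(subprocess.check_output(QUERY, text=True).strip())
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu()
```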

Thanks.

I also use a Tesla P100 with 16 GB of memory, but my GPU utilization is 66%.

I am using a GTX 1080 Ti; GPU utilization is 86%, about 780 ms/batch.

@valsworthen given the other two reports here, I would suggest looking into issues other than the GPU itself: PyTorch version, I/O delays (especially file system delays, which can vary a lot depending on your specific setup), among others :)
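One way to narrow this down is to time data loading and GPU compute separately in the training loop. A rough sketch of that kind of check, assuming a standard PyTorch loop; train_loader, model, criterion, and optimizer are placeholder names, not the actual hotpot code:

```python
# Sketch with placeholder names: accumulates time spent waiting on the
# DataLoader vs. time spent on GPU compute. A large data_time relative to
# compute_time points to an I/O or preprocessing bottleneck.
import time
import torch

def profile_epoch(train_loader, model, criterion, optimizer,
                  device="cuda", max_batches=200):
    data_time, compute_time = 0.0, 0.0
    model.train()
    end = time.time()
    for i, (batch, target) in enumerate(train_loader):
        data_time += time.time() - end        # waiting on the loader / disk

        t0 = time.time()
        batch, target = batch.to(device), target.to(device)
        optimizer.zero_grad()
        loss = criterion(model(batch), target)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()              # wait for kernels so timing is honest
        compute_time += time.time() - t0

        end = time.time()
        if i + 1 >= max_batches:
            break
    print(f"data: {data_time:.1f}s  compute: {compute_time:.1f}s "
          f"over {i + 1} batches")
```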

May I ask how long it takes in total to train the baseline? I'd like an estimate before I start. Thank you!

Training speed can vary greatly depending on your specific hardware/infrastructure, but here's another data point: on a Titan Xp, I get an average GPU utilization of 60-70%, and training speed is about 1500 ms/batch (minibatch size of 40).

This means roughly 1-2 hours per checkpoint, and depending on your training schedule, the number of checkpoints can vary. For instance, with the default hyperparameters, training will take at least a few checkpoints to stop (at least 3-4 hrs).
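As a back-of-the-envelope check, wall-clock time per epoch can be estimated directly from your measured batch time. The numbers below are the figures quoted above plus an approximate HotpotQA training-set size, so treat the result as a rough estimate only:

```python
# Rough estimate of training time from measured batch speed.
# All numbers are assumptions: plug in what you observe on your hardware.
sec_per_batch = 1.5            # ~1500 ms/batch (Titan Xp figure quoted above)
batch_size = 40
train_examples = 90_000        # approximate HotpotQA training set size

batches_per_epoch = train_examples // batch_size
hours_per_epoch = batches_per_epoch * sec_per_batch / 3600
print(f"~{batches_per_epoch} batches/epoch, ~{hours_per_epoch:.1f} h/epoch")
# How this maps to checkpoints depends on the checkpoint interval you configure.
```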