SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.


Issue with Parallel GPU Computing

shantanudev opened this issue · comments

Hi Sean,

I was wondering if you have run into an issue where not all of the GPUs are utilized, as shown in the nvidia-smi output below. It also will not let me use a larger batch size, even though I have more GPUs available.

+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:17.0 Off | 0 |
| N/A 82C P0 113W / 149W | 10815MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:18.0 Off | 0 |
| N/A 47C P0 73W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:19.0 Off | 0 |
| N/A 61C P0 57W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:1A.0 Off | 0 |
| N/A 51C P0 71W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:00:1B.0 Off | 0 |
| N/A 63C P0 57W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:00:1C.0 Off | 0 |
| N/A 49C P0 70W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:00:1D.0 Off | 0 |
| N/A 64C P0 57W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 49C P0 71W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6318 C /home/ec2-user/src/torch/install/bin/luajit 10757MiB |
| 1 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 2 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 3 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 4 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 5 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 6 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 7 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
+-----------------------------------------------------------------------------+

I don't have a multi-GPU node to test this on at the moment, but have you set the -nGPU flag correctly, like below?

th Train.lua -nGPU 8
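
Roughly, what -nGPU is meant to do (a simplified sketch, not the exact Train.lua code) is wrap the model in nn.DataParallelTable, which keeps one replica per GPU and splits each minibatch across them along the batch dimension:

require 'cunn'
require 'cutorch'

-- Illustrative helper (hypothetical name): replicate the model onto nGPU devices.
local function makeDataParallel(model, nGPU)
    if nGPU > 1 then
        local dpt = nn.DataParallelTable(1)  -- split the minibatch on dimension 1
        for i = 1, nGPU do
            cutorch.setDevice(i)
            dpt:add(model:clone():cuda(), i)  -- one replica per GPU
        end
        cutorch.setDevice(1)
        return dpt
    end
    return model:cuda()
end

If only GPU 0 shows activity in nvidia-smi, that wrapping step is effectively not happening, so the whole batch has to fit on a single K80's memory, which would also explain the batch-size ceiling.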

@SeanNaren Yes, I have done this. It still limits me to a batch size of about 30, even though I have 8 GPUs.

I just ran this on our internal AWS K80 server and it worked fine:

[screenshot: nvidia-smi output from 2016-11-17 17:05, showing all GPUs in use]

The server was already running something else; however, all GPUs were used when I ran th Train.lua -nGPU. Are you using the latest branch?

@SeanNaren Hmm, let me do some investigation on my end. I will let you know.
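
One thing I can check while investigating (a generic cutorch diagnostic, not anything specific to this repo) is how many devices the Torch process actually sees, e.g. whether CUDA_VISIBLE_DEVICES is restricting it:

require 'cutorch'

-- Print every CUDA device visible to this Torch process, with free memory.
print('CUDA devices visible to Torch: ' .. cutorch.getDeviceCount())
for i = 1, cutorch.getDeviceCount() do
    local props = cutorch.getDeviceProperties(i)
    local freeMem, totalMem = cutorch.getMemoryUsage(i)
    print(string.format('GPU %d: %s, %.0f / %.0f MiB free',
        i, props.name, freeMem / 1024^2, totalMem / 1024^2))
end

If this reports all 8 devices but training still only loads GPU 0, the problem is more likely in how the model is being parallelised than in the CUDA setup itself.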