Issue with Parallel GPU Computing
shantanudev opened this issue · comments
Hi Sean,
I was wondering if you have faced an issue where not all of the GPUs are utilized, as shown in the output below. It also will not let me use a larger batch size, even though I have more GPUs.
+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:17.0 Off | 0 |
| N/A 82C P0 113W / 149W | 10815MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:18.0 Off | 0 |
| N/A 47C P0 73W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:19.0 Off | 0 |
| N/A 61C P0 57W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:1A.0 Off | 0 |
| N/A 51C P0 71W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:00:1B.0 Off | 0 |
| N/A 63C P0 57W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:00:1C.0 Off | 0 |
| N/A 49C P0 70W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:00:1D.0 Off | 0 |
| N/A 64C P0 57W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 49C P0 71W / 149W | 208MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6318 C /home/ec2-user/src/torch/install/bin/luajit 10757MiB |
| 1 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 2 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 3 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 4 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 5 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 6 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
| 7 6318 C /home/ec2-user/src/torch/install/bin/luajit 149MiB |
+-----------------------------------------------------------------------------+
I haven't got a multi-GPU node to test this on, but have you set the -nGPU flag correctly, like below?
th Train.lua -nGPU 8
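For reference, the symptom above (GPU 0 at 99% utilization and ~10.8GiB while GPUs 1-7 sit idle at ~208MiB) is what you typically see when the model runs on a single device instead of being wrapped for data parallelism. In Torch this is usually handled with nn.DataParallelTable, which replicates the model on each GPU and splits the batch along its first dimension. A minimal sketch of that pattern is below — the variable names and the toy model are illustrative, not code from this repo:

```lua
-- Sketch only: wrap a model in nn.DataParallelTable so each GPU
-- processes a slice of the batch. Assumes cutorch/cunn are installed.
require 'cunn'

local nGPU = 8
local base = nn.Sequential():add(nn.Linear(100, 10))  -- toy model

local model
if nGPU > 1 then
   model = nn.DataParallelTable(1)        -- split along the batch dimension
   for i = 1, nGPU do
      cutorch.setDevice(i)
      model:add(base:clone():cuda(), i)   -- one replica per GPU
   end
   cutorch.setDevice(1)                   -- GPU 1 gathers gradients/output
else
   model = base:cuda()
end
```

If -nGPU is set but only GPU 0 does work, it is worth checking that the wrapped model (and not the original single-GPU module) is the one actually passed to the training loop.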
@SeanNaren Yes, I have done this. It basically limits me to a batch size of about 30, even though I have 8 GPUs.
@SeanNaren Hmm, let me do some investigation on my end. I will let you know.