Oracen-zz / MIDAS

Multiple imputation utilising denoising autoencoders for approximate Bayesian inference

GPU utilization in AWS

MarKo9 opened this issue

Hi,

Once again thanks for the effort.
I'm running the previous version of the library on AWS (p2.8xlarge) against a ~250 GB dataset. All eight GPUs appear to be available to TensorFlow:
Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:17.0, compute capability: 3.7)
2018-10-17 06:30:29.402721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:18.0, compute capability: 3.7)
2018-10-17 06:30:29.402734: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:19.0, compute capability: 3.7)
2018-10-17 06:30:29.402745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:1a.0, compute capability: 3.7)
2018-10-17 06:30:29.402757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:00:1b.0, compute capability: 3.7)
2018-10-17 06:30:29.402768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:00:1c.0, compute capability: 3.7)
2018-10-17 06:30:29.402779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:00:1d.0, compute capability: 3.7)
2018-10-17 06:30:29.402801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
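
(As a sanity check, TensorFlow 1.x can also list the devices it has registered directly; the snippet below is generic TensorFlow, not a MIDAS API:)

```python
# Generic TensorFlow 1.x snippet: list the devices the runtime has registered.
# Nothing here is specific to MIDAS.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)  # e.g. "/device:GPU:0 GPU"
```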

However, when checking with nvidia-smi, only one GPU is actually utilized:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:17.0 Off |                    0 |
| N/A   77C    P0    85W / 149W |  10931MiB / 11439MiB |     60%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:18.0 Off |                    0 |
| N/A   54C    P0    69W / 149W |  10877MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:00:19.0 Off |                    0 |
| N/A   78C    P0    60W / 149W |  10877MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   57C    P0    70W / 149W |  10875MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:00:1B.0 Off |                    0 |
| N/A   74C    P0    61W / 149W |  10875MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:00:1C.0 Off |                    0 |
| N/A   56C    P0    70W / 149W |  10875MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   77C    P0    62W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   59C    P0    70W / 149W |  10871MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2244      C   /home/ubuntu/src/anaconda3/bin/python      10912MiB |
|    1      2244      C   /home/ubuntu/src/anaconda3/bin/python      10858MiB |
|    2      2244      C   /home/ubuntu/src/anaconda3/bin/python      10858MiB |
|    3      2244      C   /home/ubuntu/src/anaconda3/bin/python      10856MiB |
|    4      2244      C   /home/ubuntu/src/anaconda3/bin/python      10856MiB |
|    5      2244      C   /home/ubuntu/src/anaconda3/bin/python      10856MiB |
|    6      2244      C   /home/ubuntu/src/anaconda3/bin/python      10854MiB |
|    7      2244      C   /home/ubuntu/src/anaconda3/bin/python      10854MiB |
+-----------------------------------------------------------------------------+

Is the library designed to utilize all of the GPUs available on the system by default?
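
(For context: the output above is consistent with stock TensorFlow 1.x behaviour, which reserves almost all memory on every visible GPU but, absent explicit multi-GPU placement in the graph, runs ops on /gpu:0 only. Below is a minimal, generic sketch of confining TF to one GPU; it assumes the environment variable can be set before TensorFlow is imported, and it is not a MIDAS-specific API:)

```python
# Generic TF 1.x sketch: restrict TensorFlow to a single GPU so the other
# devices are left free for other processes. The environment variable must
# be set before TensorFlow is imported. This is standard CUDA/TF behaviour,
# not a MIDAS API.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0 to this process

import tensorflow as tf

# Alternatively (or additionally), stop TF from pre-allocating all memory
# on every visible device via the session config:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True        # allocate GPU memory on demand
config.gpu_options.visible_device_list = "0"  # use only the first visible GPU
sess = tf.Session(config=config)
```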

main library versions:
tensorflow 1.4.0rc0
numpy 1.13.3
pandas 0.20.3 py36h6022372_2
Cuda compilation tools, release 9.0, V9.0.176

Thanks in advance.

@MarKo9 Did you figure this out? I am having the same issue.

Nice manners towards people who test your library and bother to share any issues they find, even if those issues turn out to be wrong.
BTW, I was concerned enough that I coded my own solution, just not based on your library.
A MICE-like, xgboost-based custom imputation was considerably more accurate, at least on my dataset (not to mention the speed); you may want to benchmark against it before your next release.
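
(For readers landing here: a MICE-style loop with gradient-boosted trees is roughly the sketch below. This illustrates the general technique, not @MarKo9's actual code; the function name, hyperparameters, and the all-numeric-DataFrame assumption are illustrative:)

```python
# Illustrative MICE-like imputation with XGBoost, sketching the approach
# described above. Not @MarKo9's code; assumes an all-numeric DataFrame.
import pandas as pd
from xgboost import XGBRegressor

def mice_xgb_impute(df: pd.DataFrame, n_rounds: int = 5) -> pd.DataFrame:
    """Iteratively re-impute each incomplete column by regressing it on
    all other columns, MICE-style, with XGBoost as the conditional model."""
    miss_mask = df.isna()
    data = df.fillna(df.mean())  # crude initial fill so every regression has complete inputs
    incomplete_cols = [c for c in df.columns if miss_mask[c].any()]
    for _ in range(n_rounds):
        for col in incomplete_cols:
            observed = ~miss_mask[col]
            X = data.drop(columns=[col])
            model = XGBRegressor(n_estimators=100, max_depth=4)
            model.fit(X[observed], df.loc[observed, col])
            # Overwrite only the originally-missing entries with predictions.
            data.loc[~observed, col] = model.predict(X[~observed])
    return data
```

Note that this produces a single completed dataset; proper multiple imputation would repeat the loop over bootstrapped samples or different seeds to generate several of them.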

@MarKo9 do you have code for this? I would be interested in checking it out. Thanks for the reply @ranjitlall.