UCBerkeleySETI / turbo_seti

turboSETI -- Python-based SETI search algorithm.

Home Page: http://turbo-seti.readthedocs.io

Feature Request: Add GPU selection argument for CLI

luigifcruz opened this issue

This is a feature requested on Slack: an argument to select which GPU device the current instance will use, for example --gpu-id 0.

I'm assigning myself to this issue.
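
A minimal sketch of how such an option could be wired up with argparse is below. The option names echo the -g, -d, and --gpu_id flags that appear later in this thread, but the exact spelling, defaults, and help text here are illustrative, not the final CLI:

```python
# Sketch only: illustrative argparse wiring for a GPU-selection option.
import argparse

parser = argparse.ArgumentParser(description="turboSETI Doppler search")
parser.add_argument("filename", help="input .h5 / filterbank file")
parser.add_argument("-g", "--gpu", default="n", choices=["y", "n"],
                    help="run the search on a GPU instead of the CPU")
parser.add_argument("-d", "--gpu_id", type=int, default=0,
                    help="index of the CUDA device to use when GPU mode is on")

args = parser.parse_args()
if args.gpu == "y":
    print(f"searching {args.filename} on GPU {args.gpu_id}")
else:
    print(f"searching {args.filename} on CPU")
```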

For what it's worth, in my experience, setting the CUDA_VISIBLE_DEVICES environment variable, as in CUDA_VISIBLE_DEVICES=3 turboSETI filename.h5, does not always seem to work. Sometimes it allocates data on GPU 0 anyway.

Unless CUDA_VISIBLE_DEVICES is actively being removed from the environment (possibly by not being passed through an SSH connection?), all CUDA apps should be constrained by that setting. If you can come up with a minimum working example to the contrary, it would probably be worthy of a bug report to NVIDIA.

@lacker Are you able to test my PR with a machine with multiple GPUs?

> Unless CUDA_VISIBLE_DEVICES is actively being removed from the environment (possibly by not being passed through an SSH connection?), all CUDA apps should be constrained by that setting. If you can come up with a minimum working example to the contrary, it would probably be worthy of a bug report to NVIDIA.

FWIW, I tried to repro, but I think my initial comment was wrong here. What I misread as a GPU-selection error was probably this: despite running in GPU mode, the turboSETI run was still blocked on the CPU for a long time, so it wasn't using any GPU at all. It wasn't misreading the CUDA_VISIBLE_DEVICES setting.

Sigh, I did manage to repro this. I just ran

CUDA_VISIBLE_DEVICES=3 turboSETI /datag/pipeline/AGBT21A_996_44/blc25/blc25_guppi_59383_54743_TIC316468545_0053.rawspec.0000.h5 -g y -o ~/xxx/

on blpc2, and it started running on GPU 0 instead of GPU 3.

@luigi
I specified turboSETI -d 3 .... on blpc2 and it was assigned to GPU 0 even though GPU 3 was available.
It does run like a bat out of hell, but we need to get the GPU assignment sorted out.

That's just the Python interpreter in the obs conda environment; nvidia-smi truncates the name when it displays it. So that's just a normal turboSETI run. I have observed work getting assigned to GPU 1 there when giving it --gpu_id=2.

An update: by default, CUDA orders GPUs differently (heuristically aiming for 0 = best; see https://stackoverflow.com/questions/13781738/how-does-cuda-assign-device-ids-to-gpus) than nvidia-smi does (by PCI bus ID). To fix this, set environment variables like CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=3, for example; that makes turboSETI use the same GPU 3 that nvidia-smi reports as GPU 3.
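
For reference, a small sketch of the same thing done from inside Python rather than on the shell command line, assuming CuPy as the GPU backend. The key point is that the variables must be set before CUDA is initialized:

```python
# Pin this process to the card nvidia-smi calls GPU 3 by forcing
# PCI-bus-ID ordering; this must happen before the first CUDA call.
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # match nvidia-smi's numbering
os.environ["CUDA_VISIBLE_DEVICES"] = "3"         # expose only that device

import cupy as cp  # assumed GPU backend

# Inside this process, the selected card now appears as device 0.
print("visible devices:", cp.cuda.runtime.getDeviceCount())
print("PCI bus id:", cp.cuda.Device(0).pci_bus_id)
```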

Fixed in PR #260.
The FindDoppler.__init__ (find_doppler.py) instantiation of the DATAHandle class (data_handler.py) neglected to pass gpu_id, so it defaulted to 0.
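
A rough sketch of the kind of change described, with heavily simplified constructors (these are not the actual turbo_seti signatures):

```python
# Simplified illustration of the bug and fix: FindDoppler must forward its
# gpu_id to DATAHandle, otherwise DATAHandle silently defaults to device 0.
class DATAHandle:
    def __init__(self, filename, gpu_id=0):
        self.filename = filename
        self.gpu_id = gpu_id  # CUDA device this handle will allocate on


class FindDoppler:
    def __init__(self, datafile, gpu_backend=False, gpu_id=0):
        self.gpu_id = gpu_id
        # Before the fix, gpu_id was not passed here, so DATAHandle always
        # ended up on GPU 0 regardless of the -d / --gpu_id setting.
        self.data_handle = DATAHandle(datafile, gpu_id=gpu_id)
```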