prmon spawning nvidia-smi?
vrpascuzzi opened this issue
When launching a relatively large number of parallel athena jobs (which use `prmon`), an equally large number of `nvidia-smi` processes are spawned. Note that I am not using a GPU or any CUDA code in these jobs, but a CUDA installation is found when configuring `cmake`. Also, these spawned processes continue to run even after logging out of the machine.
While I admit our machine hasn't been too stable these days, this behaviour -- many `nvidia-smi` processes being launched "behind the scenes" -- is causing a major overload of the system, ultimately requiring a reboot.
Apologies in advance if this is unrelated to `prmon`.
Hi Vince
Thanks for the report. It is true that prmon will run `nvidia-smi` if it finds it. Also, if a GPU is found, then on each monitoring cycle `nvidia-smi` will be invoked to see if there are any processes that the monitored job has started on the GPU. (In contrast, if you don't have a GPU then prmon will forget about GPU monitoring.)
We never saw an issue with the `nvidia-smi` processes hanging, and in fact prmon waits until `nvidia-smi` has exited so that it can read the output (there's a `waitpid()` call).
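The blocking behaviour described above can be sketched in shell as a rough analogue of the fork/exec plus `waitpid()` pattern (an illustration only, not prmon's actual C++ code; `sleep 1` stands in for the `nvidia-smi` invocation):

```shell
# Rough shell analogue of prmon's sampler invocation: start the child,
# then block until it exits so its output can be read.
sleep 1 &              # stand-in for "nvidia-smi pmon -s um -c 1"
child=$!
wait "$child"          # analogue of waitpid(): no orphan is left behind
status=$?
echo "sampler exited with status $status"
```

Because the parent always waits for the child to exit, a correctly behaving prmon should not leave `nvidia-smi` processes behind.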
So... I doubt that it's a prmon-spawned monitor process, but maybe you could check by doing something like:
- Running `nvidia-smi pmon -s um -c 1` and checking that it exits. BTW, can you see if the orphaned `nvidia-smi`s have those command line arguments?
- Running `prmon -- sleep 300` and checking that there's no accumulation of `nvidia-smi` processes.
At least that could give us a clue as to whether there is something fishy going on from prmon.
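A quick way to run the second check is to count `nvidia-smi` processes before and after a short monitored job (a sketch; it assumes `prmon` is on `PATH` and falls back to a plain `sleep` if it is not):

```shell
# Count running nvidia-smi processes before and after a short prmon run.
count_smi() { pgrep -x nvidia-smi | wc -l; }

before=$(count_smi)
if command -v prmon >/dev/null 2>&1; then
    prmon -- sleep 30
else
    sleep 1   # prmon not installed here; nothing will be monitored
fi
after=$(count_smi)

echo "nvidia-smi processes: before=$before after=$after"
```

If `after` keeps growing across repeated runs, the orphans are accumulating; if the counts match, prmon is cleaning up after itself.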
Graeme
Thanks, @graeme-a-stewart. I will follow-up when our machine is back online.
In the meantime, is there an option to disable GPU monitoring?
Hi @vrpascuzzi - ah, I'm sorry, that's not available at the moment. @amete and I had a long discussion about it a while ago (#107) but we didn't converge on what the syntax should be. But given what you're asking for, I think it makes it clearer that something like `--disable nvidiamon` would do exactly what you want, right? We'll try to reinvigorate that and conclude for the next release.
That would work.
Hi @vrpascuzzi, just to say we did implement a way to disable particular monitors in master now (#178). Did you make any progress re. seeing if it was prmon that was launching the `nvidia-smi` processes that looked stuck?
I would suggest having the default be no GPU monitoring, with it enabled on request, as I think having a GPU is much less common than not having one.
The issue that Vince and I were having is likely due to the fact that there are 3 GPUs on the server, 2 NVIDIA and one AMD. The kernel crash logs suggest that the problem lies with the kernel trying to switch between GPUs, and something bad happening with the AMD amdgpu kernel module, which somehow ends up corrupting the process table. The server has 72 cores, so when fully loaded, there were a lot of nvidia-smi processes running.
BTW, I removed `nvidia-smi` from the default `PATH` on the server, and since then we haven't had any issues. So while not a smoking gun, it is rather suggestive.
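Since prmon only enables its GPU monitor when it can find `nvidia-smi`, checking whether the binary is visible on `PATH` is a quick predictor of the behaviour (a sketch; the report messages are illustrative, not prmon output):

```shell
# prmon discovers nvidia-smi via PATH, so hiding the binary is enough
# to suppress the GPU monitor. This snippet only reports what would happen.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_mon=enabled
else
    gpu_mon=disabled
fi
echo "GPU monitoring would be: $gpu_mon"
```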
@cgleggett - sorry to reply to this so late. Yes, that is a bit suggestive, but it still doesn't quite square with the way that prmon works. Very curious. We were asked by ADC to have this monitoring enabled by default, but it would be nicer if prmon could be signalled that a job will use the GPU.
One practical option might be to leave things as-is on the `prmon` side (i.e. everything is enabled by default so that ADC doesn't need to do anything special) but disable GPU metric collection in the instance of `prmon` that the job transform spawns. At this point, we don't really make use of any GPU resources from `athena` anyway - at least for now. Therefore, we wouldn't be losing anything.
Hi again @vrpascuzzi @cgleggett
Further considering this issue, we realised that the fix wouldn't help when it's the job transform that launches `prmon`, as you don't have access to the arguments it's invoked with. So we just added a new feature, where you can disable monitors via the `PRMON_DISABLE_MONITOR` environment variable (see #183 #182).
Unfortunately that will only work from the next release (but we are about to cut that).
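Usage would look something like the following (a sketch; the monitor name `nvidiamon` follows the earlier discussion in this thread and may differ in your prmon version):

```shell
# Ask prmon to skip the NVIDIA GPU monitor via the environment.
export PRMON_DISABLE_MONITOR=nvidiamon
if command -v prmon >/dev/null 2>&1; then
    prmon -- sleep 10
else
    echo "prmon not on PATH; export the variable before launching jobs"
fi
```

Since the variable is inherited by child processes, exporting it in the job wrapper covers the `prmon` instance that the job transform spawns.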
I think that's as much as we can do, so we'll close this issue here, but if you have any further insight into why this was behaving in such an odd way on your node please let us know.
Cheers, g.
Thanks Graeme!
I think we are definitely exploring an unusual corner of phase space, where we have many concurrent jobs, each spawning an `nvidia-smi` process, and a multi-GPU machine which has an AMD card in it. There are a few articles online that mention issues with the AMD `amdgpu` kernel module relating to the kernel switching between physical GPUs, but it doesn't look common. Between this env var and the fact that I now only put the NVIDIA executables in the `PATH` when I want to do something explicitly with the GPU, I think our issue is addressed. It will be interesting to see if this ever affects others.
cheers, Charles.