HSF / prmon

Standalone monitor for process resource consumption

prmon spawning nvidia-smi?

vrpascuzzi opened this issue

When launching a relatively large number of parallel athena jobs (which use prmon), an equally large number of nvidia-smi processes are spawned. Note that I am not using a GPU or any CUDA code in these jobs, but a CUDA installation is found when configuring CMake. Also, these spawned processes continue to run even after logging out of the machine.

While I admit our machine hasn't been too stable these days, this behaviour -- many nvidia-smi processes being launched "behind the scenes" -- is causing a major overload of the system, ultimately requiring a reboot.

Apologies in advance if this is unrelated to prmon.

Hi Vince

Thanks for the report. It is true that prmon will run nvidia-smi if it finds it. If a GPU is found, then on each monitoring cycle nvidia-smi will be invoked to see whether any processes have been started on the GPU by the monitored job. (In contrast, if you don't have a GPU then prmon will forget about GPU monitoring.)

We have never seen an issue with the nvidia-smi processes hanging, and in fact prmon waits until nvidia-smi has exited so that it can read its output (there's a waitpid() call).

So... I doubt that it's a prmon-spawned monitor process, but maybe you could check by doing something like:

  • Running nvidia-smi pmon -s um -c 1 and checking that it exits.
    • BTW, can you see if the orphaned nvidia-smis have those command line arguments?
  • Running prmon -- sleep 300 and checking that there's no accumulation of nvidia-smi processes (see the snippet below).
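
Something along these lines would do as a quick check (just a sketch; the pgrep counting is only one way of doing the bookkeeping, adjust as you like):

    #!/usr/bin/env bash
    # Check 1: a one-shot nvidia-smi invocation should exit promptly
    nvidia-smi pmon -s um -c 1
    echo "nvidia-smi exit code: $?"

    # Check 2: run prmon over a harmless payload and watch whether
    # nvidia-smi processes pile up while it runs
    before=$(pgrep -c -f nvidia-smi)
    prmon -- sleep 300 &
    sleep 60
    during=$(pgrep -c -f nvidia-smi)
    wait
    after=$(pgrep -c -f nvidia-smi)
    echo "nvidia-smi count before/during/after: $before/$during/$after"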

At least that could give us a clue as to whether there is something fishy going on from prmon.

Graeme

Thanks, @graeme-a-stewart. I will follow up when our machine is back online.
In the meantime, is there an option to disable GPU monitoring?

Hi @vrpascuzzi - ah, I'm sorry, that's not available at the moment. @amete and I had a long discussion about it a while ago (#107) but we didn't converge on what the syntax should be. But given what you're asking for, I think it's now clearer that something like

--disable nvidiamon

would do exactly what you want, right? We'll try to reinvigorate that discussion and conclude it for the next release.
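
In other words, something like this (proposed syntax only, not in any release yet):

    prmon --disable nvidiamon -- <your command>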

That would work.

Hi @vrpascuzzi, just to say that we have now implemented a way to disable particular monitors in master (#178). Did you make any progress on seeing whether it was prmon that was launching the nvidia-smi processes that looked stuck?

I would suggest having the default be no GPU monitoring, with it enabled on request, as I think having a GPU is much less common than not having one.

The issue that Vince and I were having is likely due to the fact that there are three GPUs on the server: two NVIDIA and one AMD. The kernel crash logs suggest that the problem lies with the kernel trying to switch between GPUs, with something bad happening in the AMD amdgpu kernel module, which somehow ends up corrupting the process table. The server has 72 cores, so when fully loaded there were a lot of nvidia-smi processes running.

BTW, I removed nvidia-smi from the default PATH on the server, and since then we haven't had any issues. So while it's not a smoking gun, it is rather suggestive.
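
A quick check that nothing picks it up from the default environment any more is something like:

    # nvidia-smi should no longer be visible on the default PATH,
    # so prmon won't find it and will skip GPU monitoring
    command -v nvidia-smi || echo "nvidia-smi not on PATH"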

@cgleggett - sorry to reply to this so late. Yes, that is a bit suggestive, but it still doesn't quite square with the way that prmon works. Very curious. We were asked by ADC to have this monitoring enabled by default, but it would be nicer if it could be signalled to prmon that a job will actually use the GPU.

One practical option might be to leave things as they are on the prmon side (i.e. everything is enabled by default, so ADC doesn't need to do anything special) but disable GPU metric collection in the instance of prmon that the job transform spawns. At this point we don't really make use of any GPU resources from athena anyway - at least for now - so we wouldn't be losing anything.

Hi again @vrpascuzzi @cgleggett

Further considering this issue, we realised that the fix wouldn't help when it's the job transform that launches prmon, as you don't have access to the arguments it's invoked with. So we just added a new feature where you can disable monitors via the PRMON_DISABLE_MONITOR environment variable (see #183, #182).
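
In a job wrapper that would look something like this (I'm reusing the nvidiamon name from the --disable discussion above - check the release notes for the exact values that are accepted):

    # Disable the nvidia monitor in any prmon launched further down the
    # chain, without touching the prmon command line itself
    export PRMON_DISABLE_MONITOR=nvidiamon
    # ...then start the transform / job as usual; the child prmon inherits this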

Unfortunately that will only work from the next release (but we are about to cut that).

I think that's as much as we can do, so we'll close this issue here, but if you have any further insight into why this was behaving in such an odd way on your node please let us know.

Cheers, g.

Thanks Graeme!

I think we are definitely exploring an unusual corner of phase space, where we have many concurrent jobs, each spawning an nvidia-smi process, and a multi-GPU machine which has an AMD card in it. There are a few articles online that mention issues with the AMD amdgpu kernel module relating to the kernel switching between physical GPUs, but it doesn't look common. Between this env var and the fact that I now only put the nvidia executables in the PATH when I want to do something explicitly with the GPU, I think our issue is addressed. It will be interesting to see if this ever affects others.

cheers, Charles.