Use nvidia-settings for GPU load if nvsmi load is NaN

Question

PW999 opened this issue a year ago · comments

Describe the solution you'd like

I'm running a GTX770 using the 470 drivers on Manjaro. For some reason, nvidia-smi doesn't properly report the GPU utilization. In fact, a lot of features that the card supports don't work properly with nvidia-smi, so I'm assuming this might be an issue for all older generations.

nvidia-smi --query-gpu=utilization.gpu --format=csv -l 5
utilization.gpu [%]
[N/A]

Because the load is computed as min(100, round(nvidia_gpu.gpu_util, 1)) and gpu_util is NaN here, the card always shows 100% GPU usage.
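A minimal sketch of why that expression ends up at 100, assuming the unavailable reading reaches Python as a float NaN (as the issue title suggests). The function names clamp_load and clamp_load_nan_safe are hypothetical and not part of LNXlink:

```python
import math


def clamp_load(value: float) -> float:
    """Same shape as the expression quoted above."""
    # round(nan, 1) is still nan. min() keeps its first argument unless a
    # later one compares smaller, and "nan < 100" is False, so 100 wins.
    return min(100, round(value, 1))


def clamp_load_nan_safe(value):
    """Hypothetical variant: report None instead of a misleading 100."""
    if value is None or math.isnan(value):
        return None
    return min(100, round(value, 1))


print(clamp_load(float("nan")))           # 100
print(clamp_load_nan_safe(float("nan")))  # None
print(clamp_load_nan_safe(32.04))         # 32.0
```

Note that the argument order matters: min(float("nan"), 100) would return nan instead, so the 100% reading is a side effect of how min() compares values rather than a deliberate default.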

nvidia-settings shows the GPU utilization correctly:

nvidia-settings -c :0 -q '[gpu:0]/GPUUtilization' --terse
graphics=32, memory=20, video=0, PCIe=2
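For illustration, the terse output above could be turned into numbers with a few lines of Python. This is only a sketch that assumes the output always has the key=value, comma-separated shape shown here; parse_gpu_utilization is a hypothetical helper name:

```python
def parse_gpu_utilization(output: str) -> dict:
    """Parse 'graphics=32, memory=20, video=0, PCIe=2' into a dict of ints."""
    result = {}
    for field in output.strip().split(","):
        key, sep, value = field.strip().partition("=")
        # Skip anything that is not a plain key=<integer> pair.
        if sep and value.strip().isdigit():
            result[key.strip()] = int(value.strip())
    return result


sample = "graphics=32, memory=20, video=0, PCIe=2\n"
print(parse_gpu_utilization(sample)["graphics"])  # 32
```

The graphics field is the one that corresponds to the GPU load discussed in this issue.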

It would be nice to have a fallback option that uses nvidia-settings when nvidia-smi cannot report the load.

Additional context

No response

Phillip · Answer 1 · Mon Aug 14 2023 17:10:38 GMT+0800 (China Standard Time)

If you're interested, I could try implementing this.

Vasilis Koulis · Answer 2 · Thu Aug 17 2023 05:27:08 GMT+0800 (China Standard Time)

I am not sure how to check whether the nvidia-smi result is correct because I haven't had this issue, but I'd be happy if you could implement this.

Phillip · Answer 3 · Fri Aug 18 2023 03:13:22 GMT+0800 (China Standard Time)

I was thinking it would maybe be cleaner to have separate modules then. So instead of just gpu, there could be gpu_amd, gpu_nvidia_smi and gpu_nvidia_settings, with gpu falling back to gpu_amd and gpu_nvidia_smi?

Vasilis Koulis · Answer 4 · Fri Aug 18 2023 03:35:45 GMT+0800 (China Standard Time)

It used to be like this, but I changed it to one module for all GPUs because that seemed easier to document and easier for the user to understand.

If you want, you can make a completely new module for gpu_nvidia_settings and add it to the excluded list in the consts.py file. The problem is that existing users would be affected.
Another idea would be to create a new folder, called for example custom_modules, with modules that users could download and add manually.

I would recommend the 2nd approach, but it's up to you.

Phillip · Answer 5 · Sun Aug 20 2023 17:00:38 GMT+0800 (China Standard Time)

I went for the 2nd route and published my custom module: https://github.com/PW999/lnxlink_gpu_nvidia_settings

Vasilis Koulis · Answer 6 · Sun Aug 20 2023 20:15:14 GMT+0800 (China Standard Time)

This is awesome!
Thanks for taking the time to implement this!
If you don't mind, I could add it to the documentation so that it's easier to find.

I have a minor comment:
You could create different identifiers so that it won't interfere with the original gpu module.

Phillip · Answer 7 · Mon Aug 21 2023 22:43:35 GMT+0800 (China Standard Time)

I somehow thought the name of the module would have an impact on the MQTT topics, but it doesn't, so I renamed it :) .
Feel free to add it to the documentation 👍

Vasilis Koulis · Answer 8 · Fri Aug 25 2023 02:56:58 GMT+0800 (China Standard Time)

I've added your module to the documentation.
Thanks for your contribution to my project!

Vasilis Koulis · Answer 9 · Tue Oct 10 2023 00:55:43 GMT+0800 (China Standard Time)

I got my hands on an older GPU, the GeForce GTX 660, on which I installed the 450 driver.
I've updated the dev version of LNXlink so that it uses nvidia-smi and falls back to nvidia-settings for the load if it finds a NaN value.

I chose to use only the GPU load because the rest of the nvidia-settings results were not correct.
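The fallback described above could look roughly like the following. This is a hedged sketch, not the actual LNXlink code: the helper names (run, load_from_smi, load_from_settings, gpu_load) are hypothetical, the nvidia-settings command is the one quoted earlier in this issue, and the noheader,nounits format options for nvidia-smi are an assumption added here to get a bare number.

```python
import math
import subprocess


def run(cmd):
    """Return the command's stdout, or '' if it is missing or fails."""
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=5, check=True
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return ""


def load_from_smi(text):
    """First line of nvidia-smi output as a float; NaN for '[N/A]' or ''."""
    try:
        return float(text.strip().splitlines()[0])
    except (IndexError, ValueError):
        return math.nan


def load_from_settings(text):
    """The 'graphics' field of the terse nvidia-settings output, else NaN."""
    for field in text.split(","):
        key, _, value = field.strip().partition("=")
        if key == "graphics" and value.strip().isdigit():
            return float(value)
    return math.nan


def gpu_load(smi_text, settings_text_fn):
    """Prefer nvidia-smi; query nvidia-settings only when the load is NaN."""
    load = load_from_smi(smi_text)
    if math.isnan(load):
        load = load_from_settings(settings_text_fn())
    return None if math.isnan(load) else min(100.0, round(load, 1))


if __name__ == "__main__":
    smi = run(["nvidia-smi", "--query-gpu=utilization.gpu",
               "--format=csv,noheader,nounits"])
    settings = lambda: run(["nvidia-settings", "-c", ":0", "-q",
                            "[gpu:0]/GPUUtilization", "--terse"])
    print(gpu_load(smi, settings))
```

Passing the nvidia-settings query as a callable means it only runs when nvidia-smi fails to give a number, and returning None when both sources fail avoids reporting a false 100%.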

Phillip · Answer 10 · Mon Oct 16 2023 00:54:13 GMT+0800 (China Standard Time)

Isn't it great how nvidia's own software doesn't play well with its own hardware 😅

Luckily for me it works great most of the time, but I think the issues I'm having are mostly due to it running as a service (headless), which nvidia-settings doesn't like. Restarting the service usually solves the problem, which makes it even weirder.

Vasilis Koulis · Answer 11 · Mon Oct 16 2023 01:06:28 GMT+0800 (China Standard Time)

For me it doesn't work as a headless installation.
I've tried setting XAUTHORITY as an environment variable, but it still doesn't work.
How did you manage to get information from nvidia-settings without having an active DISPLAY?

PS. I am using Ubuntu Server without any graphical interface.