fbcotter / py3nvml

Python 3 Bindings for NVML library. Get NVIDIA GPU status inside your program.

gpu_temp_max_gpu_threshold missing

leinardi opened this issue · comments

I just found out that the GPU Max Operating Temp, exported with the XML tag gpu_temp_max_gpu_threshold, is missing from py3nvml.

Do you have any plan to add it?

Also, another missing tag is the cuda_version.

Hi @leinardi. I'm not sure I totally understand what you mean. Are you referring to the xml dump from py3nvml.nvidia_smi? There is a tag called gpu_temp_max_threshold in there. As for the cuda_version, I don't know if it's possible to get that from NVML, although I may be incorrect. You can certainly get the driver version, but the CUDA version will depend on which library file you have installed on your machine.

When you say it is missing, do you mean it is available in nvml but not in py3nvml? If so, I can probably find a way to wrap the nvml function and add it.
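For reference, the driver version is already exposed by the existing low-level wrappers. A minimal, self-contained sketch (assuming NVML and a driver are installed):

from py3nvml.py3nvml import *
# NVML must be initialised before any query
nvmlInit()
print(nvmlSystemGetDriverVersion())    # e.g. '415.25'
nvmlShutdown()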

There is a tag called gpu_temp_max_threshold in there.

Hi @fbcotter, gpu_temp_max_threshold is actually another temperature:

		<temperature>
			<gpu_temp>38 C</gpu_temp>
			<gpu_temp_max_threshold>94 C</gpu_temp_max_threshold>
			<gpu_temp_slow_threshold>91 C</gpu_temp_slow_threshold>
			<gpu_temp_max_gpu_threshold>89 C</gpu_temp_max_gpu_threshold>
			<memory_temp>N/A</memory_temp>
			<gpu_temp_max_mem_threshold>N/A</gpu_temp_max_mem_threshold>
		</temperature>

As for the cuda_version, I don't know if this is possible to get from NVML

The cuda_version is now part of the nvidia-smi output:

<nvidia_smi_log>
	<timestamp>Thu Jan  3 13:25:23 2019</timestamp>
	<driver_version>415.25</driver_version>
	<cuda_version>10.0</cuda_version>
	<attached_gpus>1</attached_gpus>
...

Ahh I see, are these snippets taken from the xml dump of nvidia-smi? I can look into whether it's possible to get this info.

Yep, that's just the output of nvidia-smi -q -x.

I think I can add the video clock. The gpu_temp_max_gpu_threshold tag is not available for my GPU, so I can't check it. Can you check the following things for me please? If you run:

from py3nvml.py3nvml import *
handle = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetClockInfo(handle, 3)

Does this give you the expected video_clock output?
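For reference, 3 is NVML_CLOCK_VIDEO in nvml.h's clock-type enum, so assuming the installed py3nvml exports that constant, the same call can be written without the magic number:

from py3nvml.py3nvml import *
nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
# NVML_CLOCK_VIDEO == 3 in recent nvml.h; fall back to the literal 3 if the
# installed version does not export the constant
print(nvmlDeviceGetClockInfo(handle, NVML_CLOCK_VIDEO))
nvmlShutdown()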

Also, I think the temperature threshold you are looking for can be obtained with:

from py3nvml.py3nvml import *
handle = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetTemperatureThreshold(handle, 2)

Is that right?

If so, we can add this into the py3nvml.nvidia_smi function. We might be able to query the CUDA version from nvmlSystemGetNVMLVersion() - calling that for me gives '10.410.72', where 410.72 is my driver version.
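A quick sketch of that query; note that splitting out the leading component is only an assumption about the string format, not a documented contract:

from py3nvml.py3nvml import *
nvmlInit()
version = nvmlSystemGetNVMLVersion()    # e.g. '10.410.72'
cuda_major = version.split('.')[0]      # '10' -- assumed to track the CUDA major version
print(version, cuda_major)
nvmlShutdown()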

Having said all this, the nvidia-smi output will be continually updated by NVIDIA, and I worry that constantly updating the py3nvml.nvidia_smi function to keep the two providing the same info would be a laborious endeavour. If only these 3 tags (plus perhaps a few more) are missing, we can update it this time, but I don't know the full extent. I'm also happy, if you need the py3nvml.nvidia_smi function to stay up to date, for you to keep making pull requests for it.

Let me know whether the above code works for you.

from py3nvml.py3nvml import *
handle = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetClockInfo(handle, 3)

This works 👍

from py3nvml.py3nvml import *
handle = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetTemperatureThreshold(handle, 2)

This gives me this error:

Traceback (most recent call last):
  File "/home/leinardi/Workspace/gitlab/gwe/run", line 29, in <module>
    print("Device {}: {}".format(i,  nvmlDeviceGetTemperatureThreshold(handle, 2)))
  File "/home/leinardi/.local/lib/python3.6/site-packages/py3nvml/py3nvml.py", line 1113, in nvmlDeviceGetTemperatureThreshold
    _nvmlCheckReturn(ret)
  File "/home/leinardi/.local/lib/python3.6/site-packages/py3nvml/py3nvml.py", line 317, in _nvmlCheckReturn
    raise NVMLError(ret)
py3nvml.py3nvml.NVMLError_NotSupported: Not Supported
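Queries that a given GPU or driver does not support raise NVMLError_NotSupported, so a caller that wants a best-effort readout can guard them. A minimal sketch, using the module-level exception class shown in the traceback above:

from py3nvml import py3nvml

py3nvml.nvmlInit()
handle = py3nvml.nvmlDeviceGetHandleByIndex(0)
try:
    threshold = py3nvml.nvmlDeviceGetTemperatureThreshold(handle, 2)
except py3nvml.NVMLError_NotSupported:
    threshold = None    # this GPU/driver does not expose the queried threshold
print(threshold)
py3nvml.nvmlShutdown()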

Hi @fbcotter, I just found out that the right value for the max gpu threshold is 3 and not 2:
https://github.com/NVIDIA/nvidia-settings/blob/master/src/nvml.h#L518

I tested it and it works fine 👍

from py3nvml.py3nvml import *
handle = nvmlDeviceGetHandleByIndex(0)
nvmlDeviceGetTemperatureThreshold(handle, 3)
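In that enum, 3 is NVML_TEMPERATURE_THRESHOLD_GPU_MAX (SHUTDOWN = 0, SLOWDOWN = 1, MEM_MAX = 2, GPU_MAX = 3). Assuming a py3nvml build that exports the updated constants, the call can use the name instead of the literal:

from py3nvml.py3nvml import *
nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
# NVML_TEMPERATURE_THRESHOLD_GPU_MAX == 3 in recent nvml.h; use the literal 3
# if the installed py3nvml does not export the constant yet
print(nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_GPU_MAX))
nvmlShutdown()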

I am planning to use your library for my app, GWE:
[screenshot of GWE]

I don't care about py3nvml.nvidia_smi being maintained (I am not using it), but I would like to know if you are still planning to maintain the library, or if it would be better to just add and maintain the source inside my app's source. I am also fine with making pull requests to this repo, but I would like to know how long it might take for them to be approved and for a new release of the lib to be published before investing effort in it.

Oh nice work, good find. I can look at how easy it will be to update the package. I think there are several enums that need updating. Your app looks really nice!

As for updates, I regularly add features to py3nvml that I find useful so I do plan on maintaining it for the near future at least.

I'm currently in the process of updating py3nvml to match with the newer version of nvml, so will keep this issue open until I finish this work, hopefully in the next week or so.

Hey @fbcotter, I finally managed to publish GWE on Flathub and I just want to say thank you for the nice library 👍

Well done, that looks really nice!

I just pushed to master an update that cleans up the old enums, the root of the problem you were talking about in this thread. I also added docstrings (copied the C style ones).

I haven't decided if I want to update the xml function, as you can get everything you want from the low-level functions now. I'll keep the issue open as I think there's a lot more to think about.

Thanks a lot, looking forward to a new release 👍

Published the new release. Thanks for pointing out the problems.