Problem setting fan speed
AxteRay opened this issue
Hi, I have a strange problem here that I don't know how to fix; any help or clues would be appreciated.
I have 2x 1080 Ti GPUs, with one monitor plugged into the iGPU. My intention is to use the GPUs for CUDA and the iGPU for display output. In xorg.conf I first assigned a screen to each card while setting Coolbits, and nvfancontrol works normally under that setup. But since the desktop environment runs on GPU 0 and uses Optimus (or something like that) to output through the iGPU to the monitor, it not only takes up VRAM but is also very buggy. nvidia-smi looks like this; the desktop environment takes some VRAM:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 23% 31C P8 8W / 250W | 204MiB / 11178MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 20% 23C P8 7W / 250W | 28MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 10376 G /usr/bin/Xorg.bin 121MiB |
| 0 N/A N/A 10542 G /usr/bin/kwin_x11 38MiB |
| 0 N/A N/A 10587 G /usr/bin/plasmashell 37MiB |
| 1 N/A N/A 10376 G /usr/bin/Xorg.bin 17MiB |
| 1 N/A N/A 10545 G /usr/bin/kwin_x11 7MiB |
+-----------------------------------------------------------------------------+
So I added another screen in xorg.conf running on the iGPU, which perfectly addressed the problems above, but nvfancontrol refuses to work now, while I can still set the fan speed normally in NVIDIA X Server Settings. nvidia-smi now looks like this; you can see that the X server runs on both GPUs, but the desktop environment and applications are not on them (they are on the iGPU instead):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 Off | N/A |
| 23% 34C P8 9W / 250W | 52MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 20% 26C P8 8W / 250W | 52MiB / 11178MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 13864 G /usr/bin/Xorg.bin 17MiB |
| 0 N/A N/A 14173 G /usr/bin/kwin_x11 31MiB |
| 1 N/A N/A 13864 G /usr/bin/Xorg.bin 17MiB |
| 1 N/A N/A 14174 G /usr/bin/kwin_x11 31MiB |
+-----------------------------------------------------------------------------+
I tried using the binary you posted here and got the following output:
NvidiaControl::init() opening display :0
NvidiaControl::init() display :0 opened successfully
NvidiaControl::init() counting GPUs
NvidiaControl::init() GPUs enumerated
get_version() enter
get_version() querying attribute CTRL_ATTR::NVIDIA_DRIVER_VERSION
X Error of failed request: BadMatch (invalid parameter attributes)
Major opcode of failed request: 156 (NV-CONTROL)
Minor opcode of failed request: 4 ()
Serial number of failed request: 14
Current serial number in output stream: 14
Note that I can still set the fan speed normally in NVIDIA X Server Settings when this happens. I also noticed a difference in NVIDIA X Server Settings -> Display Configuration: in the previous situation I can see my monitor plus two X screens, but in the latter one I cannot see my monitor, only the two X screens.
I strongly prefer not to run the desktop environment on the NVIDIA GPUs, so I'm curious what caused this and whether there are ways to fix it in the second situation.
Many thanks!
Hi! That's an interesting problem you've got there. XNVCtrl needs a running X display to do anything, but I'm not sure whether it needs to be physically attached to an NVIDIA card. FYI, the debug build you tried won't work anymore, so the error you're seeing is irrelevant. The output of nvfancontrol -p would be more useful in this case. Try cycling the $DISPLAY env variable to see if there's any difference in the output.
I've tried the following commands; in both situations mentioned above they gave identical output:
> nvfancontrol -p
Found 2 available GPU(s)
GPU #0: GeForce GTX 1080 Ti
COOLER-0
GPU #1: GeForce GTX 1080 Ti
COOLER-1
> echo $DISPLAY
:0
This is my xorg.conf, just for reference:
Section "ServerLayout"
Identifier "Default Layout"
Screen "Screen2"
Screen "Screen0" RightOf "Screen2"
Screen "Screen1" RightOf "Screen0"
EndSection
Section "Module"
Load "modesetting"
Load "glx"
EndSection
Section "Monitor"
Identifier "Nvidia"
VendorName "Unknown"
ModelName "Unknown"
Option "DPMS"
EndSection
Section "Monitor"
Identifier "Intel"
VendorName "Unknown"
ModelName "Unknown"
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "GeForce GTX 1080 Ti"
BusID "PCI:1:0:0"
EndSection
Section "Device"
Identifier "Device1"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "GeForce GTX 1080 Ti"
BusID "PCI:2:0:0"
EndSection
Section "Device"
Identifier "Device2"
Driver "modesetting"
BusID "PCI:0:2:0"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Nvidia"
DefaultDepth 24
Option "AllowEmptyInitialConfiguration" "True"
Option "Coolbits" "28"
SubSection "Display"
Depth 24
EndSubSection
EndSection
Section "Screen"
Identifier "Screen1"
Device "Device1"
Monitor "Nvidia"
DefaultDepth 24
Option "AllowEmptyInitialConfiguration" "True"
Option "Coolbits" "28"
SubSection "Display"
Depth 24
EndSubSection
EndSection
Section "Screen"
Identifier "Screen2"
Device "Device2"
Monitor "Intel"
DefaultDepth 24
SubSection "Display"
Depth 24
Modes "1920x1080"
EndSubSection
EndSection
Note that my monitor is always connected to the iGPU output, not an NVIDIA GPU port, so the NVIDIA cards are not physically attached to the monitor in either situation.
Let's call the first situation I mentioned situation 1 and the second one situation 2. If I comment out Screen "Screen2" (line 3 of the xorg.conf above), that gives situation 1, in which the desktop runs on the NVIDIA GPU, uses Optimus to output through the iGPU, and nvfancontrol works. Keeping Screen "Screen2" corresponds to situation 2, in which the desktop runs on the iGPU but, as said above, nvfancontrol no longer works.
Also, under situation 2 I tried plugging a monitor into an NVIDIA GPU output port. I can now see that monitor in NVIDIA X Server Settings (the one connected to the iGPU, which is visible in situation 1, is still not visible here). nvfancontrol still does not work.
So far the main difference seems to be that my shell runs on the NVIDIA GPU in situation 1 (shown as plasmashell in the nvidia-smi output) and on the iGPU in situation 2.
Any clues about what I can do? I know my case is not very common, so if it's hard to deal with, I may write a simple bash script that reads the temperature periodically and adjusts the fan speed, since I can still use nvidia-settings to do this.
Thanks for your reply
It's odd, because it should work; everything seems in place. Can you try running nvfancontrol -g 0 -m -d to see if we can get any meaningful output? Do you get the current state of the coolers on the first GPU? Please use the standard release, not the debug build from the other comment.
I'm always using the latest release, except for that one test. The following output is from situation 1:
> nvfancontrol -g 0 -m -d
INFO - Loading configuration file: "/home/asteray/.config/nvfancontrol.conf"
DEBUG - Curve points: [(25, 20), (35, 30), (45, 40), (55, 60), (65, 80), (75, 100)]
INFO - NVIDIA driver version: 460.73.01
INFO - NVIDIA graphics adapter #0: GeForce GTX 1080 Ti
INFO - GPU #0 coolers: COOLER-0
INFO - NVIDIA graphics adapter #1: GeForce GTX 1080 Ti
INFO - GPU #1 coolers: COOLER-1
INFO - Option "-m" is present; curve will have no actual effect
DEBUG - Temp: 38; Speed: [1103] RPM ([23]%); Load: 0%; Mode: Auto
The following output is from situation 2:
> nvfancontrol -g 0 -m -d
INFO - Loading configuration file: "/home/asteray/.config/nvfancontrol.conf"
DEBUG - Curve points: [(25, 20), (35, 30), (45, 40), (55, 60), (65, 80), (75, 100)]
X Error of failed request: BadMatch (invalid parameter attributes)
Major opcode of failed request: 156 (NV-CONTROL)
Minor opcode of failed request: 4 ()
Serial number of failed request: 18
Current serial number in output stream: 18
It's almost the same as what the outdated debug build gave (see my first post).
For now I've written a simple script based on nvidia-smi and nvidia-settings that reads the temperature and sets the fan speed every 5 seconds, so this should rule out the possibility that my hardware doesn't support fan speed control in situation 2.
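For reference, a fallback like that script can be sketched as follows. This is a minimal, hypothetical sketch, not the reporter's actual script: the fan curve is copied from the DEBUG log in this thread, the 5-second interval matches the description above, and it assumes fan indices match GPU indices (which may not hold on every setup).

```python
import bisect
import subprocess
import time

# Fan curve from the DEBUG log in this thread: (temperature in C, fan speed in %)
CURVE = [(25, 20), (35, 30), (45, 40), (55, 60), (65, 80), (75, 100)]

def target_speed(temp):
    """Linearly interpolate the fan speed for a given temperature."""
    if temp <= CURVE[0][0]:
        return CURVE[0][1]
    if temp >= CURVE[-1][0]:
        return CURVE[-1][1]
    i = bisect.bisect_right([t for t, _ in CURVE], temp)
    (t0, s0), (t1, s1) = CURVE[i - 1], CURVE[i]
    return round(s0 + (s1 - s0) * (temp - t0) / (t1 - t0))

def read_temps():
    """One temperature per GPU, queried via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return [int(line) for line in out.split()]

def set_speed(gpu, speed):
    """Apply the speed via nvidia-settings; requires Coolbits as in the
    xorg.conf above. Assumes fan index == GPU index."""
    subprocess.check_call(
        ["nvidia-settings",
         "-a", f"[gpu:{gpu}]/GPUFanControlState=1",
         "-a", f"[fan:{gpu}]/GPUTargetFanSpeed={speed}"])

def control_loop(interval=5):
    """Poll temperatures and adjust fan speeds every `interval` seconds."""
    while True:
        for gpu, temp in enumerate(read_temps()):
            set_speed(gpu, target_speed(temp))
        time.sleep(interval)

# control_loop()  # uncomment to run; left out so the sketch is safe to import
```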
OK, there seems to be something fundamentally wrong, because as you can see we can't even get the driver version (the NVIDIA driver version: .... line doesn't appear at all). According to the XNVCtrl documentation, BadMatch is returned when there's no driver on the specific screen (not display). The only place we call a function needing a screen attribute is when querying the driver version, and we only query screen 0. I'm wondering whether increasing the screen number in XNVCTRLQuery[XXX]Attribute will solve the problem. If you can recompile the software yourself, can you check whether changing the first zero of the XNVCTRLQueryStringAttribute call to 1 or 2 fixes it? If not, I'll put out a test build later today.
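The probing idea behind the eventual fix can be illustrated with a short Python sketch (the real fix lives in nvfancontrol's XNVCtrl bindings; `is_nv_screen` below is a stand-in for XNVCtrl's `XNVCTRLIsNvScreen(dpy, screen)`, and the screen numbers are assumptions matching situation 2):

```python
def find_nvidia_screen(screen_count, is_nv_screen):
    """Return the first X screen controlled by the NVIDIA driver, or None.

    Hardcoding screen 0 triggers BadMatch when screen 0 belongs to another
    driver -- which is what happens in situation 2, where the modesetting
    (Intel) screen comes first in the ServerLayout.
    """
    for screen in range(screen_count):
        if is_nv_screen(screen):
            return screen
    return None

# Assumed situation-2 layout: screen 0 = iGPU (modesetting),
# screens 1 and 2 = the two NVIDIA cards, so the probe settles on screen 1.
```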
I tried to build it, but installing the library it needs pulls in tons of other things, and I'm afraid it might mess up my current development environment. So it would be good if you could provide the test build. Thanks!
That's alright. I think I know what's wrong. I'll put out a test build later today.
Can you try this one please?
Run it initially with nvfancontrol -g 0 -d -m and you should get something along these lines:
...
Found NVidia screen 0
INFO - NVIDIA driver version: 465.24.02
...
Try the same with -g 1 for the other GPU. If that looks good, try running nvfancontrol as usual with the -d flag.
Thanks! This build works well under situation 2. The following is the output:
> ./nvfancontrol -g 0 -d -m
INFO - Loading configuration file: "/home/asteray/.config/nvfancontrol.conf"
DEBUG - Curve points: [(25, 20), (35, 30), (45, 40), (55, 60), (65, 80), (75, 100)]
Found NVidia screen 1
Found NVidia screen 1
INFO - NVIDIA driver version: 460.73.01
INFO - NVIDIA graphics adapter #0: GeForce GTX 1080 Ti
INFO - GPU #0 coolers: COOLER-0
INFO - NVIDIA graphics adapter #1: GeForce GTX 1080 Ti
INFO - GPU #1 coolers: COOLER-1
INFO - Option "-m" is present; curve will have no actual effect
DEBUG - Temp: 28; Speed: [1101] RPM ([23]%); Load: 0%; Mode: Manual
DEBUG - Temp: 28; Speed: [1100] RPM ([23]%); Load: 0%; Mode: Manual
Using -g 1 gives the same output as above; is that because both screens report the same name, "NVidia screen 1"?
Then I tried removing both -d and -m and launched a CUDA program; fan speed control was working as desired on both GPUs.
Thanks again for your work; this issue can be closed.
Yay! Good to know it worked. I'll massage the code a bit and put a new release out. Thank you SO much for reporting this.