foucault / nvfancontrol

NVidia dynamic fan control for Linux and Windows


Problem setting fanspeed

AxteRay opened this issue

Hi, I have a strange problem here that I don't know how to fix; any help or clues would be appreciated.

I have two 1080 Ti GPUs, with one monitor plugged into the iGPU. My intention is to use the GPUs for CUDA and the iGPU for display output. In xorg.conf I initially assigned a screen to each NVIDIA card and set Coolbits, and nvfancontrol works normally with this setup. But since the desktop environment runs on GPU 0 and uses Optimus (or something like that) to output through the iGPU to the monitor, it not only takes up VRAM but is also quite buggy. nvidia-smi looks like this; the desktop environment takes up some VRAM:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 23%   31C    P8     8W / 250W |    204MiB / 11178MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 20%   23C    P8     7W / 250W |     28MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10376      G   /usr/bin/Xorg.bin                 121MiB |
|    0   N/A  N/A     10542      G   /usr/bin/kwin_x11                  38MiB |
|    0   N/A  N/A     10587      G   /usr/bin/plasmashell               37MiB |
|    1   N/A  N/A     10376      G   /usr/bin/Xorg.bin                  17MiB |
|    1   N/A  N/A     10545      G   /usr/bin/kwin_x11                   7MiB |
+-----------------------------------------------------------------------------+

So I added another screen in xorg.conf running on the iGPU, which perfectly addressed the problems above, but now nvfancontrol refuses to work, even though I can still set the fan speed normally in NVIDIA X Server Settings. nvidia-smi looks like this; you can see that the X server is running on both GPUs, but the desktop environment and applications are not (they run on the iGPU instead):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 23%   34C    P8     9W / 250W |     52MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 20%   26C    P8     8W / 250W |     52MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     13864      G   /usr/bin/Xorg.bin                  17MiB |
|    0   N/A  N/A     14173      G   /usr/bin/kwin_x11                  31MiB |
|    1   N/A  N/A     13864      G   /usr/bin/Xorg.bin                  17MiB |
|    1   N/A  N/A     14174      G   /usr/bin/kwin_x11                  31MiB |
+-----------------------------------------------------------------------------+

I tried using the binary you posted here and got the following output:

NvidiaControl::init() opening display :0
NvidiaControl::init() display :0 opened successfully
NvidiaControl::init() counting GPUs
NvidiaControl::init() GPUs enumerated
get_version() enter
get_version() querying attribute CTRL_ATTR::NVIDIA_DRIVER_VERSION
X Error of failed request:  BadMatch (invalid parameter attributes)
  Major opcode of failed request:  156 (NV-CONTROL)
  Minor opcode of failed request:  4 ()
  Serial number of failed request:  14
  Current serial number in output stream:  14

Note that I can still set the fan speed normally in NVIDIA X Server Settings when this happens. I also noticed a difference in NVIDIA X Server Settings: in the previous situation I can see my monitor plus two X screens under Display Configuration, but in the latter one I cannot see my monitor, only the two X screens.

I strongly want to avoid running the desktop environment on the NVIDIA GPUs, so I'm curious what caused this and whether there is a way to fix it in the second situation.

Many thanks!

Hi! That's an interesting problem you've got there. XNVCtrl needs a running X display to do anything, but I'm not sure if it needs to be physically attached to an NVIDIA card. FYI the debug build you tried to use won't work anymore so the error you're seeing is irrelevant. The output of nvfancontrol -p would be more useful in this case. Try cycling the $DISPLAY env variable to see if there's any difference in output.

I've tried the following commands; in both situations mentioned above they gave identical output:

> nvfancontrol -p
Found 2 available GPU(s)
GPU #0: GeForce GTX 1080 Ti 
 COOLER-0
GPU #1: GeForce GTX 1080 Ti 
 COOLER-1

> echo $DISPLAY
:0

This is my xorg.conf, just for reference

Section "ServerLayout"
    Identifier     "Default Layout"
    Screen         "Screen2"
    Screen         "Screen0" RightOf "Screen2"
    Screen         "Screen1" RightOf "Screen0"
EndSection

Section "Module"
    Load           "modesetting"
    Load           "glx"
EndSection

Section "Monitor"
    Identifier     "Nvidia"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Monitor"
    Identifier     "Intel"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1080 Ti"
    BusID          "PCI:1:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GTX 1080 Ti"
    BusID          "PCI:2:0:0"
EndSection

Section "Device"
    Identifier     "Device2"
    Driver         "modesetting"
    BusID          "PCI:0:2:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Nvidia"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
    SubSection "Display"
        Depth      24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Nvidia"
    DefaultDepth    24
    Option         "AllowEmptyInitialConfiguration" "True"
    Option         "Coolbits" "28"
    SubSection "Display"
        Depth      24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen2"
    Device         "Device2"
    Monitor        "Intel"
    DefaultDepth    24
    SubSection "Display"
        Depth      24
        Modes      "1920x1080"
    EndSubSection
EndSection

Note that my monitor is always connected to the iGPU output, not an NVIDIA GPU port, so the NVIDIA cards are not physically attached to the monitor in either situation.

Let's call the first situation I mentioned situation 1 and the second one situation 2. If I comment out the Screen "Screen2" line in the ServerLayout section (as shown in the excerpt below), I get situation 1, in which the desktop runs on the NVIDIA GPU, uses Optimus to output through the iGPU, and nvfancontrol works. Keeping Screen "Screen2" corresponds to situation 2, in which the desktop runs on the iGPU, but as said above nvfancontrol no longer works.
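
The toggle is just this one line in the ServerLayout section of the xorg.conf above (commented out for situation 1, active for situation 2):

Section "ServerLayout"
    Identifier     "Default Layout"
#   Screen         "Screen2"
    Screen         "Screen0" RightOf "Screen2"
    Screen         "Screen1" RightOf "Screen0"
EndSection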

Also, under situation 2 I tried plugging a monitor into an NVIDIA GPU output port. Now I can see that monitor in NVIDIA X Server Settings (while the one connected to the iGPU, which is visible in situation 1, is still not visible here). nvfancontrol still doesn't work.

So far the main difference seems to be that my shell runs on the NVIDIA GPU in situation 1 (as shown in nvidia-smi, the plasmashell process) and on the iGPU in situation 2.

Any clues about what I can do? I know my case is not very common, so if it's hard to deal with I may write a simple bash script that reads the temperature periodically and adjusts the fan speed, since I can still use nvidia-settings to do this.

Thanks for your reply

It's odd, because it should work. Everything seems in place. Can you try running nvfancontrol -g 0 -m -d to see if we can get any meaningful output? Do you get the current state of the coolers on the first GPU? Please use the standard release, not the debug build from the other comment.

I'm always using the latest release except for that one test. The following output is from situation 1:

> nvfancontrol -g 0 -m -d
INFO - Loading configuration file: "/home/asteray/.config/nvfancontrol.conf"
DEBUG - Curve points: [(25, 20), (35, 30), (45, 40), (55, 60), (65, 80), (75, 100)]
INFO - NVIDIA driver version: 460.73.01
INFO - NVIDIA graphics adapter #0: GeForce GTX 1080 Ti
INFO -   GPU #0 coolers: COOLER-0
INFO - NVIDIA graphics adapter #1: GeForce GTX 1080 Ti
INFO -   GPU #1 coolers: COOLER-1
INFO - Option "-m" is present; curve will have no actual effect
DEBUG - Temp: 38; Speed: [1103] RPM ([23]%); Load: 0%; Mode: Auto

The following output is from situation 2:

> nvfancontrol -g 0 -m -d
INFO - Loading configuration file: "/home/asteray/.config/nvfancontrol.conf"
DEBUG - Curve points: [(25, 20), (35, 30), (45, 40), (55, 60), (65, 80), (75, 100)]
X Error of failed request:  BadMatch (invalid parameter attributes)
  Major opcode of failed request:  156 (NV-CONTROL)
  Minor opcode of failed request:  4 ()
  Serial number of failed request:  18
  Current serial number in output stream:  18

It's almost the same as what the outdated debug build gave (see my first post).

For now I've written a simple script based on nvidia-smi and nvidia-settings that reads the temperature and sets the fan speed every 5 seconds, which should rule out the possibility that my hardware doesn't support fan-speed control under situation 2.

OK, there seems to be something fundamentally wrong, because as you can see we can't even get the driver version number (the "NVIDIA driver version: ..." line doesn't appear at all). According to the XNVCtrl documentation, BadMatch is returned when there's no NVIDIA driver on the specific screen (not display). The only place where we call into a function that needs a screen argument is when querying the driver version, and we only query screen 0. I'm wondering whether increasing the screen number in the XNVCTRLQuery[XXX]Attribute call will solve the problem. If you can recompile the software yourself, can you check whether changing the first zero of the XNVCTRLQueryStringAttribute call to 1 or 2 solves it? If not, I'll put out a test build later today.
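
Roughly, what I have in mind is to probe the X screens for one that the NVIDIA driver actually owns instead of hard-coding screen 0. Here is a minimal, untested sketch against the raw XNVCtrl C API (nvfancontrol itself would do the equivalent through its Rust bindings; the header paths and gcc line assume the Debian-style libxnvctrl-dev packaging):

/* probe.c -- build with something like: gcc probe.c -lXNVCtrl -lX11 */
#include <stdio.h>
#include <X11/Xlib.h>
#include <NVCtrl/NVCtrl.h>
#include <NVCtrl/NVCtrlLib.h>

int main(void) {
    Display *dpy = XOpenDisplay(NULL);   /* honours $DISPLAY */
    if (!dpy) { fprintf(stderr, "cannot open display\n"); return 1; }

    int ev, err;
    if (!XNVCTRLQueryExtension(dpy, &ev, &err)) {
        fprintf(stderr, "NV-CONTROL extension not found\n");
        return 1;
    }

    /* Walk every X screen and only query the ones the NVIDIA driver owns.
       Querying a non-NVIDIA screen (here, the modesetting/Intel one) is
       what triggers the BadMatch error. */
    for (int scr = 0; scr < ScreenCount(dpy); scr++) {
        if (!XNVCTRLIsNvScreen(dpy, scr))
            continue;
        char *version = NULL;
        if (XNVCTRLQueryStringAttribute(dpy, scr, 0,
                NV_CTRL_STRING_NVIDIA_DRIVER_VERSION, &version)) {
            printf("Found NVidia screen %d, driver %s\n", scr, version);
            XFree(version);
            break;
        }
    }

    XCloseDisplay(dpy);
    return 0;
}

XNVCTRLIsNvScreen is the key check here: it reports whether a given X screen is controlled by the NVIDIA driver at all, which is exactly what differs between your two configurations.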

I tried to build it, but installing the library it needs pulls in tons of other things, and I'm afraid it might mess up my current development environment. So it would be great if you could provide the test build. Thanks!

That's alright. I think I know what's wrong. I'll put out a test build later today.

Can you try this one please?

nvfancontrol.tar.gz

Run it initially with nvfancontrol -g 0 -d -m and you should get something along these lines

...
Found NVidia screen 0
INFO - NVIDIA driver version: 465.24.02
...

Try the same with -g 1 for the other GPU. If that looks good, try running nvfancontrol as usual with the -d flag.

Thanks! This build works well under situation 2

The following is the output

> ./nvfancontrol -g 0 -d -m
INFO - Loading configuration file: "/home/asteray/.config/nvfancontrol.conf"
DEBUG - Curve points: [(25, 20), (35, 30), (45, 40), (55, 60), (65, 80), (75, 100)]
Found NVidia screen 1
Found NVidia screen 1
INFO - NVIDIA driver version: 460.73.01
INFO - NVIDIA graphics adapter #0: GeForce GTX 1080 Ti
INFO -   GPU #0 coolers: COOLER-0
INFO - NVIDIA graphics adapter #1: GeForce GTX 1080 Ti
INFO -   GPU #1 coolers: COOLER-1
INFO - Option "-m" is present; curve will have no actual effect
DEBUG - Temp: 28; Speed: [1101] RPM ([23]%); Load: 0%; Mode: Manual
DEBUG - Temp: 28; Speed: [1100] RPM ([23]%); Load: 0%; Mode: Manual

Using the parameter -g 1 gives the same output as above. Is that because both screens have the same name, NVidia screen 1?

Then I tried removing both -d and -m and launched a CUDA program; the fan-speed control worked as desired on both GPUs.

Thanks again for your work; this issue can be closed.

Yay! Good to know it worked. I'll massage the code a bit and put a new release out. Thank you SO much for reporting this.