NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Reduced performance and visual bugs since 545.29.06

proJM-Coding opened this issue · comments

NVIDIA Open GPU Kernel Modules Version

545.29.06-3

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Description: Arch Linux

Kernel Release

Linux archlinux 6.7.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 05 Feb 2024 22:07:49 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce GTX 1660 SUPER (UUID: GPU-7561320d-6c18-1448-1ced-be2ae065c34a)

Describe the bug

Hello! Since 545.29.06 I have been having visual bugs in counter strike 2 and sometimes other games, so for the time being I have gone back to the old driver (535.113.01). I have also noticed worse performance since then: I used to average around 170 fps in games like counter strike and the finals, but now I'm at around 120 fps. The visual bugs can be seen in the video, where the HUD flickers. I have also noticed that occasionally a frame from quite some time back gets displayed. Mangohud also flickers. None of these problems existed in 535.113.01. Please see my video and neofetch output for more info.

Driver.bug.mp4

Neofetch:

projm@archlinux
OS: Arch Linux
CPU: AMD Ryzen 5 5600X (12) @ 3.700GHz
GPU: NVIDIA GeForce GTX 1660 SUPER
Memory: 6173MiB / 15915MiB
WM: Hyprland
Terminal: kitty
Uptime: 27 mins
Packages: 1084 (pacman)
Kernel: 6.7.4-arch1-1
Resolution: 1024x768
Theme: Old-Gold-GTK-3 [GTK2/3]
Icons: old-gold-icons [GTK2/3]

To Reproduce

  1. Open counter strike 2 with mangohud (mangohud %command%; note the bug is still there without mangohud).
  2. Notice the menu bugging out straight away.
  3. Start a match and notice the HUD bugging out.
  4. Once the round is complete, hold Tab to see more of the bug.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

It looks like the bug still exists in the proprietary driver but is amplified in the open package. I have noticed that the visual bugs have been reduced since 550.54.14. Reduced performance is still a problem, though. To make this clear: 535.113.01 had better performance than both 545.29.06 and 550.54.14.

Hey there, thanks for the report! And sorry for the late response! You wrote:

(mangohud %command%, please note the bug is still there without mangohud)

Can you please clarify if this applies to the performance issue as well as the visual bugs? We're looking into a possible performance regression with tools that collect GPU stats, which could have started with 545.xx. The nvidia-bug-report.log doesn't show whether csgo was started with mangohud or not.

Also, do you know if the performance issue exists in other games as well?

I only have a few games that struggle on my GPU. Counter strike and the finals both experienced this performance drop, so it's not a game update that caused it. Both games went from around 170 fps to 120 fps (I have capped my fps to match my 170Hz monitor, but it's not vsync). The only other one is no man's sky; I'll test that later. The visual bugs and performance problems are still there even if mangohud isn't used.

Also consider that it could be an issue with the user-space drivers, not the kernel drivers. Make sure you report this to NVIDIA directly and not just through here.

@proJM-Coding would you indulge my hunch and try setting __NVML_USE_RUSD=0 globally (e.g. in /etc/environment)? Then reboot the machine, and see if there's any change to the performance?

Be sure to clear the entry afterwards, even if it seems to fix the perf - it's not really a tested configuration and is liable to cause much longer hangs in some scenarios. But, if it ends up restoring old performance, then we have a pretty good idea of the root cause and know how to proceed.

EDIT: Or, alternatively/better-yet, try this patch in the kernel driver:

diff --git a/src/nvidia/src/kernel/gpu/gpu_user_shared_data.c b/src/nvidia/src/kernel/gpu/gpu_user_shared_data.c
index e2929fe..e0c2900 100644
--- a/src/nvidia/src/kernel/gpu/gpu_user_shared_data.c
+++ b/src/nvidia/src/kernel/gpu/gpu_user_shared_data.c
@@ -60,7 +60,7 @@ gpushareddataConstruct_IMPL
     // RUSD polling temporarily disabled on non-GSP due to collisions with VSYNC interrupt
     // on high refresh rate monitors. See Bug 4432698.
     //
-    if (!IS_GSP_CLIENT(pGpu) && (pAllocParams->polledDataMask != 0U))
+    if (pAllocParams->polledDataMask != 0U)
         return NV_ERR_NOT_SUPPORTED;
 
     if (RS_IS_COPY_CTOR(pParams))

Setting __NVML_USE_RUSD to 0 fixed the performance issues for me on my 4070 ti super. It's a bad benchmark, but before I would get around 8000 fps in glxgears at 100% usage; now I get around 17000 fps. I also get much higher fps in games (around double); however, I still have occasional stuttering and graphical glitches in VR, although that may be a bug in SteamVR.

Thanks for the datapoint @gamingdoom . Could you please run nvidia-bug-report.sh and attach? This setting should only have an effect if there is a long running monitoring process, such as mangohud - are you using anything like that?

I found no performance difference in counter strike. In the finals there was a performance difference of about 5 fps, but I don't think that is related to the setting. Very interesting how @gamingdoom gets way more fps, though.

I found no performance difference in counter strike.

@proJM-Coding my apologies, I didn't realize at the time, but that toggle does nothing on the 1660.

However, I'm still not convinced that your issue is unrelated; the release timings fit the hypothesis exactly. This could be related to the Xorg process I see in your original nvidia-bug-report.log. Could you perhaps try adding option "UseRUSDMapping" "False" to the xorg.conf file under Section "Screen" (see the example below)? If you don't have such a file, sudo nvidia-xconfig should create one.
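For reference, this is roughly where the option would go; a minimal sketch only, with placeholder identifiers ("Screen0", "Device0", "Monitor0") like the ones a default nvidia-xconfig run produces, so adjust it to match your actual xorg.conf:

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    Option         "UseRUSDMapping" "False"
EndSection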

Thanks!

Thanks for the datapoint @gamingdoom . Could you please run nvidia-bug-report.sh and attach? This setting should only have an effect if there is a long running monitoring process, such as mangohud - are you using anything like that?

@mtijanic I was using mangohud in all the games I tested; however, I wasn't using it for glxgears (although glxgears prints fps to stdout). I removed the variable to test again without it, and the issue was gone; however, when I ran glxgears with mangohud I was getting ~7-8k fps instead of the normal ~16-17k fps on my 4070 ti super (144Hz display). On my 2060 super this bug also seems to exist: I get ~9k fps instead of ~30k (60Hz display). I think maybe I misread the output of glxgears or had something else running the first time I tested glxgears in my previous post. Setting the __NVML_USE_RUSD variable fixes the fps reduction with mangohud.

However, I had recently upgraded from a 2060 super to a 4070 ti super and was a little suspicious of the performance, as it felt like it was actually worse. The 4070 ti super only seems to work properly on >= 550 drivers, so I used those. I did some tests on the 2060 super and the 4070 ti super, and I think there is another bug that causes the 4070 ti super to perform worse than the 2060 super. On the 2060 super I get around ~30k fps in glxgears, whereas I get ~17k fps on the 4070 ti super. With cs2 at max settings (no mangohud, using the in-game fps counter), I got around 150 fps on the 2060 super (which feels like a lot less than what I got before I upgraded and switched to 545+ drivers) and ~110 fps on the 4070 ti super. There is definitely something wrong, as the 4070 ti super was at 100% usage and the 4070 ti super computer is also a little better in terms of CPU power. These issues happen on both the proprietary and open source kernel modules. dmesg doesn't seem to show anything. The 4070 ti super is also clocking up to higher frequencies. I have reported this issue to linux-bugs@nvidia.com too.

Here are the outputs of nvidia-bug-report.sh from the 2060 super and 4070ti super computers:
nvidia-bug-report_4070_ti_super.log
nvidia-bug-report_2060_super.log

Computer Specs:
4070 ti super:
Distro: Arch Linux
Kernel: 6.8.0-arch1-1
CPU: Ryzen 7 5800X
GPU: NVIDIA GeForce RTX 4070 Ti SUPER
RAM: 48 GB

2060 super:
Distro: Arch Linux
Kernel: 6.6.21-1-lts
CPU: Ryzen 7 5700X
GPU: NVIDIA GeForce RTX 2060 SUPER
RAM: 64 GB

Thank you @gamingdoom , that's a lot of very useful info. It'll take some time to digest all that and figure out the next steps; will also try harder to reproduce your issues.

Anyway, I owe you a bit of explanation about all this: Programs like mangohud use libnvidia-ml.so (NVML) or similar nvidia libs to query information about the GPU (utilization %, thermals, power draw, etc). This information generally lives on GSP, PMU, etc, and needs to be extracted. This is usually no big deal - it's fast enough, and the game doesn't need to talk to those microcontrollers much - but sometimes if you are just unlucky enough you might hit a case where the game needs something from there at the same time and needs to wait on mangohud to finish. This can cause a spike in frame time (microstutter).
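To illustrate the kind of polling an overlay does, here is a minimal sketch against the public NVML C API (this is not mangohud's actual code; error handling is trimmed, and it assumes the NVML headers are installed and you link with -lnvidia-ml):

#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlUtilization_t util;
    unsigned int tempC, powerMw;

    if (nvmlInit_v2() != NVML_SUCCESS ||
        nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS)
        return 1;

    /* Poll a few GPU stats once per second, the way a monitoring overlay would. */
    for (int i = 0; i < 10; i++) {
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS &&
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC) == NVML_SUCCESS &&
            nvmlDeviceGetPowerUsage(dev, &powerMw) == NVML_SUCCESS)
            printf("util=%u%% temp=%uC power=%.1fW\n",
                   util.gpu, tempC, powerMw / 1000.0);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}

Before RUSD, each of those queries ultimately had to be serviced by the GPU's microcontrollers and could therefore collide with the game's own requests.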

With 545.xx we exposed this information through a different interface (RUSD) that doesn't have the same issue, but since it's a new feature it's totally plausible that it is causing issues in a different manner. This is what I'm trying to narrow down - is it RUSD related at all. Your report certainly suggests so.

__NVML_USE_RUSD=0 will restore the old behavior for NVML, but this is not a configuration we are actively testing anymore, so I can't recommend it for actual usage. Also, for reasons, it is currently only used on Ampere+ (RTX 3000 and later), so on your 2060 and OP's 1660 this toggle does nothing and it's always zero.

Now, the thing about RUSD is that it's global - if any program uses it, it's active. Outside of NVML, Xorg can also use it, unless option "UseRUSDMapping" "False" is set. This is true on older GPUs as well. So, I have another favor to ask here, just so we can know for sure whether it is active or not:

Set NVreg_RmMsg="rmapi" to enable more verbose prints about related things, then post dmesg output (or nvidia-bug-report.log), for both systems.

And lastly, can you clarify here:

These issues happen on both the proprietary and open source kernel modules.

Whether the proprietary systems are using GSP offload or not (you can check with nvidia-smi -q | grep GSP; if it says N/A, then it's not offloaded)? And exactly which version of the proprietary driver did you test?

--

Again, thank you so much for helping us track this down! It's unfortunately not a thing that's easily reproduced (or we'd have caught it in automated testing and/or QA), so your help here is invaluable.

Hello @mtijanic, here are the results from my systems:

Set NVreg_RmMsg="rmapi" to enable more verbose prints about related things, then post dmesg output (or nvidia-bug-report.log), for both systems.

nvidia-bug-report.log and journalctl -k -r -b 0 outputs (dmesg didn't capture everything) for both systems:
2060 super:
journalctl_2060_super.log
nvidia-bug-report_2060_super.log

4070ti super:
journalctl_4070ti_super.log
nvidia-bug-report_4070ti_super.log

If the proprietary systems are using GSP offload or not (can check with nvidia-smi -q | grep GSP, if it says N/A, then it's not offloaded)?

On the 4070 ti super system with the proprietary kernel modules, I get
GSP Firmware Version : N/A
On the 2060 super system with the proprietary kernel modules, I also get
GSP Firmware Version : N/A

I don't know if it should, but /lib/firmware/nvidia/550.54.14 only has gsp_tu10x.bin and gsp_ga10x.bin and not gsp_ad10x.bin, and there are ad103 and tu106 directories in /lib/firmware/nvidia. However, inside the gsp subdirectories of ad103 and tu106, the file names reference driver version 535.113.01. The symbol names in the gsp-535.113.01.bin files (found using objdump -t gsp-535.113.01.bin) also contain rel_gpu_drv_r535_r537_41, whereas the gsp_tu10x.bin and gsp_ga10x.bin symbol names contain rel_gpu_drv_r550_r551_40. Again, I am not sure if this is how things should be, and with proprietary driver 535.113.01 the 2060 super still reported N/A for the GSP firmware version.

And exactly which version of the proprietary driver you tested?

Both systems were using proprietary version 550.54.14 and open source version 550.54.14.

Edit:
I just went to play a game and the performance doubling was gone.

@gamingdoom Thanks for these! Just to confirm, the logs are with __NVML_USE_RUSD=0 (at least on the 4070)?

@mtijanic I don't remember, so here is a log from the 4070 ti super using the env variable and the module option options nvidia NVreg_RmMsg="rmapi":
nvidia-bug-report.log.gz

I'll get to trying the xorg.conf file soon. You talked about microstutters, but you can see in my video that the HUD is flickering in and out. This is not a microstutter but a visual bug. One other visual bug I have found hard to capture on video is random old frames being displayed. Say I'm holding a corner in counter strike; suddenly I'll get a frame from the start of the round, then back to the corner. This is what I was talking about with visual bugs, along with the other things I have mentioned.

@mtijanic After some more tests, it seems like sometimes the performance is better on the 4070 ti super (Arma Reforger), but usually it is worse than the 2060 super (Counter-Strike 2, Euro Truck Simulator 2 VR, glxgears, vkcube, glmark2). I think it may also not actually be clocking up to the max: the highest graphics clock I have seen is 2790MHz, but the maximum is 3105MHz according to NVIDIA X Server Settings. The highest memory clock I have seen is 10501MHz, but the maximum is 21002MHz according to NVIDIA X Server Settings. Because the clocks never max out, the card draws only 200W at the highest out of a maximum of 285W. However, according to nvidia-smi -q -d PERFORMANCE, all "Clocks Event Reasons" are "Not Active". The performance can also be very inconsistent between restarts; sometimes it's good and sometimes it's bad, with no changes in between.
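If it helps to log this instead of eyeballing the settings tool, here is a minimal NVML sketch (assumptions: NVML headers installed, link with -lnvidia-ml; note that the reported maximum clock may differ from the boost clock NVIDIA X Server Settings shows):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int gfx, gfxMax, mem, memMax, powerMw;

    if (nvmlInit_v2() != NVML_SUCCESS ||
        nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS)
        return 1;

    /* Current vs. maximum graphics and memory clocks, in MHz. */
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_GRAPHICS, &gfx);
    nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_GRAPHICS, &gfxMax);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem);
    nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_MEM, &memMax);

    /* Power draw, reported in milliwatts. */
    nvmlDeviceGetPowerUsage(dev, &powerMw);

    printf("graphics clock: %u / %u MHz\n", gfx, gfxMax);
    printf("memory clock:   %u / %u MHz\n", mem, memMax);
    printf("power draw:     %.1f W\n", powerMw / 1000.0);

    nvmlShutdown();
    return 0;
}

Running this while the game is at 100% usage would show whether the graphics clock really stays well below its maximum even though no clock event reasons are active.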

I have been using that xorg config for a while and the problem still exists and performance is still not restored. I didn't show my envs before so here they are:

env = XCURSOR_SIZE,24
env = XDG_CURRENT_DESKTOP,Hyprland
env = XDG_SESSION_TYPE,wayland
env = XDG_SESSION_DESKTOP,Hyprland

env = QT_QPA_PLATFORMTHEME,qt5ct

env = WLR_NO_HARDWARE_CURSORS,1
env = LIBVA_DRIVER_NAME,nvidia

@mtijanic There seems to be a major issue with clock speeds. The GPU only clocks to ~1500MHz when it should be clocking to ~3000MHz while playing a GPU-intensive game like Cyberpunk 2077. I should be getting around 60 fps but I only got around 30-40 fps. However, after restarting the game and my computer a few times (without changing anything), I was actually able to get a little under 60 fps since the GPU clocked to ~2700MHz, but still not ~3000MHz, so I'm not sure what's happening.

I have done a clean install of Artix Linux with OpenRC and I still have the problem. Slime Rancher and SuperTux also have the weird flickering; I'll try to get footage.

@mtijanic Hi there. I think I'm having this problem as well. I switched over to the open driver to get a head start, since it's going to be the default in the upcoming 560 release. I'm definitely seeing a massive performance difference compared to the proprietary version.

When running nvidia-smi -q -d PERFORMANCE I see that Idle state is active. I have a shader wallpaper plugin which should not be putting the GPU into idle state. At "idle" my power usage is sitting around 40 to 50 W. It should be higher than that from running the shader in the background.

Next step was trying out a game. Dark Souls: Remastered with various ReShade shaders like global illumination normally takes it from about 40% total load to around 70%, and 190 to 220 W depending. With the open driver, I'm getting about half the FPS and power usage is sitting at 140 W.

The kicker is, I can "fix" it by going into the KWin settings and changing the display refresh rate to 30 Hz (which looks more like 15 Hz), then back to 60 Hz, and it starts working as normal. Power usage will max out under load and the frame rate goes back to normal.

The other weird thing is I use a dual monitor setup. My main HDMI display is the one that appears to be running at half refresh rate, but my aux DP display looks fine. Moving windows on the HDMI display looks terrible, but moving windows on my DP display looks fine.

I don't think this is a KWin issue, because the only thing that changed was going from the closed GPU driver to the open GPU driver.

P.S. I tried your patch and also exporting the environment variable. No change. Going back to the closed GPU driver, everything is normal.

System info:
Operating System: Gentoo Linux 2.15
KDE Plasma Version: 6.0.4
KDE Frameworks Version: 6.1.0
Qt Version: 6.7.0
Kernel Version: 6.8.8-clang (64-bit)
Graphics Platform: Wayland
Processors: 16 × AMD Ryzen 9 3950X 16-Core Processor
Memory: 31.3 GiB of RAM
Graphics Processor: NVIDIA GeForce RTX 3070/PCIe/SSE2
Product Name: X570S PG Riptide

@proJM-Coding @gamingdoom can you guys try manually changing your monitor's refresh rates in whatever window manager you're using?

To what exactly? 60hz? My monitor uses 170hz but I'll try to see if 60hz helps.

To what exactly? 60hz? My monitor uses 170hz but I'll try to see if 60hz helps.

Anything, just to reset it. If you're on 170, check nvidia-smi for your clock and power draw, then try 120 then back to 170. Then check nvidia-smi again to see if it uses the full power and clock speeds of the GPU.