NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: RM has detected that 7 Seconds without a Vblank Counter Update on head:D0

Question

scaronni opened this issue 3 months ago · comments

Simone Caronni commented 3 months ago

NVIDIA Open GPU Kernel Modules Version

550.78

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Fedora 40

Kernel Release

6.8.7-300.fc40.x86_64

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4070 SUPER

Describe the bug

The kernel log is being spammed with these lines:

[ 6614.717414] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000003 Count 0000f82a
[ 6614.717420] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: RM has detected that 7 Seconds without a Vblank Counter Update on head:D0

After a few iterations of this pair, only the NVRM: krcWatchdogCallbackVblankRecovery_IMPL [...] line keeps repeating.

To Reproduce

Just boot the system with the open kernel modules installed.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

Thomas Luzat · Answer 1 · Tue May 21 2024 22:14:11 GMT+0800 (China Standard Time)

I can confirm that behavior with driver 550.54.15-1 (newest from CUDA repo), Debian unstable, custom-built kernels of at least versions 6.8.9, 6.9.0 and 6.9.1, a GeForce RTX 4090 and 5 displays connected. For me, it happens on head C0.

The displays are 2 DP screens (both running), 1 Valve Index VR headset on DP (not running), 1 HDMI screen (not connected to power), and 1 HDMI TV by LG (turned off or on).

The messages repeat almost exactly every 8.192 seconds and stop at some point (after ~40 minutes this time; I am not sure whether that is consistent).
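To check the repeat interval on another machine, the gaps between consecutive watchdog messages can be computed from saved `dmesg` output. The following is a minimal sketch, not part of the original report: the function name `watchdog_intervals` is made up for illustration, and the third sample line is synthetic (the reported line shifted by 8.192 s) so that the example has two messages to compare.

```python
import re

# Matches the dmesg timestamp prefix followed by the watchdog message
# quoted in this issue.
WATCHDOG_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+NVRM: krcWatchdogCallbackVblankRecovery_IMPL"
)


def watchdog_intervals(lines):
    """Return the gaps, in seconds, between consecutive watchdog messages."""
    stamps = []
    for line in lines:
        match = WATCHDOG_RE.match(line)
        if match:
            stamps.append(float(match.group(1)))
    # Pair each timestamp with its successor and take the difference.
    return [round(later - earlier, 6) for earlier, later in zip(stamps, stamps[1:])]


# The first two lines are copied from the report; the third is SYNTHETIC,
# added only to demonstrate the interval calculation.
sample = [
    "[ 6614.717414] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', "
    "name=<unknown>, Head 00000003 Count 0000f82a",
    "[ 6614.717420] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: "
    "RM has detected that 7 Seconds without a Vblank Counter Update on head:D0",
    "[ 6622.909420] NVRM: krcWatchdogCallbackVblankRecovery_IMPL: NVRM-RC: "
    "RM has detected that 7 Seconds without a Vblank Counter Update on head:D0",
]

intervals = watchdog_intervals(sample)
print(intervals)
```

On a real system the `lines` argument could be the lines of a file saved from `dmesg`; the Xid line is ignored because it does not match the pattern.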

The error only occurs when the LG TV is a) connected and b) not enabled in X. I am not sure whether the message indicates an actual problem, but I would prefer not to have my logs flooded with it.

Milos Tijanic · Answer 2 · Tue May 21 2024 22:52:26 GMT+0800 (China Standard Time)

I believe this should be fixed with 555.42.02. This is the relevant change so you can apply it to 550.xx as well:

diff --git a/src/nvidia/src/kernel/gpu/disp/head/kernel_head.c b/src/nvidia/src/kernel/gpu/disp/head/kernel_head.c
index 50e14fa..5da4a43 100644
--- a/src/nvidia/src/kernel/gpu/disp/head/kernel_head.c
+++ b/src/nvidia/src/kernel/gpu/disp/head/kernel_head.c
@@ -235,7 +235,8 @@ kheadReadVblankIntrState_IMPL
 )
 {
     // Check to make sure that our SW state grooves with the HW state
-    if (kheadReadVblankIntrEnable_HAL(pGpu, pKernelHead))
+    if (kheadReadVblankIntrEnable_HAL(pGpu, pKernelHead) &&
+            kheadGetDisplayInitialized_HAL(pGpu, pKernelHead))
     {
         // HW is enabled, check if SW state is not enabled
         if (pKernelHead->Vblank.IntrState != NV_HEAD_VBLANK_INTR_ENABLED)
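Reading only the hunk above, the change narrows the condition under which the software vblank interrupt state is compared against the hardware state: the comparison now also requires the head's display to be initialized. The sketch below models just that condition as a truth table. It is an illustration, not driver code; the function and parameter names are hypothetical stand-ins for `kheadReadVblankIntrEnable_HAL` and `kheadGetDisplayInitialized_HAL`, and nothing outside the visible hunk is modeled.

```python
from itertools import product


def enters_check(hw_intr_enabled, display_initialized, patched):
    """Model of the `if` condition in the hunk (hypothetical names).

    Unpatched: the branch is taken whenever the HW vblank interrupt is enabled.
    Patched:   the branch additionally requires an initialized display.
    """
    if patched:
        return hw_intr_enabled and display_initialized
    return hw_intr_enabled


# Enumerate all input combinations and compare the two versions.
rows = []
for hw, init in product([False, True], repeat=2):
    rows.append((hw, init, enters_check(hw, init, False), enters_check(hw, init, True)))

for hw, init, before, after in rows:
    print(f"hw_enabled={hw!s:5} display_init={init!s:5} before={before!s:5} after={after}")

# The only combination whose outcome changes.
changed = [(hw, init) for hw, init, before, after in rows if before != after]
```

Under this model the two versions differ in exactly one case: the hardware interrupt is enabled while the display is not initialized.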

Thomas Luzat · Answer 3 · Wed May 22 2024 17:28:54 GMT+0800 (China Standard Time)

> I believe this should be fixed with 555.42.02. This is the relevant change so you can apply it to 550.xx as well:

Thanks! I did not try applying the patch, but upgrading to 555.42.02, which is now packaged, fixes the issue for me.