NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

eGPU kernel modules failure - Chipset Setup Function Error!

KernelPryanic opened this issue

NVIDIA Open GPU Kernel Modules Version

535.113.01

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Fedora release 38 (Thirty Eight)

Kernel Release

Linux fedora 6.5.8-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 20 15:53:48 UTC 2023 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA Corporation GA104 [GeForce RTX 3070 Ti] (rev a1)

Describe the bug

I'm trying to use the open version of the NVIDIA driver because of the RmInitAdapter failed! issue with the proprietary one, but I'm getting errors from the kernel. In the attached log, they appear around line 7000.

Oct 31 17:53:41 fedora kernel: NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
Oct 31 17:53:44 fedora kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:52:00.0 on minor 1
Oct 31 17:53:44 fedora systemd[1]: nvidia-fallback.service - Fallback to nouveau as nvidia did not load was skipped because of an unmet condition check (ConditionPathExists=!/sys/module/nvidia).
Oct 31 17:54:06 fedora kernel: NVRM unixCallVideoBIOS: int10h(4f02, 0000) vesa call failed! (4f02, 0000)
Oct 31 17:54:06 fedora kernel: NVRM nvCheckOkFailedNoLog: Check failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from pRmApi->Control(pRmApi, nv->rmapi.hClient, nv->rmapi.hSubDevice, NV2080_CTRL_CMD_INTERNAL_DISPLAY_POST_RESTORE, &restoreParams, sizeof(restoreParams)) @ unix_console.c:197

Laptop specs:

  • Manufacturer: LENOVO
  • Product Name: 21CBCTO1WW
  • Version: ThinkPad X1 Carbon Gen 10
  • BIOS version: N3AET77W (1.42)

GRUB params:

GRUB_CMDLINE_LINUX="resume=/dev/mapper/vg--main-swap rd.luks.uuid=luks-dbbd65e4-65f3-4956-85f0-8d9e919e733c rd.lvm.lv=vg-main/root rd.lvm.lv=vg-main/swap rhgb quiet nvidia.NVreg_OpenRmEnableUnsupportedGpus=1 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1"
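
These parameters only matter if they actually reach the running kernel, so it can be worth checking /proc/cmdline after boot. The following is a minimal sketch in C, not part of the driver: the helper names cmdline_has_param and file_cmdline_has_param are hypothetical, and the matching is a plain whole-token comparison that does not handle quoted values containing spaces.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `param` appears as a whole whitespace-delimited token in
 * `cmdline`, 0 otherwise. Matching is exact, so "nvidia-drm.modeset"
 * does not match the token "nvidia-drm.modeset=1".
 */
int cmdline_has_param(const char *cmdline, const char *param)
{
    size_t plen = strlen(param);
    const char *p = cmdline;

    if (plen == 0)
        return 0;

    while (*p != '\0') {
        size_t tlen;

        /* Skip the separators in front of the next token. */
        p += strspn(p, " \t\n");
        /* Measure the token that starts at p. */
        tlen = strcspn(p, " \t\n");
        if (tlen == plen && strncmp(p, param, plen) == 0)
            return 1;
        p += tlen;
    }
    return 0;
}

/*
 * Read a command-line file (normally /proc/cmdline) and look for `param`.
 * Returns 1 or 0 like cmdline_has_param, or -1 if the file cannot be opened.
 */
int file_cmdline_has_param(const char *path, const char *param)
{
    char buf[4096];
    size_t n;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return cmdline_has_param(buf, param);
}
```

For example, file_cmdline_has_param("/proc/cmdline", "nvidia.NVreg_OpenRmEnableUnsupportedGpus=1") should return 1 on a system booted with the GRUB line above.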

Related Nvidia thread: https://forums.developer.nvidia.com/t/driver-cant-detect-egpu/271201

To Reproduce

Boot the latest Fedora kernel with the open kernel modules installed and the eGPU attached.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report-535-open.log.gz

More Info

The eGPU works on Windows and with the nouveau driver.

"Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver."

According to the forum you linked, this problem also occurs with the proprietary driver, correct?

@ttabi I cannot confirm that this issue also occurs with the proprietary driver, because the proprietary driver fails with a different error: RmInitAdapter failed! (0x26:0x56:1482).

I'm running into a similar problem on my eGPU setup and tried to debug it a little.

NVRM objClInitPcieChipset: *** Chipset Setup Function Error!

This message is not fatal at all, so the real problem is:

NVRM unixCallVideoBIOS: int10h(4f02, 0000) vesa call failed! (4f02, 0000)
Oct 31 17:54:06 fedora kernel: NVRM nvCheckOkFailedNoLog: Check failed: Failure: Generic Error [NV_ERR_GENERIC] (0x0000FFFF) returned from pRmApi->Control(pRmApi, nv->rmapi.hClient, nv->rmapi.hSubDevice, NV2080_CTRL_CMD_INTERNAL_DISPLAY_POST_RESTORE, &restoreParams, sizeof(restoreParams)) @ unix_console.c:197

This means the emulated x86 call into the VGA BIOS failed.
I enabled #define IO_LOG(port, val) in vbioscall.c, and it seems the call is trying to access legacy VGA I/O and memory resources, which are broken on my eGPU platform.

My workaround is simply to comment out the primary_vga detection logic:

diff --git a/src/nvidia/arch/nvalloc/unix/src/dynamic-power.c b/src/nvidia/arch/nvalloc/unix/src/dynamic-power.c
index 934bff1..c6e97c2 100644
--- a/src/nvidia/arch/nvalloc/unix/src/dynamic-power.c
+++ b/src/nvidia/arch/nvalloc/unix/src/dynamic-power.c
@@ -951,11 +951,13 @@ void NV_API_CALL rm_init_dynamic_power_management(
     // Legacy case: check if device is primary and driven by VBIOS or fb driver.
     nv->primary_vga = NV_FALSE;
 
+#if 0
     //
     // Below function always return NV_OK and depends upon kernel flags
     // IORESOURCE_ROM_SHADOW & PCI_ROM_RESOURCE for Primary VGA detection.
     //
     nv_set_primary_vga_status(nv);
+#endif
 
     // UEFI case: where console is driven by GOP driver.
     bUefiConsole = rm_get_uefi_console_status(nv);
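
The comment in the diff says primary VGA detection depends on the kernel flags IORESOURCE_ROM_SHADOW and PCI_ROM_RESOURCE. The kernel exposes a related value from userspace through the boot_vga sysfs attribute of VGA-class PCI devices, so one way to see how the kernel classifies the eGPU is to read that file. The sketch below is an illustration only: the helper names boot_vga_path and read_boot_vga are hypothetical, and it is an assumption, not verified in this thread, that boot_vga matches what nv_set_primary_vga_status() ends up storing in nv->primary_vga.

```c
#include <stdio.h>

/*
 * Build the sysfs path of the boot_vga attribute for a PCI device given
 * its address in domain:bus:device.function form, e.g. "0000:52:00.0".
 * Returns 0 on success, -1 if the output buffer is too small.
 */
int boot_vga_path(char *out, size_t outlen, const char *bdf)
{
    int n = snprintf(out, outlen, "/sys/bus/pci/devices/%s/boot_vga", bdf);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Read a boot_vga attribute file.
 * Returns 1 if the kernel marks the device as the boot VGA device,
 * 0 if it does not, and -1 if the file is missing or has unexpected content.
 */
int read_boot_vga(const char *path)
{
    int c;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    c = fgetc(f);
    fclose(f);

    if (c == '1')
        return 1;
    if (c == '0')
        return 0;
    return -1;
}
```

With the address from the log above, boot_vga_path(buf, sizeof(buf), "0000:52:00.0") followed by read_boot_vga(buf) reports the kernel's view of the eGPU. A result of 1 for an eGPU would be worth including in a bug report, since the failing int10h call is part of the legacy primary-VGA path.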