NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

535.98 causes lockups

Lucretia opened this issue

NVIDIA Open GPU Kernel Modules Version

535.98

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Gentoo Linux

Kernel Release

6.4.9-gentoo-x86_64

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3070 Ti (UUID: GPU-ce7d1b2c-cbfe-df2c-3de0-e688477e7043)

Describe the bug

I get random freezes while using the system, and dmesg outputs this:

[Aug12 10:10] NVRM serverFreeResourceTree: hObject 0xbeef0400 not found for client 0xc1d00c24
[  +0.000031] NVRM serverFreeResourceTree: hObject 0xbeef0401 not found for client 0xc1d00c24
[  +0.000026] NVRM serverFreeResourceTree: hObject 0xbeef0402 not found for client 0xc1d00c24
[  +0.000025] NVRM serverFreeResourceTree: hObject 0xbeef0403 not found for client 0xc1d00c24
[Aug12 10:11] NVRM serverFreeResourceTree: hObject 0xbeef0400 not found for client 0xc1d0008e
[  +0.000032] NVRM serverFreeResourceTree: hObject 0xbeef0401 not found for client 0xc1d0008e
[  +0.000023] NVRM serverFreeResourceTree: hObject 0xbeef0402 not found for client 0xc1d0008e
[  +0.000023] NVRM serverFreeResourceTree: hObject 0xbeef0403 not found for client 0xc1d0008e



[Aug12 11:01] NVRM serverFreeResourceTree: hObject 0xbeef0400 not found for client 0xc1d0008e
[  +0.000142] NVRM serverFreeResourceTree: hObject 0xbeef0401 not found for client 0xc1d0008e
[  +0.000020] NVRM serverFreeResourceTree: hObject 0xbeef0402 not found for client 0xc1d0008e
[  +0.000018] NVRM serverFreeResourceTree: hObject 0xbeef0403 not found for client 0xc1d0008e


[Aug12 11:11] NVRM serverFreeResourceTree: hObject 0xbeef0400 not found for client 0xc1d0008e
[  +0.000034] NVRM serverFreeResourceTree: hObject 0xbeef0401 not found for client 0xc1d0008e
[  +0.000025] NVRM serverFreeResourceTree: hObject 0xbeef0402 not found for client 0xc1d0008e
[  +0.000026] NVRM serverFreeResourceTree: hObject 0xbeef0403 not found for client 0xc1d0008e

I found #272, which shows similar dmesg output.
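For anyone comparing their own logs against the output above, a minimal sketch that groups these messages by client handle may help. It is not part of the original report: the function name `group_by_client` and the regular expression are assumptions based only on the message format shown above.

```python
import re
from collections import defaultdict

# Matches the message body seen in the dmesg output above.
# The timestamp prefix is ignored, so any dmesg timestamp style works.
PATTERN = re.compile(
    r"NVRM serverFreeResourceTree: hObject (0x[0-9a-fA-F]+) "
    r"not found for client (0x[0-9a-fA-F]+)"
)


def group_by_client(lines):
    """Return {client_handle: [hObject, ...]} in order of appearance."""
    groups = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            h_object, client = match.groups()
            groups[client].append(h_object)
    return dict(groups)


# Three lines copied from the report, used as sample input.
sample = [
    "[Aug12 10:10] NVRM serverFreeResourceTree: hObject 0xbeef0400 not found for client 0xc1d00c24",
    "[  +0.000031] NVRM serverFreeResourceTree: hObject 0xbeef0401 not found for client 0xc1d00c24",
    "[Aug12 10:11] NVRM serverFreeResourceTree: hObject 0xbeef0400 not found for client 0xc1d0008e",
]

for client, handles in group_by_client(sample).items():
    # Print each client, how many messages it produced, and the distinct handles.
    print(client, len(handles), sorted(set(handles)))
```

Run against a full log, this makes it easier to see whether the same four handles (0xbeef0400 to 0xbeef0403) recur for every client, as they do in the excerpt above.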

To Reproduce

emerge -av x11-drivers/nvidia-drivers

This installs the latest version. Then reboot, log in, and the freezes appear.

Bug Incidence

Always

nvidia-bug-report.log.gz

I'll have to rebuild and do this later.

More Info

No response

We get the same error with the 545.23.08 open driver from the Ubuntu repo on a 4090 GPU. While the spurious messages are being logged, nvidia-smi and other GPU operations can hang.

We suspect that the open driver and GSP firmware combination is calling cleanup code for server-grade GPUs such as the A100 that does not exist for 4090 workstation cards. We get this error when we exit PyTorch CUDA operations (training or inference).

nvidia-kernel-open-545/unknown,now 545.23.08-0ubuntu
Dec  4 14:47:26 4090 kernel: [ 4919.880611] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008cb
Dec  4 14:47:26 4090 kernel: [ 4919.894364] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008cb
Dec  4 14:47:26 4090 kernel: [ 4919.898332] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008cb
Dec  4 14:48:52 4090 kernel: [ 5005.928372] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.932000] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.934586] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.936210] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.938738] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.940991] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.943227] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.945298] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.947319] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.949761] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.951603] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.953468] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.955246] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.959281] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.961257] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
Dec  4 14:48:52 4090 kernel: [ 5005.962814] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d008e6
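One way to check whether the messages coincide with a workload exiting, as described above, is to compare the kernel log before and after running it. The following is only a sketch under stated assumptions: the helper names `read_kernel_log` and `new_nvrm_lines` are made up for illustration, `dmesg` may require root on systems that restrict the kernel log, and the line-count comparison assumes the kernel ring buffer does not wrap during the run.

```python
import subprocess

# Substring shared by every message in the logs above.
MARKER = "NVRM serverFreeResourceTree"


def read_kernel_log():
    """Return kernel log lines via `dmesg` (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def new_nvrm_lines(run_workload, read_log=read_kernel_log):
    """Run the workload and return matching log lines that appeared during it.

    Assumes the ring buffer does not wrap while the workload runs, so the
    lines after the earlier line count are the new ones.
    """
    before = len(read_log())
    run_workload()
    after = read_log()
    return [line for line in after[before:] if MARKER in line]


# Hypothetical usage; `train.py` stands in for any CUDA workload:
# lines = new_nvrm_lines(lambda: subprocess.run(["python", "train.py"]))
# print(len(lines), "new NVRM messages")
```

The log reader is passed in as a parameter so the comparison logic can be exercised without access to the kernel log.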