NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

cuInit breaks kernel logging with CC-on

derpsteb opened this issue · comments

NVIDIA Open GPU Kernel Modules Version

535.129.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.2 LTS

Kernel Release

Linux guest 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA H100 PCIe (UUID: GPU-6ba7987d-72f9-8f1b-66e8-ddba68758dd3)

Describe the bug

Seemingly all kernel logs go missing after booting into an AMD SEV-SNP VM with an H100 in CC mode attached. I see a lot of NVRM debug logs before the login prompt appears on my serial console, including logs that I added myself. But after I ssh into the machine, no further logs are created, neither by the driver nor by running echo "foo" > /dev/kmsg.

The behavior occurs only when the VM is started as an AMD SEV-SNP VM, not when it is started as a normal VM. Nothing else on the disk is changed when switching the VM type.

To Reproduce

EDIT: the most minimal reproduction I could come up with is just executing cuInit. Example can be found here.

In an AMD SEV-SNP VM:

  • modify /etc/sysctl.conf to contain kernel.printk = 7 4 1 7
  • modify /etc/modprobe.d/nvidia-lkca.conf to contain install nvidia /sbin/modprobe ecdsa_generic ecdh; /sbin/modprobe --ignore-install nvidia NVreg_RmMsg=":"
  • reboot
  • git clone git@github.com:NVIDIA/open-gpu-kernel-modules.git
  • cd open-gpu-kernel-modules
  • git checkout 535.129.03
  • make modules -j$(nproc) NV_VERBOSE=1 DEBUG=1
  • sudo make modules_install -j$(nproc)
  • run echo "foo" > /dev/kmsg

There is a lot of output from the kernel module during boot.
There is no output after booting is finished, including "foo" not being printed.
Running nvidia-smi would also trigger multiple prints in the normal VM. It does not in the confidential VM.
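
As a sanity check on the kernel.printk setting from the steps above, the four values can be read back from /proc/sys/kernel/printk. The following is a minimal sketch; the helper names (`PrintkLevels`, `parse_printk`, `read_printk`) are made up for illustration, and the field names follow the kernel's sysctl documentation. Note that these levels only control what reaches the console, so they do not by themselves explain messages missing from the ring buffer.

```python
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    """The four values of the kernel.printk sysctl, in file order."""
    console_loglevel: int
    default_message_loglevel: int
    minimum_console_loglevel: int
    default_console_loglevel: int


def parse_printk(text: str) -> PrintkLevels:
    """Parse the whitespace-separated content of /proc/sys/kernel/printk."""
    fields = text.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {text!r}")
    return PrintkLevels(*(int(field) for field in fields))


def read_printk(path: str = "/proc/sys/kernel/printk") -> PrintkLevels:
    """Read the current kernel.printk values from procfs (Linux only)."""
    with open(path) as handle:
        return parse_printk(handle.read())
```

With the configuration from the steps above, `read_printk()` would be expected to report a `console_loglevel` of 7 after the reboot.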

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

I expected to see the same logging behavior in a confidential VM as I am seeing in a normal VM.

I can't confirm that this does not happen with the proprietary driver, because the NVIDIA deployment guide for confidential computing explicitly says to use the open-source driver, not the proprietary one. I also wouldn't know how to change the logging behavior of the proprietary driver.

I am happy to try any debugging steps you can suggest.

I found out that the issue only starts once I connect to the GPU for the first time. So, assuming nvidia-persistenced is not running, I can reboot the VM and kernel logs work as expected. After I run nvidia-smi for the first time, the logs are broken.
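
To make the "works before the first GPU access, broken afterwards" observation repeatable, a probe can write a unique marker to /dev/kmsg and check whether it shows up in the kernel log. This is only a sketch: the function names are hypothetical, writing to /dev/kmsg normally requires root, and reading the log via `dmesg` may also need privileges depending on the kernel.dmesg_restrict setting.

```python
import subprocess
import time
import uuid


def read_dmesg() -> str:
    """Return the current kernel ring buffer as text, using the dmesg tool."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout


def probe_kernel_log(kmsg_path="/dev/kmsg", read_log=read_dmesg, settle_seconds=0.5):
    """Write a unique marker to kmsg_path and report whether read_log() sees it.

    kmsg_path and read_log are parameters so the probe can be tried against
    an ordinary file without root privileges.
    """
    marker = f"kmsg-probe-{uuid.uuid4().hex}"
    with open(kmsg_path, "w") as kmsg:
        kmsg.write(marker + "\n")
    # Give the logging path a moment before looking for the marker.
    time.sleep(settle_seconds)
    return marker in read_log()
```

Running `probe_kernel_log()` as root once after a fresh boot and once after the first nvidia-smi call should show whether the marker stops arriving, which is the same check as echo "foo" > /dev/kmsg followed by a look at dmesg.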

According to https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html:

Confidential Compute applications will not work on this release. Please continue to use 535.104.05 for this use case.

Could you try again using 535.104.05?

I tried. The problem persists. :/
I also changed the kernel to exactly match what the docs require here, so I am now running 6.2.0-26 instead of 6.2.0-37.

Please check:

  1. Does the disk for the VM have enough space?
  2. Could you build the nvidia driver without NV_VERBOSE=1 DEBUG=1?

  1. Yes.
  2. The problem also exists with an unmodified driver from https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run

I can reproduce the error by running a minimal program that just executes cuInit.
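
For readers without the linked example, a reproduction of this kind can be sketched as follows. This is not the original program, just an illustration that loads the CUDA driver library through ctypes and calls cuInit(0); the helper name `call_cuinit` and the default library name `libcuda.so.1` are assumptions about a typical Linux installation.

```python
import ctypes


def call_cuinit(library_name="libcuda.so.1"):
    """Call cuInit(0) from the CUDA driver library.

    Returns the CUresult code as an int (0 is CUDA_SUCCESS), or None if the
    driver library cannot be loaded on this machine.
    """
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError:
        return None
    # CUresult cuInit(unsigned int Flags); Flags must currently be 0.
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)
```

Calling `call_cuinit()` once inside the confidential VM and then writing to /dev/kmsg should be enough to see whether logging stops.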

Can you find the broken log under /var/log/?

Also check nvidia-smi conf-compute -f and nvidia-smi conf-compute -gc.

By the way, does the CUDA application fail, or is only dmesg broken?

Okay. It seems like using NVreg_RmMsg=":" as described here overwhelms the kernel logging subsystem when CC mode is on.

When the module is configured to print only warnings, the logs continue to work as expected, even with CC on. I guess something in the CC-only code paths produces more log output than the kernel can handle.

To answer your questions: the CUDA application continues to work. Both smi commands print the expected output. There are no logs in /var/log/kernel if the logging is overwhelmed.