cuInit breaks kernel logging with CC-on

Question

derpsteb opened this issue 6 months ago · comments

Otto Bittner commented 6 months ago

NVIDIA Open GPU Kernel Modules Version

535.129.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.2 LTS

Kernel Release

Linux guest 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

I am running on a stable kernel release.

Hardware: GPU

NVIDIA H100 PCIe (UUID: GPU-6ba7987d-72f9-8f1b-66e8-ddba68758dd3)

Describe the bug

I am missing seemingly all kernel logs after booting into an AMD SEV-SNP VM with an H100 in CC mode attached. Specifically, I see a lot of NVRM debug logs before the login prompt appears on my serial console, including logs that I added myself. But after I SSH into the machine, no further logs appear, neither from the driver nor from running echo "foo" > /dev/kmsg.

The behavior occurs only when the VM is started as an AMD SEV-SNP VM, not when it is started as a normal VM. Nothing else on the disk is changed when switching between VM types.

To Reproduce

EDIT: The most minimal reproduction I could come up with is just executing cuInit. An example can be found here.

In an AMD SEV-SNP VM:

modify /etc/sysctl.conf to contain kernel.printk = 7 4 1 7
modify /etc/modprobe.d/nvidia-lkca.conf to contain install nvidia /sbin/modprobe ecdsa_generic ecdh; /sbin/modprobe --ignore-install nvidia NVreg_RmMsg=":"
reboot
git clone git@github.com:NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
git checkout 535.129.03
make modules -j$(nproc) NV_VERBOSE=1 DEBUG=1
sudo make modules_install -j$(nproc)
run echo "foo" > /dev/kmsg

There is a lot of output from the kernel module during boot.
There is no output after booting is finished, including "foo" not being printed.
Running nvidia-smi would also trigger multiple prints in the normal VM. It does not in the confidential VM.
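
For reference, the four numbers in the kernel.printk setting from the first step are, in order, console_loglevel, default_message_loglevel, minimum_console_loglevel and default_console_loglevel. The sketch below is only an illustration of how to read them back from /proc/sys/kernel/printk; the names PrintkLevels, parse_printk, read_printk and console_shows are made up for this example and are not part of any NVIDIA or kernel tooling.

```python
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    """The four fields of kernel.printk, in the order the kernel reports them."""
    console_loglevel: int
    default_message_loglevel: int
    minimum_console_loglevel: int
    default_console_loglevel: int


def parse_printk(text: str) -> PrintkLevels:
    """Parse a string such as '7 4 1 7' (whitespace- or tab-separated)."""
    fields = text.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {text!r}")
    return PrintkLevels(*(int(f) for f in fields))


def read_printk(path: str = "/proc/sys/kernel/printk") -> PrintkLevels:
    """Read the live setting from procfs (Linux only)."""
    with open(path) as fh:
        return parse_printk(fh.read())


def console_shows(levels: PrintkLevels, msg_level: int) -> bool:
    """A message reaches the console only if its level is numerically
    lower than console_loglevel (0 = emergency ... 7 = debug)."""
    return msg_level < levels.console_loglevel
```

Calling read_printk() after the reboot is one way to confirm that the sysctl.conf change was actually applied. Note that this only governs what reaches the console; it says nothing about whether messages are recorded at all.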

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

I expected to see the same logging behavior in a confidential VM as I am seeing in a normal VM.

I can't confirm that this does not happen with the proprietary driver, because the NVIDIA deployment guide for confidential computing explicitly says to use the OSS driver, not the proprietary one. I also wouldn't know how to change the logging behavior of the proprietary driver.

I am happy to try any debugging steps you can suggest.

Otto Bittner · Answer 1 · Sat Jan 20 2024 01:55:01 GMT+0800 (China Standard Time)

I found out that the issue only starts once I connect to the GPU for the first time. So, assuming nvidia-persistenced is not running, I can reboot the VM and kernel logs work as expected. After I run nvidia-smi for the first time, the logs are broken.
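
A before/after check of this kind (write a unique marker to /dev/kmsg, then see whether it shows up in the kernel log) could be scripted roughly as follows. This is a sketch under assumptions: it needs root to write to /dev/kmsg, it assumes the dmesg binary is on PATH and readable by the caller, and probe_kmsg and marker_in_log are hypothetical helper names chosen for this example.

```python
import subprocess
import uuid


def marker_in_log(marker: str, log_text: str) -> bool:
    """Return True if any line of the log text contains the marker."""
    return any(marker in line for line in log_text.splitlines())


def probe_kmsg(kmsg_path: str = "/dev/kmsg") -> bool:
    """Write a unique marker into the kernel log and check that dmesg shows it.

    Requires root (or equivalent permissions) for the write. Returns True
    if the marker is visible afterwards, False if it was lost.
    """
    marker = f"kmsg-probe-{uuid.uuid4().hex}"
    with open(kmsg_path, "w") as fh:
        fh.write(marker + "\n")
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return marker_in_log(marker, result.stdout)
```

Running probe_kmsg() once right after boot and once after the first nvidia-smi call would show whether the marker stops appearing at that point.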

Yifan Tan · Answer 2 · Sat Jan 20 2024 14:21:08 GMT+0800 (China Standard Time)

According to https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html:

Confidential Compute applications will not work on this release. Please continue to use 535.104.05 for this use case.

Could you try again using 535.104.05?

Otto Bittner · Answer 3 · Mon Jan 22 2024 16:14:28 GMT+0800 (China Standard Time)

I tried. Problem persists. :/
I also changed the kernel to exactly match what the docs require here, so I am now running 6.2.0-26 instead of 6.2.0-37.

Yifan Tan · Answer 4 · Mon Jan 22 2024 16:32:07 GMT+0800 (China Standard Time)

Please check:

Does the disk for the VM have enough space?
Could you build the nvidia driver without NV_VERBOSE=1 DEBUG=1?

Otto Bittner · Answer 5 · Mon Jan 22 2024 17:39:49 GMT+0800 (China Standard Time)

Yes.
The problem also exists with an unmodified driver from https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run

I can reproduce the error by running a minimal program that just executes cuInit.
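
To give an idea of what such a minimal program does, here is a stand-in written in Python with ctypes. It is not the example linked in the issue, only an assumed equivalent: it loads the CUDA driver library (assumed to be installed as libcuda.so.1) and calls cuInit(0), which is the driver API signature CUresult cuInit(unsigned int Flags). The function name cu_init is made up for this sketch.

```python
import ctypes
from typing import Optional


def cu_init(lib_name: str = "libcuda.so.1") -> Optional[int]:
    """Call cuInit(0) from the CUDA driver library.

    Returns the CUresult code as an int (0 means CUDA_SUCCESS),
    or None if the driver library could not be loaded.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        # Driver library not installed or not on the loader path.
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)
```

On the affected VM, calling cu_init() once and then writing a test line to /dev/kmsg should be enough to see whether kernel logging has stopped.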

Yifan Tan · Answer 6 · Mon Jan 22 2024 18:11:15 GMT+0800 (China Standard Time)

Could you find the broken log in /var/log/?

Also check nvidia-smi conf-compute -f and nvidia-smi conf-compute -gc.

By the way, does the CUDA application fail, or is only dmesg broken?

Otto Bittner · Answer 7 · Mon Jan 22 2024 18:44:48 GMT+0800 (China Standard Time)

Okay. It seems like using NVreg_RmMsg=":" as described here overwhelms the kernel logging subsystem if one is using CC-mode.

When I configure the module to print only warnings, the logs continue to work as expected, even with CC on. I guess something in the CC-only code paths produces more log output than the kernel can handle.

To answer your questions: the CUDA application continues to work. Both smi commands print the expected output. There are no logs in /var/log/kernel if the logging is overwhelmed.