cuInit breaks kernel logging with CC-on
derpsteb opened this issue
NVIDIA Open GPU Kernel Modules Version
535.129.03
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Ubuntu 22.04.2 LTS
Kernel Release
Linux guest 6.2.0-37-generic #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
NVIDIA H100 PCIe (UUID: GPU-6ba7987d-72f9-8f1b-66e8-ddba68758dd3)
Describe the bug
I am missing seemingly all kernel logs after booting into an AMD SEV-SNP VM with an H100 in CC-mode attached. I see a lot of NVRM debug logs before the login prompt appears on my serial console, including logs that I added myself. But after I ssh into the machine, no further logs are created, neither from the driver nor from running echo "foo" > /dev/kmsg.
The behavior occurs only when the VM is started as an AMD SEV-SNP VM, not when it is started as a normal VM. No other configuration on the disk is changed when switching the VM type.
To Reproduce
EDIT: The most minimal reproduction I could come up with is just executing cuInit. An example can be found here.
In an AMD SEV-SNP VM:
- modify /etc/sysctl.conf to contain kernel.printk = 7 4 1 7
- modify /etc/modprobe.d/nvidia-lkca.conf to contain install nvidia /sbin/modprobe ecdsa_generic ecdh; /sbin/modprobe --ignore-install nvidia NVreg_RmMsg=":"
- reboot
git clone git@github.com:NVIDIA/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
git checkout 535.129.03
make modules -j$(nproc) NV_VERBOSE=1 DEBUG=1
sudo make modules_install -j$(nproc)
- run echo "foo" > /dev/kmsg
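For readers unfamiliar with the kernel.printk setting used in the steps above, here is a minimal sketch that decodes its four fields. The helper names describe_printk and reaches_console are made up for illustration; the field order follows the kernel's sysctl documentation. Note that console_loglevel only filters what reaches the console (here, the serial console). Messages still land in the ring buffer read by dmesg regardless of this value.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any NVIDIA or kernel tool.

# Print the four kernel.printk fields by name.
# $1: the four values as one string, e.g. "7 4 1 7"
describe_printk() {
    set -- $1   # intentional word splitting on spaces/tabs
    [ "$#" -eq 4 ] || { echo "expected 4 fields, got $#" >&2; return 1; }
    echo "console_loglevel=$1"
    echo "default_message_loglevel=$2"
    echo "minimum_console_loglevel=$3"
    echo "default_console_loglevel=$4"
}

# Succeed if a message of the given level (0-7) would be shown on the console.
# A message is shown only if its level is numerically lower than console_loglevel.
# $1: kernel.printk string, $2: message level
reaches_console() {
    level=$2
    set -- $1
    [ "$level" -lt "$1" ]
}
```

On a live system the current values could be inspected with describe_printk "$(cat /proc/sys/kernel/printk)". With 7 4 1 7, messages at levels 0 through 6 reach the console, while level 7 (debug) messages do not.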
There is a lot of output from the kernel module during boot.
There is no output after booting is finished, including "foo" not being printed.
Running nvidia-smi would also trigger multiple prints in the normal VM. It does not in the confidential VM.
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
I expected to see the same logging behavior in a confidential VM as I am seeing in a normal VM.
I can't confirm that this does not happen with the proprietary driver, as the NVIDIA deployment guide for confidential computing explicitly says to use the OSS driver, not the proprietary one. I also wouldn't know how to manipulate the logging behavior of the proprietary driver.
I am happy to try any debugging steps you can suggest.
I found out that the issue only starts once I connect to the GPU for the first time. So, assuming nvidia-persistenced is not running, I can reboot the VM and kernel logs work as expected. After running nvidia-smi for the first time, the logs are broken.
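The before/after check described above can be sketched as a small script. This is only an illustration of the manual test, not a tool from the driver: the function names and the KMSG_DEV and LOG_READER variables are hypothetical, and the script assumes it runs with enough privileges to write to /dev/kmsg and read dmesg.

```shell
#!/bin/sh
# Hypothetical helpers for illustration. KMSG_DEV and LOG_READER are made-up
# override variables so the sink and the reader can be swapped out.
KMSG_DEV=${KMSG_DEV:-/dev/kmsg}
LOG_READER=${LOG_READER:-dmesg}

# Write a unique marker to the kernel log and report whether it can be read back.
# $1: a label for this probe, e.g. "before" or "after"
kmsg_probe() {
    marker="kmsg-probe-$1-$$"
    echo "$marker" > "$KMSG_DEV" || return 2
    if $LOG_READER | grep -qF "$marker"; then
        echo "$1: logging OK"
    else
        echo "$1: marker missing"
        return 1
    fi
}

# Probe, run the given trigger command (its output is discarded), probe again.
probe_around() {
    kmsg_probe before
    "$@" > /dev/null 2>&1
    kmsg_probe after
}
```

After a fresh reboot with nvidia-persistenced not running, something like probe_around nvidia-smi (run as root after sourcing the script) would be expected to report "before: logging OK" followed by "after: marker missing" on an affected VM, based on the behavior described in this issue.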
According to https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-129-03/index.html:
Confidential Compute applications will not work on this release. Please continue to use 535.104.05 for this use case.
Could you try again using 535.104.05?
I tried. Problem persists. :/
Also changed the kernel to match exactly with what the docs require here. So I am running 6.2.0-26 instead of 6.2.0-37 now.
Please check:
- Does the disk for the VM have enough space?
- Could you build the nvidia driver without NV_VERBOSE=1 DEBUG=1?
- Yes.
- The problem also exists with an unmodified driver from https://us.download.nvidia.com/tesla/535.104.05/NVIDIA-Linux-x86_64-535.104.05.run
I can reproduce the error by running a minimal program that just executes cuInit.
Could you find the broken log at /var/log/?
Also check nvidia-smi conf-compute -f and nvidia-smi conf-compute -gc.
By the way, does the CUDA application fail, or is only dmesg broken?
Okay. It seems like using NVreg_RmMsg=":" as described here overwhelms the kernel logging subsystem if one is using CC-mode.
When the module is configured to print only warnings, the logs continue to work as expected, even with CC-on. I guess something in the CC-only codepaths produces a prohibitive amount of logs for the kernel to handle.
To answer your questions: the cuda application continues to work. Both smi commands print the expected output. There are no logs in /var/log/kernel if the logging is overwhelmed.