(3.8.0-later) ParallelCluster official AMIs incompatibility issue with P3, P2, G3 and G2 instances

Question

enrico-usai opened this issue 5 months ago

Bug description

The Linux kernel community introduced a change that is incompatible with EFA and Nvidia drivers. This change has propagated to recent releases of Linux distributions, including Amazon Linux. On instance types that use GPUDirect RDMA (the ability to read and write directly between the EFA device and GPU memory), the EFA kernel module is unable to retrieve GPU memory information.

Nvidia introduced an open-source (OSS) version of their drivers, known as OpenRM, that is compatible with this kernel change. EFA released a new version, 1.29.0, that is compatible with recent kernels and with the OSS Nvidia driver.

Starting from ParallelCluster 3.8.0, we install the OSS Nvidia drivers and EFA 1.29.0 by default in all official ParallelCluster AMIs, so that customers can use recent kernels and safely ingest security fixes.

Unfortunately, the OSS Nvidia drivers only support Turing, Ampere, Hopper or later GPUs. The full list of compatible GPUs is available here. This means that P3, P2, G3 and G2 instances are no longer supported with official ParallelCluster 3.8.0+ AMIs.

On these instance types the bootstrap of the instance will fail, and the logs will contain an error like the following:

[2024-01-10T18:34:22+00:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: gdrcopy[Configure gdrcopy] (aws-parallelcluster-platform::nvidia_config line 22) had an error: Mixlib::ShellOut::ShellCommandFailed: service[gdrcopy] (aws-parallelcluster-platform::nvidia_config line 103) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'init-output
---- Begin output of ["/bin/systemctl", "--system", "start", "gdrcopy"] ----
STDOUT:
STDERR: Job for gdrcopy.service failed because the control process exited with error code. See "systemctl status gdrcopy.service" and "journalctl -xe" for details.
---- End output of ["/bin/systemctl", "--system", "start", "gdrcopy"] ----
Ran ["/bin/systemctl", "--system", "start", "gdrcopy"] returned
+ error_exit 'Failed to run bootstrap recipes. If --norollback was specified, check /var/log/cfn-init.log and /var/log/cloud-init-output.log.'
+ echo 'Bootstrap failed with error: Failed to run bootstrap recipes. If --norollback was specified, check /var/log/cfn-init.log and /var/log/cloud-init-output.log.'
Bootstrap failed with error: Failed to run bootstrap recipes. If --norollback was specified, check /var/log/cfn-init.log and /var/log/cloud-init-output.log.
+ sleep 10
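
To check whether a failed node hit this specific error, the log files named in the message above can be searched for the gdrcopy failure. The following is a minimal sketch, not an official ParallelCluster tool: the function names and the regular expression are illustrative assumptions, derived only from the log excerpt above.

```python
import re
from pathlib import Path

# Assumption: these two patterns are taken from the log excerpt in this issue;
# other ParallelCluster versions may word the failure differently.
GDRCOPY_FAILURE = re.compile(
    r"Job for gdrcopy\.service failed|service\[gdrcopy\].*had an error"
)

# Log files mentioned in the bootstrap error message.
DEFAULT_LOGS = ("/var/log/cfn-init.log", "/var/log/cloud-init-output.log")


def has_gdrcopy_failure(log_text: str) -> bool:
    """Return True if any line of the log text matches the gdrcopy failure."""
    return any(GDRCOPY_FAILURE.search(line) for line in log_text.splitlines())


def scan_logs(paths=DEFAULT_LOGS) -> dict:
    """Map each existing log path to whether it contains the failure."""
    results = {}
    for name in paths:
        path = Path(name)
        if path.is_file():
            results[name] = has_gdrcopy_failure(path.read_text(errors="replace"))
    return results
```

Calling `scan_logs()` on the failed instance returns one entry per log file that exists; a `True` value suggests the node failed while starting the gdrcopy service, as in the excerpt above.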

Affected versions

Instances whose GPU is not Turing, Ampere, Hopper or later are unable to bootstrap with official ParallelCluster >= 3.8.0 AMIs. This means P3, P2, G3 and G2 instances are not supported by these AMIs.
The full list of GPUs not impacted by the issue is available here.

If you are using ParallelCluster 3.8.0 with an instance type that supports GPUDirect RDMA and EFA (such as P4 or P5), you are not affected.
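
The rule in this section can be sketched as a small pre-flight check before creating a cluster. This is an illustration under stated assumptions, not part of ParallelCluster: the function name is hypothetical, the affected families are the four listed in this issue, and size variants such as `p3dn` or `g3s` are assumed to belong to the P3 and G3 families.

```python
import re

# Families listed in this issue as unsupported by official 3.8.0+ AMIs.
AFFECTED_FAMILIES = {"p2", "p3", "g2", "g3"}
FIRST_AFFECTED_VERSION = (3, 8, 0)


def is_affected(instance_type: str, pcluster_version: str) -> bool:
    """Return True if the official AMI for this version cannot bootstrap the type."""
    # Parse "3.8.0" into (3, 8, 0) so versions compare numerically.
    version = tuple(int(part) for part in pcluster_version.split(".")[:3])
    if version < FIRST_AFFECTED_VERSION:
        return False
    # Extract the family, e.g. "p3" from "p3.2xlarge" or "p3dn.24xlarge".
    match = re.match(r"([a-z]+)(\d+)", instance_type.lower())
    if match is None:
        return False
    return match.group(1) + match.group(2) in AFFECTED_FAMILIES
```

For example, `is_affected("p3.2xlarge", "3.8.0")` is true, while `is_affected("p4d.24xlarge", "3.8.0")` and `is_affected("g2.2xlarge", "3.7.2")` are both false.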

Mitigation

If you want to use P3, P2, G3 or G2 instance types with ParallelCluster 3.8.0 or later, you need to build your own custom AMI.

For ParallelCluster == 3.8.0, please follow "How to build a custom AMI with Closed Source Nvidia drivers for P3, P2, G3 and G2".
For ParallelCluster > 3.8.0, please follow "How to build a custom AMI with Closed Source Nvidia drivers for P3, P2, G3 and G2".

Max Burian · Answer 1 · Fri Mar 01 2024 20:45:03 GMT+0800 (China Standard Time)

Hi @enrico-usai ,

Thanks for the detailed description of this grinding topic. The custom AMI workaround is really not great and has already led to a range of interactions with official AWS support. I would strongly encourage AWS to solve this in one of the upcoming PC-native AMIs. So thumbs up from my side!