aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page:https://github.com/aws/aws-parallelcluster

(3.0.0-3.7.2) Nvidia Driver and EFA incompatibility issue when using P4 and P5 instances with updated Kernel drivers

enrico-usai opened this issue

Bug description

The Linux kernel community introduced a change that is incompatible with the EFA and Nvidia drivers. This change has propagated to recent releases of Linux distributions, including Amazon Linux. On instance types that use GPUDirect RDMA (which lets the EFA device read from and write to GPU memory directly), the EFA kernel module is unable to retrieve GPU memory information.

Nvidia introduced an open-source (OSS) version of its drivers, known as OpenRM, that is compatible with this kernel change. EFA released a new version, 1.29.0, that is compatible with recent kernels and with the OSS Nvidia driver.

Using P4 or P5 instance types with a recently released Linux kernel RPM, in combination with EFA and the non-OSS Nvidia drivers (ParallelCluster < 3.8.0), causes communication between your workload nodes over EFA to stop working.

If you’re using ParallelCluster 3.7.2 or an earlier version with the official ParallelCluster AMIs, you are affected only if you update the kernel to a newer version and use an instance type with GPUDirect RDMA and EFA (such as P4 or P5).

In the logs you can find an error like the following:

`kernel: failing symbol_get of non-GPLONLY symbol nvidia_p2p_get_pages.`
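
To check a node for this error, the kernel log can be searched for the message text. The following is a minimal sketch, not an official tool: the function name `has_p2p_symbol_error` is made up for illustration, and it only looks for the exact message quoted above.

```shell
#!/bin/sh
# Sketch: detect the symbol_get failure described in this issue.
# has_p2p_symbol_error is a hypothetical helper name, not part of ParallelCluster.

has_p2p_symbol_error() {
    # Reads log text on stdin; exit status is 0 when the error message is present.
    grep -qF 'failing symbol_get of non-GPLONLY symbol nvidia_p2p_get_pages'
}
```

For example, `dmesg | has_p2p_symbol_error && echo "symbol_get failure found"` checks the kernel ring buffer (reading it may require root); `journalctl -k` can be piped in the same way on systemd-based images.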

Affected versions

P4 or P5 instance types in combination with EFA and the non-OSS Nvidia drivers (ParallelCluster <= 3.7.2) stop working once the Linux kernel is updated to one of the following versions or later: 4.14.326, 5.4.257, 5.10.195, 5.15.131, 6.1.52.
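
The version thresholds above can be turned into a quick check. This is a sketch under stated assumptions: `kernel_status` is a hypothetical helper, it compares the patch level only within the five kernel series listed in this issue, and it reports any other series as "unknown" because the issue does not list them.

```shell
#!/bin/sh
# Sketch: classify a `uname -r` style release string against the thresholds
# listed in this issue. kernel_status is a hypothetical helper name.

kernel_status() (
    # The body runs in a subshell, so these variables do not leak to the caller.
    release=$1

    # Require at least MAJOR.MINOR.PATCH, each starting with a digit.
    case $release in
        [0-9]*.[0-9]*.[0-9]*) ;;
        *) echo "unknown"; exit 0 ;;
    esac

    major=${release%%.*}
    rest=${release#*.}
    minor=${rest%%.*}
    rest=${rest#*.}
    patch=${rest%%[!0-9]*}   # drop suffixes such as "-198.748.amzn2.x86_64"

    # First affected patch level for each kernel series named in the issue.
    case "$major.$minor" in
        4.14) threshold=326 ;;
        5.4)  threshold=257 ;;
        5.10) threshold=195 ;;
        5.15) threshold=131 ;;
        6.1)  threshold=52 ;;
        *)    threshold="" ;;   # series not listed in the issue
    esac

    if [ -z "$threshold" ] || [ -z "$patch" ]; then
        echo "unknown"
    elif [ "$patch" -ge "$threshold" ]; then
        echo "affected"
    else
        echo "not affected"
    fi
)

# Classify the running kernel.
kernel_status "$(uname -r)"
```

An "affected" result only means the kernel carries the change; the problem appears only when that kernel is combined with EFA, the non-OSS Nvidia driver, and a GPUDirect RDMA instance type.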

How to check affected components

  • Kernel version: `uname -r`
  • Installed Nvidia driver version: `nvidia-smi`
  • License of the Nvidia kernel module: `modinfo -F license nvidia`, which returns `Dual MIT/GPL` for the open-source driver or `NVIDIA` for the closed-source driver
  • Installed EFA version: `cat /opt/amazon/efa_installed_packages | grep -E -o "EFA installer version: [0-9.]+"`
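
The license check can be wrapped in a small helper to make the result easier to read. This is a sketch only: `nvidia_driver_flavor` is a hypothetical name, and the mapping relies solely on the two license strings quoted in the list above; any other value, including an empty one when the module is not installed, is reported as "unknown".

```shell
#!/bin/sh
# Sketch: map the license string of the nvidia kernel module to a driver flavor.
# nvidia_driver_flavor is a hypothetical helper name.

nvidia_driver_flavor() {
    case $1 in
        "Dual MIT/GPL") echo "open-source" ;;
        "NVIDIA")       echo "closed" ;;
        *)              echo "unknown" ;;
    esac
}

# Classify the driver on this node; prints "unknown" if the module is not found.
nvidia_driver_flavor "$(modinfo -F license nvidia 2>/dev/null)"
```

A "closed" result on a node that also runs one of the affected kernels matches the failing combination described in this issue.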

Mitigation

Starting with ParallelCluster 3.8.0, all official ParallelCluster AMIs ship with the OSS Nvidia drivers and EFA 1.29.0 by default, so that customers can use recent kernels and safely ingest security fixes.

To build a custom AMI for ParallelCluster <= 3.7.2 with an updated kernel and the OSS Nvidia drivers, please follow the guide "How to create a custom AMI with Open Source Nvidia drivers for P4 and P5".

This issue can be closed, since ParallelCluster 3.8.0 has already been released with the open-source Nvidia drivers and EFA 1.29.0.
It was created to track this known issue and to provide official instructions for dealing with it.