aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

regression issue: Pcluster 3.9 cli - unofficial ami(s) and ec2-imagebuilder

zeekus opened this issue · comments

Issue Description

When using AWS ParallelCluster to build custom images with a custom parent AMI, the image build process fails with an error related to the installation of the Lustre client modules. Specifically, the error message states that no candidate version is available for the lustre-client-modules-6.5.0-1017-aws package.

This issue does not occur when using the official supported AMIs provided by AWS ParallelCluster.

Reproduction Steps

  1. Use the pcluster build-image command to create a custom image with a custom parent AMI (e.g., Ubuntu 22.04).
  2. During the image build process, the installation of the Lustre client modules fails with the following error:
Stdout: [2024-06-08T08:17:48+00:00] FATAL: Chef::Exceptions::Package: lustre[Install FSx options] (aws-parallelcluster-environment::install line 22) had an error: Chef::Exceptions::Package: apt_package[lustre-client-modules-6.5.0-1017-aws, lustre-client-modules-aws, initramfs-tools] (aws-parallelcluster-environment::install line 27) had an error: Chef::Exceptions::Package: No candidate version available for lustre-client-modules-6.5.0-1017-aws
  3. The image build status is marked as BUILD_FAILED.
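
For anyone trying to reproduce this, the steps above can be sketched roughly as follows. This is a minimal sketch, not the reporter's actual setup: the helper names `render_image_config` and `build_image_command`, the `c5.xlarge` instance type, the `image-config.yaml` file name, and the placeholder AMI ID are all assumptions made for illustration. Only the `Build` / `ParentImage` / `InstanceType` keys and the `--image-id`, `--image-configuration`, and `--region` options come from the ParallelCluster CLI itself.

```python
import shlex


def render_image_config(parent_image, instance_type="c5.xlarge"):
    """Render a minimal build-image configuration as YAML text.

    The instance type default is an assumption; the issue does not say
    which one was used.
    """
    return (
        "Build:\n"
        f"  InstanceType: {instance_type}\n"
        f"  ParentImage: {parent_image}\n"
    )


def build_image_command(image_id, config_path, region):
    """Return the argument list for a `pcluster build-image` invocation."""
    return [
        "pcluster", "build-image",
        "--image-id", image_id,
        "--image-configuration", config_path,
        "--region", region,
    ]


if __name__ == "__main__":
    # Hypothetical values: substitute your own custom parent AMI ID.
    print(render_image_config("ami-REPLACE_WITH_CUSTOM_AMI"))
    print(shlex.join(build_image_command("ubuntu22custom", "image-config.yaml", "us-east-1")))
```

Writing the rendered text to `image-config.yaml` and running the printed command should start a build comparable to the failing one, provided the parent AMI is a stock Ubuntu 22.04 image rather than an official ParallelCluster AMI.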

Affected Versions

  • AWS ParallelCluster version 3.9.2 (and potentially other versions)

Additional Notes

  • The issue does not occur when using the official supported AMIs provided by AWS ParallelCluster.
  • The AWS support team has been notified and is investigating the root cause of this behavior.
  • Further updates and potential workarounds or resolutions will be provided by the AWS support team.

I think I know what is going on here. It appears that 'pcluster build-image' only works when you start from an 'official ami'.

ref: pcluster list-official-images | less

I built two images successfully from the official ami(s).

[Image: image_builder_from_official_images]

Custom images seem to fail with an error similar to this one. Maybe it has something to do with the Lustre change of late.

[Image]

I opened a support ticket with AWS EC2 Image Builder. There appears to be a change introduced in Image Builder that may be preventing pcluster users from running pcluster build-image with a custom AMI. Here are the tech notes from my ticket:

source: AWS support.

From the case details, I understand that you’re preparing images using AWS ParallelCluster (which uses Image Builder on the backend) for building HPC clusters, but you are encountering Chef errors that were not observed previously. You therefore wish to know whether changes have been made to the Chef code that are causing such errors. Please feel free to correct me if there is any gap in my understanding of the issue.

To verify the issue on my end, I replicated the scenario in my test environment by creating two ParallelCluster images, one from the official supported image and the other from a custom image.

Since you observed the issue on all Linux distros, I replicated it using an Ubuntu 22.04 AMI, as it is one of the AMIs that comes with the SSM agent preinstalled.

I ran the “pcluster build-image” command with a build configuration similar to yours for the AMI “ami-039bb043f0419a703”, which is the official Ubuntu 22.04 image for ParallelCluster in the us-east-1 region. I then ran the same command again with a custom Ubuntu 22.04 AMI ID as the ParentImage. To simplify the replication and isolate the issue, I did not install any additional packages on the custom AMI.

After both builds completed, I observed the same results as you. The build with the official Ubuntu 22.04 image completed successfully, but the build with the custom Ubuntu 22.04 AMI as the ParentImage failed with the error below, which is the same one you observed:
~~~~~

Stdout: [2024-06-08T08:17:48+00:00] FATAL: Chef::Exceptions::Package: lustre[Install FSx options] (aws-parallelcluster-environment::install line 22) had an error: Chef::Exceptions::Package: apt_package[lustre-client-modules-6.5.0-1017-aws, lustre-client-modules-aws, initramfs-tools] (aws-parallelcluster-environment::install line 27) had an error: Chef::Exceptions::Package: No candidate version available for lustre-client-modules-6.5.0-1017-aws
~~~~~ 

I also verified the build status, as shown below:
————————
>> pcluster list-images --r us-east-1 --image-status AVAILABLE

{
  "images": [
    {
      "imageId": "ubuntu_22_official",
      "imageBuildStatus": "BUILD_COMPLETE",
      "ec2AmiInfo": {
        "amiId": "ami-04785fxxxxxxxxx"
      },
      "region": "us-east-1",
      "version": "3.9.2"
    }
  ]
}


>> pcluster list-images --r us-east-1 --image-status FAILED
{
  "images": [
    {
      "imageId": "ubuntu22custom",
      "imageBuildStatus": "BUILD_FAILED",
      "cloudformationStackStatus": "CREATE_FAILED",
      "cloudformationStackArn": "arn:aws:cloudformation:us-east-1:xxxxxxxxxxx:stack/ubuntu22custom/4a6065c0-2569-11ef-8010-123a7abb44f1",
      "region": "us-east-1",
      "version": "3.9.2"
    }
  ]
}
————————


With these observations in mind, I have reached out to our internal team to gather more information on the root cause of this behaviour. Please note that it can take some time before we receive a first response from the team, so it is difficult to provide an ETA. However, rest assured that I will do my best to make sure this gets the attention it needs.
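
The status check quoted in the support notes can also be scripted. Below is a small sketch that reads the JSON printed by `pcluster list-images` and picks out the failed builds. The function names and the trimmed `SAMPLE` payload are assumptions made for illustration; the field names (`imageId`, `imageBuildStatus`) are taken from the output quoted above.

```python
import json


def summarize_images(list_images_output):
    """Return (imageId, imageBuildStatus) pairs from `pcluster list-images` JSON."""
    payload = json.loads(list_images_output)
    return [
        (image["imageId"], image.get("imageBuildStatus", "UNKNOWN"))
        for image in payload.get("images", [])
    ]


def failed_images(list_images_output):
    """Return the IDs of images whose build status is BUILD_FAILED."""
    return [
        image_id
        for image_id, status in summarize_images(list_images_output)
        if status == "BUILD_FAILED"
    ]


# Trimmed copy of the FAILED listing quoted in the support notes.
SAMPLE = """
{
  "images": [
    {
      "imageId": "ubuntu22custom",
      "imageBuildStatus": "BUILD_FAILED",
      "cloudformationStackStatus": "CREATE_FAILED",
      "region": "us-east-1",
      "version": "3.9.2"
    }
  ]
}
"""

if __name__ == "__main__":
    print(failed_images(SAMPLE))
```

In practice the input would come from something like `pcluster list-images --region us-east-1 --image-status FAILED`, captured as text and passed to `failed_images`.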

Hi @zeekus, thanks for your interest in ParallelCluster and for reporting this issue.

The build with the vanilla AMI fails because it runs on kernel 6.5.0, which is not yet supported by the latest FSx Lustre client.

The build with the official ParallelCluster AMI 3.9.2 succeeds because it runs on kernel 6.2.0, which is supported.

If you want to build a custom AMI, you need to use a ParentImage that runs kernel 6.2.0.
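
A quick way to check a candidate parent image before starting a build is to look at its kernel release (the output of `uname -r`) and compare it with the supported series. The sketch below assumes that the kernel-specific package name is `lustre-client-modules-` followed by the kernel release, which matches the package named in the error above, and that 6.2.0 is the only supported series, per the explanation above. The function names and the `SUPPORTED_KERNEL_SERIES` constant are hypothetical and would need updating as the FSx Lustre client gains support for newer kernels.

```python
import platform

# Assumption: taken from the maintainer's reply (kernel 6.2.0 is supported,
# 6.5.0 is not yet). Adjust as new Lustre client packages are published.
SUPPORTED_KERNEL_SERIES = ("6.2.0",)


def lustre_modules_package(kernel_release):
    """Name of the kernel-specific Lustre client modules package."""
    return f"lustre-client-modules-{kernel_release}"


def kernel_series(kernel_release):
    """Strip the build suffix: '6.5.0-1017-aws' becomes '6.5.0'."""
    return kernel_release.split("-", 1)[0]


def is_supported(kernel_release, supported=SUPPORTED_KERNEL_SERIES):
    """True if the kernel release belongs to a supported series."""
    return kernel_series(kernel_release) in supported


if __name__ == "__main__":
    release = platform.release()  # same value as `uname -r` on Linux
    print(f"kernel release:  {release}")
    print(f"package needed:  {lustre_modules_package(release)}")
    print(f"supported:       {is_supported(release)}")
```

Run on an instance launched from the intended ParentImage, this reports whether the build is likely to hit the "No candidate version available" error. It does not query apt, so it is only a heuristic based on the kernel series.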

Thanks for the update. This seems like a 'regression issue', so I updated the title to say regression. That seems to fit.