aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

Can't build rocky9 AMIs due to hardcoded rocky8 URL/9.4 lustre issues

gbts opened this issue · comments

commented

This line in the Rocky cookbook, https://github.com/aws/aws-parallelcluster-cookbook/blob/16a081940c41cb3bb42cb2c1ddf8de3c059af788/cookbooks/aws-parallelcluster-platform/resources/install_packages/install_packages_rocky8.rb#L36, looks like a fallback for older kernel versions whose kernel headers are not available in the main repo. It instead tries to fetch kernel-devel directly from the BaseOS repo in the Rocky vault. But since kernel-devel was moved to the AppStream repo in RHEL/Rocky 9, that URL just returns a 404 on Rocky 9.
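To illustrate the mismatch, here is a minimal Python sketch of how a vault URL could be chosen per major release. It is not the cookbook's code (which is Ruby): the function names, the vault base URL layout, and the example kernel release strings are assumptions made for illustration, and the BaseOS/AppStream split simply follows the description above.

```python
# Hypothetical sketch: pick the vault repo that ships kernel-devel
# based on the Rocky major version. The URL layout is an assumption.
VAULT_BASE = "https://dl.rockylinux.org/vault/rocky"


def kernel_devel_repo(os_version: str) -> str:
    """Return the repo assumed to ship kernel-devel for this release."""
    major = int(os_version.split(".")[0])
    # Per the report above: BaseOS on Rocky 8, AppStream on Rocky 9.
    return "BaseOS" if major <= 8 else "AppStream"


def kernel_devel_url(os_version: str, arch: str, kernel_release: str) -> str:
    """Build the (assumed) vault URL of the kernel-devel RPM."""
    repo = kernel_devel_repo(os_version)
    return (
        f"{VAULT_BASE}/{os_version}/{repo}/{arch}/os/Packages/k/"
        f"kernel-devel-{kernel_release}.rpm"
    )


if __name__ == "__main__":
    # Example kernel release strings are illustrative only.
    print(kernel_devel_url("8.9", "x86_64", "4.18.0-513.5.1.el8_9.x86_64"))
    print(kernel_devel_url("9.3", "x86_64", "5.14.0-362.8.1.el9_3.x86_64"))
```

A URL hardcoded to BaseOS corresponds to always taking the first branch, which would explain the 404 on Rocky 9.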

Enabling UpdateOsPackages in the ImageBuilder template works around the 404, but it also upgrades the instance to 9.4, and the build then fails further down when trying to build the kernel module for Lustre (presumably because there's no support for the 9.4 kernel yet). So currently all Rocky 9 builds seem to be failing.

Hi @gbts, thanks for sharing this finding.

We're going to fix the code.
In the meantime, the Lustre drivers for RHEL/Rocky 9.4 have been released, so you should be able to build the custom Rocky 9 AMI by setting:

Build:
  UpdateOsPackages:
    Enabled: true

in the Image config.

commented

Thank you for taking a look at this @enrico-usai - for the record, before the 9.4 drivers were released I tried running with a patched cookbook that had the correct 9.3 URL, but that didn't work very well. Setting UpdateOsPackages=false was not enough to keep rocky from updating to 9.4, presumably because the 9.3 repos were empty by that time so attempting to install any package upgraded the whole system. I had some luck version-locking the kernel packages and pointing yum to the 9.3 vault repos, but in the meantime the 9.4 drivers were out so it didn't seem worth the effort anymore.
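For anyone who needs to pin a build to an older point release, the "point yum to the vault repos" step could look roughly like the Python sketch below, which renders a `.repo` file for a given release. This is not the exact setup used above: the section names, the function name, and the vault URL layout are assumptions, GPG settings are left to the system defaults, and the kernel version lock is a separate step (for example via the dnf versionlock plugin) that is not shown.

```python
import configparser
import io

# Assumed layout of the Rocky vault; adjust if the mirror differs.
VAULT_BASE = "https://dl.rockylinux.org/vault/rocky"


def render_vault_repo(os_version: str, repos=("BaseOS", "AppStream")) -> str:
    """Render a yum/dnf .repo file pointing at the vault for one release."""
    # interpolation=None keeps "$basearch" and any "%" characters literal.
    parser = configparser.ConfigParser(interpolation=None)
    for repo in repos:
        # Hypothetical section IDs, chosen to avoid clashing with stock repos.
        parser[f"{repo.lower()}-vault"] = {
            "name": f"Rocky Linux {os_version} - {repo} (vault)",
            "baseurl": f"{VAULT_BASE}/{os_version}/{repo}/$basearch/os/",
            "enabled": "1",
        }
    buf = io.StringIO()
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


if __name__ == "__main__":
    # The output would be saved as something like
    # /etc/yum.repos.d/rocky-vault.repo, with the stock repos disabled.
    print(render_vault_repo("9.3"))
```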

Anyway, the 9.4 builds work just fine now, so from my point of view this is resolved.

Thanks @gbts for sharing your approach.
We have tracked internally the need to make the cookbook code more robust and to avoid this kind of issue in the future.