aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

EFS not mounting on computing nodes

maestro7879 opened this issue

Required Info:

  • AWS ParallelCluster version [e.g. 3.1.1]: 3.8.0
  • Full cluster configuration without any credentials or personal data.
  • Cluster name: prod-cluster
  • Output of pcluster describe-cluster command.
  • [Optional] Arn of the cluster CloudFormation main stack:

Bug description and how to reproduce:
I'm receiving the error below when compute nodes launch. The head node is fine and mounts EFS.
This is a custom AMI with AWS ParallelCluster installed on top.

I'm able to mount EFS manually on the compute nodes, so I have ruled out connectivity.
I'm assuming the commands below are what is being run. They work manually, but the mount still fails during the compute node build.
sudo mount -t efs -o tls fs-03ef4f0037ca3b4da:/ /opt/parallelcluster/shared
sudo mount -t efs -o tls fs-08ecab8914d696b6f:/ /opt/parallelcluster/home

If there is somewhere I can look to see what the recipe below is attempting, that would be helpful.
Recipe: aws-parallelcluster-environment::mount_internal_use_ebs

  • volume[mount /opt/parallelcluster/shared] action mount
    [2024-02-10T01:11:03+00:00] INFO: Processing volume[mount /opt/parallelcluster/shared] action mount (aws-parallelcluster-environment::mount_internal_use_ebs line 22)

    • directory[/opt/parallelcluster/shared] action create
      [2024-02-10T01:11:03+00:00] INFO: Processing directory[/opt/parallelcluster/shared] action create (aws-parallelcluster-environment::mount_internal_use_ebs line 42)
      [2024-02-10T01:11:03+00:00] INFO: directory[/opt/parallelcluster/shared] mode changed to 1777

      • change mode from '0755' to '01777'
    • mount[mount /opt/parallelcluster/shared] action mount
      [2024-02-10T01:11:03+00:00] INFO: Processing mount[mount /opt/parallelcluster/shared] action mount (aws-parallelcluster-environment::mount_internal_use_ebs line 51)
      [2024-02-10T01:14:06+00:00] INFO: Retrying execution of mount[mount /opt/parallelcluster/shared], 9 attempts left

The issue was with iptables on the head node: port 2049 (NFS) wasn't open for some reason.
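
For anyone hitting the same symptom, here is a rough sketch of how the port could be checked from a compute node. It assumes bash and coreutils `timeout` are available. `port_open` and `HEAD_NODE_IP` are made-up names for illustration, not part of ParallelCluster, and the 127.0.0.1 default is only a placeholder for the head node's private IP. The commented iptables lines are ordinary iptables usage for listing and inserting a rule; the right rule for a given setup may well differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ParallelCluster): returns 0 if a TCP
# connection to host:port can be opened within 5 seconds, non-zero otherwise.
port_open() {
  local host="$1" port="$2"
  timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Placeholder: set HEAD_NODE_IP to the head node's private IP before running.
HEAD_NODE_IP="${HEAD_NODE_IP:-127.0.0.1}"

if port_open "$HEAD_NODE_IP" 2049; then
  echo "TCP 2049 is reachable on $HEAD_NODE_IP"
else
  echo "TCP 2049 is blocked or closed on $HEAD_NODE_IP"
fi

# On the head node, the INPUT chain can be inspected with:
#   sudo iptables -L INPUT -n --line-numbers
# and, if 2049 is being dropped, one possible rule to allow it is:
#   sudo iptables -I INPUT -p tcp --dport 2049 -j ACCEPT
```

If the check reports the port as blocked while the security groups allow it, a host firewall on the head node is a likely place to look, which matches what happened here.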