aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

cluster create can't access script on S3

blakemertz opened this issue

I went to create a new cluster with v3.8.0, and the setup script that I used to create a cluster with v3.5 is no longer working:

WaitCondition received failed message: 'fetch_and_run - Failed to download OnNodeConfigured script 1 using aws s3. Please check /var/log/cfn-init.log in the head node, or check the cfn-init.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs for more details on ParallelCluster logs.' for uniqueId: i-0c83383427d71dbf9

I have checked the file in S3, and nothing has changed since I used it successfully to create my cluster in 3.5. I came across a couple of issue reports about scripts that suggested removing additional comments (#) from the script and converting from Windows to Linux line endings (I ran dos2unix on it to make sure), but no joy. Attached is my cfn-init.log file, and my redacted YAML is below:

Imds:
  ImdsSupport: v2.0
HeadNode:
  InstanceType: c5.2xlarge
  Imds:
    Secured: true
  Ssh:
    KeyName: mertz_key
  LocalStorage:
    RootVolume:
      VolumeType: gp3
  Networking:
    SubnetId: subnet-659e6f4a
    ElasticIp: true
    AdditionalSecurityGroups:
      - sg-e2f0df90
  Iam:
    AdditionalIamPolicies:
      - Policy: >-
          arn:aws:iam::2710136xxxxx:policy/DomainCertificateSecretReadPolicy-modulus-AD
  CustomActions:
    OnNodeConfigured:
      Script: s3://parallelcluster-10552b48cfa2e9aa-v1-do-not-delete/active-directory.head.post.sh
      Args:
        - arn:aws:secretsmanager:us-east-1:2710136xxxxx:secret:DomainCertificateSecret-modulus-AD-bzzAOd
        - /opt/parallelcluster/shared/directory_service/domain-certificate.crt
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: modbind
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: modbind-cr-0
          Instances:
            - InstanceType: g4dn.xlarge
          MinCount: 0
          MaxCount: 100
          DisableSimultaneousMultithreading: true
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-659xxxxx
        AssignPublicIp: true
        PlacementGroup: {}
        AdditionalSecurityGroups:
          - sg-e2fxxxxx
      CustomActions: {}
    - Name: parameters
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: parameters-cr-0
          Instances:
            - InstanceType: c6a.24xlarge
          MinCount: 0
          MaxCount: 5
          DisableSimultaneousMultithreading: true
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-659xxxxx
        AssignPublicIp: true
        PlacementGroup: {}
        AdditionalSecurityGroups:
          - sg-e2fxxxxx
    - Name: modbind-spot
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: modbind-spot-cr-0
          Instances:
            - InstanceType: g4dn.xlarge
          MinCount: 0
          MaxCount: 100
          DisableSimultaneousMultithreading: true
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-659xxxxx
        AssignPublicIp: true
        PlacementGroup: {}
        AdditionalSecurityGroups:
          - sg-e2fxxxxx
      CapacityType: SPOT
  SlurmSettings: {}
Region: us-east-1
Image:
  Os: centos7
DirectoryService:
  GenerateSshKeysForUsers: true
  DomainName: modulus.ad.com
  DomainAddr: ldaps://modulus.ad.com
  PasswordSecretArn: >-
    arn:aws:secretsmanager:us-east-1:2710136xxxxx:secret:PasswordSecret-modulus-AD-MM2LIh
  DomainReadOnlyUser: cn=ReadOnlyUser,ou=Users,ou=MODULUS,dc=modulus,dc=ad,dc=com
  LdapTlsCaCert: /opt/parallelcluster/shared/directory_service/domain-certificate.crt
  LdapTlsReqCert: hard
SharedStorage:
  - Name: Efs0
    StorageType: Efs
    MountDir: /shared
    EfsSettings:
      FileSystemId: fs-c3exxxxx

and here is the script that I am failing to download/run:

#!/bin/bash
set -e

CERTIFICATE_SECRET_ARN="$1"
CERTIFICATE_PATH="$2"

[[ -z $CERTIFICATE_SECRET_ARN ]] && echo "[ERROR] Missing CERTIFICATE_SECRET_ARN" && exit 1
[[ -z $CERTIFICATE_PATH ]] && echo "[ERROR] Missing CERTIFICATE_PATH" && exit 1

source /etc/parallelcluster/cfnconfig
REGION="${cfn_region:?}"

mkdir -p "$(dirname "$CERTIFICATE_PATH")"
aws secretsmanager get-secret-value --region "$REGION" --secret-id "$CERTIFICATE_SECRET_ARN" --query SecretString --output text > "$CERTIFICATE_PATH"

Error message from the cfn-init.log:

fetch_and_run - Failed to download OnNodeConfigured script 1 s3://parallelcluster-10552b48cfa2e9aa-v1-do-not-delete/active-directory.head.post.sh using aws s3, cause: An error occurred (403) when calling the HeadObject operation: Forbidden. Please check /var/log/cfn-init.log in the head node, or check the cfn-init.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs for more details on ParallelCluster logs.
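
A 403 on HeadObject typically means the request reached S3 but the credentials in use were not allowed to read that object. One way to confirm this outside of cluster creation is to issue the same kind of request by hand from the head node. The following is only a sketch under assumptions: `s3_bucket`, `s3_key`, and `check_script_access` are hypothetical helper names (not part of ParallelCluster), and it assumes the AWS CLI is installed and picks up the instance role.

```shell
#!/bin/bash
# Hypothetical helpers (not part of ParallelCluster) for checking S3 access by hand.

# Print the bucket part of an s3://bucket/key URI.
s3_bucket() {
  local uri="${1#s3://}"          # drop the scheme
  printf '%s\n' "${uri%%/*}"      # keep everything before the first slash
}

# Print the key part of an s3://bucket/key URI.
s3_key() {
  local uri="${1#s3://}"          # drop the scheme
  printf '%s\n' "${uri#*/}"       # keep everything after the first slash
}

# Show which identity the CLI is using, then issue a HeadObject request.
# Returns non-zero if either call fails.
check_script_access() {
  aws sts get-caller-identity --query Arn --output text || return 1
  aws s3api head-object --bucket "$(s3_bucket "$1")" --key "$(s3_key "$1")"
}
```

After sourcing this on the head node, `check_script_access s3://parallelcluster-10552b48cfa2e9aa-v1-do-not-delete/active-directory.head.post.sh` would be expected to fail with the same 403 for as long as the head node role lacks read access to the bucket.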

The head node doesn't have permission to access the S3 bucket. Could you try adding S3Access to the Iam section of the HeadNode configuration to see if that works?

Iam:
    AdditionalIamPolicies:
      - Policy: >-
          arn:aws:iam::2710136xxxxx:policy/DomainCertificateSecretReadPolicy-modulus-AD
    S3Access:
      - BucketName: your_bucket_name

Thank you!
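
For reference, the `BucketName` value is the bucket portion of the `Script` URI, without the `s3://` prefix or the object key. As a rough illustration, the sketch below derives it from the URI and prints a fragment to merge into the config. `print_s3access` is a hypothetical helper name, not a pcluster command, and the indentation assumes the fragment sits directly under `HeadNode:`.

```shell
#!/bin/bash
# Hypothetical helper: print an Iam/S3Access fragment for the bucket
# that hosts a custom action script. Not part of the pcluster CLI.
print_s3access() {
  local uri="${1#s3://}"     # drop the scheme
  local bucket="${uri%%/*}"  # bucket is everything before the first slash
  printf '  Iam:\n    S3Access:\n      - BucketName: %s\n' "$bucket"
}
```

If the config already has an `Iam:` block under `HeadNode:`, only the `S3Access` lines need to be merged into it, next to `AdditionalIamPolicies`.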

@chenwany thanks so much for catching that missing parameter -- you saved my bacon. Once I added the S3Access section, the cluster was created successfully.