cluster create can't access script on S3
blakemertz opened this issue · comments
I tried to create a new cluster with ParallelCluster v3.8.0, and the setup script that I used to create a cluster with v3.5 no longer works:
WaitCondition received failed message: 'fetch_and_run - Failed to download OnNodeConfigured script 1 using aws s3. Please check /var/log/cfn-init.log in the head node, or check the cfn-init.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs for more details on ParallelCluster logs.' for uniqueId: i-0c83383427d71dbf9
I have checked the file in S3, and nothing has changed since I used it successfully to create my cluster in 3.5. I came across a couple of issue reports about scripts that suggested removing extra comments (#) from the script and converting it from Windows to Linux line endings (I ran dos2unix on it to make sure), but no joy. Attached is my cfn-init.log file, and my redacted YAML is below:
Imds:
  ImdsSupport: v2.0
HeadNode:
  InstanceType: c5.2xlarge
  Imds:
    Secured: true
  Ssh:
    KeyName: mertz_key
  LocalStorage:
    RootVolume:
      VolumeType: gp3
  Networking:
    SubnetId: subnet-659e6f4a
    ElasticIp: true
    AdditionalSecurityGroups:
      - sg-e2f0df90
  Iam:
    AdditionalIamPolicies:
      - Policy: >-
          arn:aws:iam::2710136xxxxx:policy/DomainCertificateSecretReadPolicy-modulus-AD
  CustomActions:
    OnNodeConfigured:
      Script: s3://parallelcluster-10552b48cfa2e9aa-v1-do-not-delete/active-directory.head.post.sh
      Args:
        - arn:aws:secretsmanager:us-east-1:2710136xxxxx:secret:DomainCertificateSecret-modulus-AD-bzzAOd
        - /opt/parallelcluster/shared/directory_service/domain-certificate.crt
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: modbind
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: modbind-cr-0
          Instances:
            - InstanceType: g4dn.xlarge
          MinCount: 0
          MaxCount: 100
          DisableSimultaneousMultithreading: true
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-659xxxxx
        AssignPublicIp: true
        PlacementGroup: {}
        AdditionalSecurityGroups:
          - sg-e2fxxxxx
      CustomActions: {}
    - Name: parameters
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: parameters-cr-0
          Instances:
            - InstanceType: c6a.24xlarge
          MinCount: 0
          MaxCount: 5
          DisableSimultaneousMultithreading: true
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-659xxxxx
        AssignPublicIp: true
        PlacementGroup: {}
        AdditionalSecurityGroups:
          - sg-e2fxxxxx
    - Name: modbind-spot
      AllocationStrategy: lowest-price
      ComputeResources:
        - Name: modbind-spot-cr-0
          Instances:
            - InstanceType: g4dn.xlarge
          MinCount: 0
          MaxCount: 100
          DisableSimultaneousMultithreading: true
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Networking:
        SubnetIds:
          - subnet-659xxxxx
        AssignPublicIp: true
        PlacementGroup: {}
        AdditionalSecurityGroups:
          - sg-e2fxxxxx
      CapacityType: SPOT
  SlurmSettings: {}
Region: us-east-1
Image:
  Os: centos7
DirectoryService:
  GenerateSshKeysForUsers: true
  DomainName: modulus.ad.com
  DomainAddr: ldaps://modulus.ad.com
  PasswordSecretArn: >-
    arn:aws:secretsmanager:us-east-1:2710136xxxxx:secret:PasswordSecret-modulus-AD-MM2LIh
  DomainReadOnlyUser: cn=ReadOnlyUser,ou=Users,ou=MODULUS,dc=modulus,dc=ad,dc=com
  LdapTlsCaCert: /opt/parallelcluster/shared/directory_service/domain-certificate.crt
  LdapTlsReqCert: hard
SharedStorage:
  - Name: Efs0
    StorageType: Efs
    MountDir: /shared
    EfsSettings:
      FileSystemId: fs-c3exxxxx
and here is the script that I am failing to download/run:
#!/bin/bash
set -e
# Positional arguments come from the Args list under CustomActions in the cluster config
CERTIFICATE_SECRET_ARN="$1"
CERTIFICATE_PATH="$2"
[[ -z "$CERTIFICATE_SECRET_ARN" ]] && echo "[ERROR] Missing CERTIFICATE_SECRET_ARN" && exit 1
[[ -z "$CERTIFICATE_PATH" ]] && echo "[ERROR] Missing CERTIFICATE_PATH" && exit 1
# cfnconfig provides cfn_region; abort if it is unset
source /etc/parallelcluster/cfnconfig
REGION="${cfn_region:?}"
# Create the target directory, then write the certificate read from Secrets Manager
mkdir -p "$(dirname "$CERTIFICATE_PATH")"
aws secretsmanager get-secret-value --region "$REGION" --secret-id "$CERTIFICATE_SECRET_ARN" --query SecretString --output text > "$CERTIFICATE_PATH"
Error message from the cfn-init.log:
fetch_and_run - Failed to download OnNodeConfigured script 1 s3://parallelcluster-10552b48cfa2e9aa-v1-do-not-delete/active-directory.head.post.sh using aws s3, cause: An error occurred (403) when calling the HeadObject operation: Forbidden. Please check /var/log/cfn-init.log in the head node, or check the cfn-init.log in CloudWatch logs. Please refer to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-get-logs for more details on ParallelCluster logs.
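The 403 above comes from a HeadObject call, so one way to narrow this down is to repeat that call by hand from the head node, using the node's own credentials. The following is a minimal sketch, not part of ParallelCluster: the helper names `s3_bucket`, `s3_key`, and `check_script_access` are made up for illustration, and it assumes the AWS CLI is available on the node.

```shell
#!/bin/bash
# Split an s3://bucket/key URI into its bucket and key parts.
s3_bucket() { local rest="${1#s3://}"; printf '%s\n' "${rest%%/*}"; }
s3_key()    { local rest="${1#s3://}"; printf '%s\n' "${rest#*/}"; }

# Hypothetical helper: issues the same HeadObject request that failed in cfn-init.log.
# Run it on the head node so the request uses the instance role's permissions.
check_script_access() {
  local uri="$1"
  aws s3api head-object --bucket "$(s3_bucket "$uri")" --key "$(s3_key "$uri")"
}
```

Calling `check_script_access` with the Script URI from the cluster config should hit the same permission check. If it also fails with a 403, the problem is in the permissions of the node's role rather than in the contents or line endings of the script.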
The head node doesn't have permission to access the S3 bucket. Could you try adding S3Access to the Iam section of the HeadNode to see if that works?
Iam:
  AdditionalIamPolicies:
    - Policy: >-
        arn:aws:iam::2710136xxxxx:policy/DomainCertificateSecretReadPolicy-modulus-AD
  S3Access:
    - BucketName: your_bucket_name
Thank you!
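For anyone who wants to catch this before running cluster creation, here is a rough pre-flight sketch. It is an assumption-laden text check, not a ParallelCluster feature: the function name `missing_s3_access` is made up, it only looks at `Script: s3://...` and `BucketName:` lines, and it does not understand YAML nesting, so it cannot tell whether the S3Access entry belongs to the same node section as the script.

```shell
#!/bin/bash
# Print every bucket that is referenced by a "Script: s3://..." line
# but has no "BucketName:" entry anywhere in the given config file.
missing_s3_access() {
  local config="$1" bucket
  # Collect the distinct bucket names used in s3:// script URIs.
  grep -oE 'Script:[[:space:]]*s3://[^/[:space:]]+' "$config" \
    | sed -E 's#.*s3://##' | sort -u \
    | while read -r bucket; do
        # Compare against the literal values of all BucketName entries.
        grep -E 'BucketName:' "$config" \
          | sed -E 's/.*BucketName:[[:space:]]*//; s/[[:space:]]*$//' \
          | grep -qFx -- "$bucket" \
          || printf '%s\n' "$bucket"
      done
}
```

Running `missing_s3_access cluster-config.yaml` (the file name is a placeholder) prints nothing when every script bucket has a matching entry. Any bucket it prints is a candidate for an S3Access entry like the one suggested above.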
@chenwany thanks so much for catching that missing parameter -- you saved my bacon. Once I added the S3Access setting, the cluster was created successfully.