aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page:https://github.com/aws/aws-parallelcluster

pcluster create fails with tag policy

afluegel9 opened this issue · comments

Required Info:

  • AWS ParallelCluster version [e.g. 3.1.1]: 3.7.2

  • Full cluster configuration without any credentials or personal data.
    pcluster-conf.txt

  • Cluster name: innolab-dev

  • Output of pcluster describe-cluster command. Cluster has not been created.

  • This is the output of the command pcluster create-cluster:
    pcluster-create-bla.txt

  • [Optional] Arn of the cluster CloudFormation main stack: has not been created

Bug description and how to reproduce:
Tag policies are in effect. As far as we could determine, the problem occurs when the volume for the head node is created. ParallelCluster appears to create the volume without tags first and add the tags later. This cannot work with tag policies in effect (tag policies are a recommended best practice). The volume creation fails because the enforced tags are missing, even though they are specified in the Tags section of the cluster configuration file.

If you are reporting issues about cluster creation failure or node failure:
If the cluster fails creation, please re-execute create-cluster action using --rollback-on-failure false option.
This does not change anything.

No further logs are created; the failure occurs too early in the process.

  • Additional info: With tag enforcement disabled for volumes, the cluster is created as expected.

Hello,
Thank you for reaching out!
The attached logs indicate that the cluster creation fails due to a configuration validation error:

You are not authorized to perform this operation. User: arn:aws:sts::1234567890:assumed-role/AWSReservedSSO_mysso/me@example.org is not authorized to perform: ec2:RunInstances on resource: arn:aws:ec2:eu-central-1:263595276031:volume/* with an explicit deny in a service control policy. Encoded authorization failure message...

This is a validation failure from a dry run of the run-instances call, and it seems related to the service control policy.

In ParallelCluster, I believe the volumes are created with the tags attached. The tags are specified in the launch template as tag specifications.
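To illustrate what "tag specifications" means here, the following is a minimal sketch of the `TagSpecifications` structure that the EC2 API accepts, so that an instance and its volumes carry tags from the moment they are created. It is not taken from ParallelCluster's launch template; the function name and the tag keys are hypothetical.

```python
from typing import Dict, List


def build_tag_specifications(tags: Dict[str, str]) -> List[dict]:
    """Return EC2 TagSpecifications that tag the instance and its volumes at creation."""
    tag_list = [{"Key": key, "Value": value} for key, value in tags.items()]
    return [
        {"ResourceType": resource_type, "Tags": tag_list}
        for resource_type in ("instance", "volume")
    ]


if __name__ == "__main__":
    # Hypothetical tags; a real tag policy defines its own required keys.
    print(build_tag_specifications({"CostCenter": "innolab", "Environment": "dev"}))
```

A request that includes a "volume" entry like this asks EC2 to tag the volume as part of the same call, rather than in a separate tagging step afterwards.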

Could you try suppressing the validators to see if that works, by running the create-cluster command with:

--suppress-validators type:ComputeResourceLaunchTemplateValidator type:HeadNodeLaunchTemplateValidator
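For reference, a full invocation with these flags might be assembled as in the sketch below. The cluster name comes from this issue; the configuration file path is an assumption (the attachment name is reused as a placeholder), and the helper function is hypothetical.

```python
import shlex
from typing import List


def build_create_cluster_command(
    cluster_name: str, config_path: str, suppressed_validators: List[str]
) -> List[str]:
    """Build the argument list for `pcluster create-cluster` with suppressed validators."""
    command = [
        "pcluster", "create-cluster",
        "--cluster-name", cluster_name,
        "--cluster-configuration", config_path,
    ]
    if suppressed_validators:
        command.append("--suppress-validators")
        command.extend(f"type:{name}" for name in suppressed_validators)
    return command


if __name__ == "__main__":
    command = build_create_cluster_command(
        "innolab-dev",
        "pcluster-conf.txt",  # assumed path to the cluster configuration file
        ["ComputeResourceLaunchTemplateValidator", "HeadNodeLaunchTemplateValidator"],
    )
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(shlex.join(command))
```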

Thanks

Thank you very much for the hint.
Indeed, with these validations suppressed the cluster can be created.

Note: You wrote about a dry-run failure. The dry run does not look that dry: we have seen a volume actually being created without the specified tags. So I wonder whether suppressing the validation is really just a workaround.

Hi @afluegel9

Behind the scenes, ComputeResourceLaunchTemplateValidator and HeadNodeLaunchTemplateValidator call the AWS run-instances API with DryRun=true through the Boto3 library.

In your case the validation failed because the run-instances call only simulates an instance launch to check that the parameters (subnets, instance types, etc.) are correct, and this "simulation" was probably missing some parameters required by your policies.

When you suppress these validators, they are not called at all. The run-instances call is not performed, which is why your cluster creation works.

The fact that you saw a volume actually being created sounds strange to me. I'd expect this to correspond to a create-volume call with --dry-run, which, as stated in the documentation:

Checks whether you have the required permissions for the action, without actually making the request, and provides an error response. If you have the required permissions, the error response is DryRunOperation. Otherwise, it is UnauthorizedOperation.
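As a rough sketch of the dry-run behavior described above, the snippet below issues a run-instances call with DryRun=True through Boto3 and interprets the resulting error code. It is not ParallelCluster's validator code; the function names and the launch template ID parameter are hypothetical, and it assumes Boto3 is installed and credentials are configured.

```python
def interpret_dry_run_error(error_code: str) -> str:
    """Map the error code returned by an EC2 dry run to a simple outcome."""
    if error_code == "DryRunOperation":
        return "authorized"  # the real request would have been permitted
    if error_code == "UnauthorizedOperation":
        return "denied"  # an IAM policy or service control policy blocks the request
    return "other"  # e.g. an invalid parameter


def dry_run_launch(launch_template_id: str, region: str) -> str:
    """Simulate launching one instance from a launch template without creating it."""
    # Imported lazily so the helper above can be used without Boto3 installed.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name=region)
    try:
        ec2.run_instances(
            DryRun=True,
            LaunchTemplate={"LaunchTemplateId": launch_template_id},
            MinCount=1,
            MaxCount=1,
        )
    except ClientError as error:
        return interpret_dry_run_error(error.response["Error"]["Code"])
    # A dry run is expected to always raise; reaching this point is unexpected.
    return "other"
```

With a policy that denies the launch, as in the error quoted earlier in this thread, a call such as `dry_run_launch(template_id, "eu-central-1")` would be expected to return "denied".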

Anyway great to know that you've been unblocked.

Thanks for sharing

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.