aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

FSx creation failure due to FSx Security Group creation race condition?

stefan-maxar opened this issue

  • AWS ParallelCluster version: 3.8.0

Bug description and how to reproduce:
During cluster creation, we are noticing an occasional issue where the FSx Lustre filesystem fails to create with the following error:
The file system cannot be created because the default security group in the subnet provided or the provided security groups do not permit Lustre LNET network traffic on port 988 (Service: AmazonFSx; Status Code: 400; Error Code: InvalidNetworkSettings; Request ID: 0949dbfc-5e0a-4f03-bf89-d7e74540f8ec; Proxy: null)

Upon examination of the CloudFormation stack and events (see the attached image), it looks like the FSx security group ingress/egress rules may have finished creating about 1 second too late, so the FSx filesystem creation failed with the error above. Is there any way to mitigate this potential race condition? We noticed this PR that might help, but we are not sure of its progress: #5851

Thanks for any insight you can provide! CC: @ccassidy-maxar

[Image: cloudformation_snapshot]
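
For anyone who wants to check their own stack for the same ordering, here is a minimal sketch in Python. It assumes the events have already been fetched as a list of dicts shaped like the `StackEvents` entries returned by boto3's `describe_stack_events`, and that the cluster stack uses the `AWS::FSx::FileSystem`, `AWS::EC2::SecurityGroupIngress`, and `AWS::EC2::SecurityGroupEgress` resource types. The function name `find_late_sg_rules` is hypothetical, and the timestamp comparison is only a heuristic, not a confirmed diagnosis.

```python
from datetime import datetime, timedelta

# Assumed CloudFormation resource types; adjust if your stack differs.
FSX_TYPE = "AWS::FSx::FileSystem"
SG_RULE_TYPES = {
    "AWS::EC2::SecurityGroupIngress",
    "AWS::EC2::SecurityGroupEgress",
}


def find_late_sg_rules(events):
    """Return (failure_event, late_rule_events) for a list of stack events.

    failure_event is the FSx CREATE_FAILED event whose reason mentions
    InvalidNetworkSettings, or None if there is no such event.
    late_rule_events are security group rule events that reached
    CREATE_COMPLETE after the failed FSx resource first appeared in the
    event list, which hints at (but does not prove) a race.
    """
    failure = None
    for event in events:
        if (
            event.get("ResourceType") == FSX_TYPE
            and event.get("ResourceStatus") == "CREATE_FAILED"
            and "InvalidNetworkSettings" in event.get("ResourceStatusReason", "")
        ):
            failure = event
            break
    if failure is None:
        return None, []

    # Earliest event for the same logical resource marks the creation start.
    fsx_start = min(
        e["Timestamp"]
        for e in events
        if e.get("LogicalResourceId") == failure["LogicalResourceId"]
    )
    late = [
        e
        for e in events
        if e.get("ResourceType") in SG_RULE_TYPES
        and e.get("ResourceStatus") == "CREATE_COMPLETE"
        and e["Timestamp"] > fsx_start
    ]
    return failure, late
```

If `late` is non-empty, at least one rule finished after FSx creation had started, which would be consistent with the ordering seen in the screenshot.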

Hi Stefan,

Thank you for letting us know about the problem. I've created a bug fix task in our backlog.
Are you currently blocked by this? If yes, we can try to find a workaround for you.

Thank you!
Hanwen

Hey Hanwen,

Not a blocker, given that the occurrence rate is quite low. But some sort of workaround or mitigation tactic would be good to know, just in case it becomes more prevalent.
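
Since the failure is intermittent, one possible stopgap is to retry cluster creation only when the failure reason matches this specific error. The sketch below is not an official ParallelCluster workaround. `create_cluster` and `cleanup` are hypothetical callables that you would supply yourself (for example, wrappers around whatever tooling you use to create and delete the cluster), and the matching on `InvalidNetworkSettings` and port `988` is based only on the error text quoted in this issue.

```python
import time


def is_lnet_race(reason):
    """Heuristic match on the error text reported in this issue."""
    return "InvalidNetworkSettings" in reason and "988" in reason


def create_with_retry(create_cluster, cleanup, max_attempts=3,
                      delay_seconds=60.0, sleep=time.sleep):
    """Retry cluster creation when it fails with the FSx LNET error.

    create_cluster: hypothetical callable returning (ok, failure_reason).
    cleanup: hypothetical callable that removes the failed cluster so the
        next attempt starts from a clean state.
    Returns the attempt number that succeeded. Raises RuntimeError for any
    other failure, or once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        ok, reason = create_cluster()
        if ok:
            return attempt
        # Do not mask unrelated failures, and stop after the last attempt.
        if not is_lnet_race(reason) or attempt == max_attempts:
            raise RuntimeError(
                f"cluster creation failed on attempt {attempt}: {reason}"
            )
        cleanup()
        sleep(delay_seconds)
```

Injecting `sleep` keeps the function easy to exercise without real delays. Restricting retries to the one known error string means unrelated configuration mistakes still fail fast.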