aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page:https://github.com/aws/aws-parallelcluster


FSX Timeout when changing permission after mounting

samagids opened this issue

During the build, the cluster head node mounts the /scratch FSx volume but hangs while changing the permissions from the original value to 0777 and the ownership to root. It times out after 600 seconds, causing the cluster creation to fail. I increased HeadNodeBootstrapTimeout, but that did not help. It looks like there is a hard-coded FSx mount timeout of 600 seconds. Our volume is 6.8T, with 2.9T used and 3.9T available (44% use), mounted at /scratch.

Hi @samagids, sorry for the delay. I hope you were able to fix the issue in the meantime.

I looked at the attached logs, but I cannot see any error related to mounting the FSx volume.
The error in the log-events-viewer-results is about a bootstrap issue of the head node:

1714755511740,"2024-05-03 16:58:31,740 [ERROR] Command chef (cinc-client --local-mode --config /etc/chef/client.rb --log_level info --logfile /var/log/chef-client.log --force-formatter --no-color --chef-zero-port 8889 --json-attributes /etc/chef/dna.json --override-runlist aws-parallelcluster-entrypoints::config) failed"
1714755511740,"2024-05-03 16:58:31,740 [DEBUG] Command chef output: "
1714755511740,"2024-05-03 16:58:31,740 [ERROR] Error encountered during build of chefConfig: Command chef failed"
1714755511740,"Traceback (most recent call last):
  File ""/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py"", line 579, in run_config
    CloudFormationCarpenter(config, self._auth_config, self.strict_mode).build(worklog)
  File ""/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py"", line 278, in build
    self._config.commands)
  File ""/usr/lib/python3.7/site-packages/cfnbootstrap/command_tool.py"", line 127, in apply
    raise ToolError(u""Command %s failed"" % name)"
1714755511740,cfnbootstrap.construction_errors.ToolError: Command chef failed
1714755511866,"2024-05-03 16:58:31,866 [ERROR] -----------------------BUILD FAILED!------------------------"

In this step the cinc-client is executing some recipes to configure the head node and it's failing. It may be because of the mentioned FSx mount issue, but to confirm this we need to look at the /var/log/chef-client.log of the HeadNode.
See troubleshooting guide.
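To narrow down where a bootstrap failed, it can help to pull only the error lines out of an exported log. Below is a minimal sketch, assuming the export has the same two-column CSV shape as the excerpt above (epoch milliseconds, then the quoted message). The function name `error_events` and the `sample` text are hypothetical, introduced only for illustration.

```python
import csv
import io
from datetime import datetime, timezone


def error_events(csv_text):
    """Return (utc_time, first_line_of_message) for rows whose message contains [ERROR]."""
    events = []
    # csv.reader handles quoted messages that span several lines (e.g. tracebacks).
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[0].isdigit():
            continue  # skip headers or rows that do not start with an epoch timestamp
        epoch_ms, message = int(row[0]), row[1]
        if "[ERROR]" in message:
            when = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
            events.append((when, message.splitlines()[0]))
    return events


# Sample rows copied from the log excerpt in this thread.
sample = '''1714755511740,"2024-05-03 16:58:31,740 [ERROR] Error encountered during build of chefConfig: Command chef failed"
1714755511740,"2024-05-03 16:58:31,740 [DEBUG] Command chef output: "
1714755511866,"2024-05-03 16:58:31,866 [ERROR] -----------------------BUILD FAILED!------------------------"
'''

for when, line in error_events(sample):
    print(when.isoformat(), line)
```

This only surfaces the cfn-init level errors. The underlying recipe failure still has to be read from /var/log/chef-client.log on the head node.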

I also see that you have OnNodeStart and OnNodeConfigured scripts, as well as Active Directory integration.
Can you try creating the cluster without the custom bootstrap scripts or without the AD integration? It would help to verify whether the mount timeout still occurs without them.


I removed the attached files from the GitHub issue because in the logs there were some details about your subnets, vpcs, policies, AD settings, etc.

If you have an active AWS support contract, please open a case with the AWS Premium Support team to report the issue, following the documentation at
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html, and attach the log files there.

Enrico

I saw you're passing an existing FileSystemId for FSx.

An important thing to check is the security groups, to ensure the nodes are able to mount the file system. As stated in the documentation:

Make sure that traffic is allowed between the cluster and file system by doing one of the following:

Configure the security groups of the file system to allow the traffic to and from the CIDR or prefix list of cluster subnets.

Note
AWS ParallelCluster validates that ports are open and that the CIDR or prefix list is configured. AWS ParallelCluster doesn't validate the content of CIDR block or prefix list.

Set custom security groups for cluster nodes by using SlurmQueues / Networking / SecurityGroups and HeadNode / Networking / SecurityGroups. The custom security groups must be configured to allow traffic between the cluster and the file system.

Note
If all cluster nodes use custom security groups, AWS ParallelCluster only validates that the ports are open. AWS ParallelCluster doesn't validate that the source and destination are properly configured.

BTW, I'd suggest using AdditionalSecurityGroups rather than SecurityGroups. The former adds your security groups to the SGs created by ParallelCluster; the latter replaces all of ParallelCluster's security groups, so use it carefully and make sure all the required communication is enabled.
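As a rough way to check the security group point above, the sketch below tests whether a set of ingress rules admits TCP traffic from a cluster subnet. It assumes the rules are shaped like the `IpPermissions` entries returned by EC2 DescribeSecurityGroups, and it assumes the file system is FSx for Lustre (TCP ports 988 and 1018-1023), which this thread does not confirm. The function name `allows_tcp`, the sample rules, and the CIDRs are hypothetical. Rules that reference other security groups or prefix lists are ignored here.

```python
import ipaddress


def allows_tcp(ip_permissions, source_cidr, port):
    """True if any ingress rule admits TCP traffic on `port` from all of `source_cidr`."""
    source = ipaddress.ip_network(source_cidr)
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            port_ok = rule["FromPort"] <= port <= rule["ToPort"]
        else:
            port_ok = False
        if not port_ok:
            continue
        # Only CIDR-based sources are checked in this sketch.
        for ip_range in rule.get("IpRanges", []):
            if source.subnet_of(ipaddress.ip_network(ip_range["CidrIp"])):
                return True
    return False


# Assumption: FSx for Lustre, which uses TCP 988 and 1018-1023.
LUSTRE_PORTS = [988] + list(range(1018, 1024))

# Hypothetical ingress rules on the file system's security group.
rules = [
    {"IpProtocol": "tcp", "FromPort": 988, "ToPort": 988,
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
]

# Hypothetical cluster subnet CIDR.
missing = [p for p in LUSTRE_PORTS if not allows_tcp(rules, "10.0.1.0/24", p)]
print("ports not reachable from the cluster subnet:", missing)
```

The same check would need to be run in the other direction, against the security groups attached to the cluster nodes, since traffic must be allowed both to and from the file system.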


Anyway, from the chef-init logs we should be able to identify where the bootstrap is blocked.