aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

DataRepositoryAssociations reports success but files don't show up

ses-dan opened this issue · comments

If you have an active AWS support contract, please open a case with AWS Premium Support team using the below documentation to report the issue:
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html

Before submitting a new issue, please search through open GitHub Issues and check out the troubleshooting documentation.

Please make sure to add the following data in order to facilitate the root cause detection.

Required Info:

Bug description and how to reproduce:
We are building a cluster through the config and use DataRepositoryAssociations to set up links to S3 buckets. The creation seems to report no errors, but files do not show up for one of the DRAs.

Currently we have 2 DRAs in the config. For the first one, the Data repository task runs properly and the files are available when logging in. For the second (CFD-setup), the Data repository task reports success, but the folder is empty on the instance. When creating a new Data repository task, it executes successfully and updates the folder content. It therefore seems that the issue is not with the configuration of the DRA but with some process that has not fully completed during setup.
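For anyone trying to reproduce this, one way to see what the tasks report is to count them per lifecycle state. The snippet below is only a sketch: it assumes `fsx_client` is a boto3 FSx client (for example `boto3.client("fsx")`) and that the file system ID is known. The helper name `summarize_task_lifecycles` is hypothetical and not part of ParallelCluster.

```python
from collections import Counter


def summarize_task_lifecycles(fsx_client, file_system_id):
    """Count data repository tasks per lifecycle state for one file system.

    Assumption: fsx_client is a boto3 FSx client, e.g. boto3.client("fsx").
    """
    counts = Counter()
    # Restrict the listing to tasks that belong to the given file system.
    kwargs = {"Filters": [{"Name": "file-system-id", "Values": [file_system_id]}]}
    while True:
        response = fsx_client.describe_data_repository_tasks(**kwargs)
        for task in response.get("DataRepositoryTasks", []):
            counts[task.get("Lifecycle", "UNKNOWN")] += 1
        # Follow pagination until no further token is returned.
        token = response.get("NextToken")
        if not token:
            return dict(counts)
        kwargs["NextToken"] = token
```

A task in state SUCCEEDED here would match what we see in the console. It does not by itself tell us whether the folder content is already visible on the instance.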

Please be sure to attach the following logs:

Additional context:
Any other context about the problem. E.g.:

  • CLI logs: ~/.parallelcluster/pcluster-cli.log
    pcluster-cli.log

  • Custom bootstrap scripts, if any

  • Screenshots, if useful.

After a bit more testing, it seems that the post-install script, which tries to access the linked S3 files, is able to run before the sync is done. With rollback disabled, the script fails at OnNodeConfigured, but when logging in to the node after some time, the folders are synced, and the script finds the resources and runs successfully past the previously failed step. Is this a bug, or is it a user error on our side for not checking and waiting properly?
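As a possible workaround on the user side, the post-install script could wait until the linked folder has content before continuing. The following is a minimal sketch, not a confirmed fix. The function name `wait_for_files`, the timeout values, and the mount path in the usage note are all assumptions for illustration.

```python
import os
import time


def wait_for_files(path, timeout_s=600.0, interval_s=10.0):
    """Return True once `path` is a directory with at least one entry.

    Returns False if that has not happened within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with os.scandir(path) as entries:
                # Any single entry is enough to treat the folder as populated.
                if any(True for _ in entries):
                    return True
        except (FileNotFoundError, NotADirectoryError):
            # The folder may not exist yet while the sync is still pending.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In an OnNodeConfigured script this could be called as `wait_for_files("/fsx/CFD-setup")` (hypothetical mount path), exiting with a non-zero status if it returns False. Note that a non-empty folder does not prove the sync is complete; checking for a specific file that the script needs would be a stricter test.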