aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

(3.3.0-3.9.0) Potential data loss issue when removing storage with update-cluster in AWS ParallelCluster 3.3 and above

dreambeyondorange opened this issue

Bug description

Starting with ParallelCluster 3.3.0, users can add and remove shared storage from a cluster with a pcluster update-cluster operation. We identified a race condition in the storage removal path that can lead to data loss. When unmounting a filesystem, ParallelCluster normally performs a lazy unmount and then cleans up the mount point by deleting the mountdir and all subfolders under it. If that cleanup runs while the filesystem is still mounted, the deletion can reach data stored on the filesystem itself. This could lead to loss of data if no data retention policy is applied to the filesystem.
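
To make the failure mode concrete, here is a minimal sketch of a more defensive cleanup step. It is not ParallelCluster's actual code or its fix; the function name cleanup_mount_point is hypothetical. It assumes two safeguards: refuse to touch a path that is still a mount point, and remove the directory only if it is empty, instead of deleting it recursively.

```python
import os


def cleanup_mount_point(mount_dir: str) -> bool:
    """Remove mount_dir only when that cannot delete user data.

    Hypothetical helper for illustration; returns True if the
    directory was removed, False if it was left in place.
    """
    # If the path is still a mount point, the unmount has not taken
    # effect, so anything deleted here would be filesystem data.
    if os.path.ismount(mount_dir):
        return False
    try:
        # os.rmdir only removes empty directories, unlike a recursive
        # delete, so leftover files are never destroyed.
        os.rmdir(mount_dir)
    except OSError:
        # Directory is missing, not empty, or not removable: leave it.
        return False
    return True
```

With this approach, a directory that still contains files is left untouched and reported back to the caller, which trades an occasional leftover directory for never deleting data.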

Affected versions (OSes, schedulers)

This issue impacts all ParallelCluster versions from 3.3.0 to 3.9.0, across all OSes, schedulers, and shared storage types.
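
As a quick way to check a given installation against that range, the following sketch compares a version string with the affected bounds. The helper names are hypothetical, and it assumes plain numeric MAJOR.MINOR.PATCH version strings with no suffixes.

```python
# Affected range stated in this issue, both ends inclusive.
FIRST_AFFECTED = (3, 3, 0)
LAST_AFFECTED = (3, 9, 0)


def parse_version(version: str) -> tuple:
    """Turn a string such as '3.5.1' into a tuple of ints (3, 5, 1)."""
    return tuple(int(part) for part in version.strip().split("."))


def is_affected(version: str) -> bool:
    """Return True if the version falls within the affected range."""
    return FIRST_AFFECTED <= parse_version(version) <= LAST_AFFECTED


if __name__ == "__main__":
    for candidate in ("3.2.1", "3.3.0", "3.9.0"):
        print(candidate, is_affected(candidate))
```

Tuples are compared element by element, so 3.10.0 is correctly treated as newer than 3.9.0, which a plain string comparison would get wrong.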

Mitigation

You can find a detailed explanation and the mitigation of the problem here.