Large BAM files fail to copy into the output directory
simalicrum opened this issue · comments
Describe the bug
Running the workflow here on a >100GB dataset: https://mondrian-scwgs.github.io/mondrian/#/
Workflow steps appear to complete successfully until the end of the last workflow task, where a 77GB output BAM should be copied into the output directory to complete the run. All other output files are copied into the output directory.
The output BAM is built successfully but is never copied out of the /cromwell-executions/ container.
The trigger file is moved into the 'failed' directory with the error "CromwellFailed" in the JSON output.
Steps to Reproduce
Run the workflow here on a large dataset: https://mondrian-scwgs.github.io/mondrian/#/ so that a large output file is produced in the /cromwell-executions/ workflow directory.
Expected behavior
Large output files should be copied into the specified output location.
Deployment details
Cromwell on Azure 4.5 deployment with no changes to configuration.
Screenshots
Drilled down into AKS workload container 'cromwell' and found the following Exceptions:
Additional context
Workflow runs with smaller output files using the identical workflow files complete successfully.
This issue is mitigated by ensuring the Temp disks on the AKS nodes are big enough to accommodate the output file during the final copy from the cromwell-executions container into the output container. I also changed the size of the various PVCs on the cluster in case that had some kind of impact.
I'm 99% sure this is caused by the CSI driver for AKS. There is some kind of staging that happens on the node Temp disk during the copy. It looks to me like the entire file is copied onto the Temp disk before being written to the output directory.
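As a rough illustration of the mitigation described above, here is a minimal Python sketch that checks whether a staging disk has room for a full copy of an output file. It assumes the whole file is staged before being written to the output container, which is only suspected in this issue and not confirmed. The function names and the 10% headroom are hypothetical and are not part of Cromwell on Azure or the CSI driver.

```python
import os
import shutil


def fits_in_staging(file_size: int, free_bytes: int, headroom: float = 1.1) -> bool:
    """Return True if free_bytes can hold a full copy of the file plus headroom.

    headroom is a hypothetical safety factor (10% by default), not a value
    taken from Cromwell on Azure or the AKS CSI driver.
    """
    return free_bytes >= file_size * headroom


def staging_has_room(output_file: str, staging_dir: str, headroom: float = 1.1) -> bool:
    """Compare the size of output_file with the free space on the disk that
    holds staging_dir (for example, wherever the node Temp disk is mounted)."""
    file_size = os.path.getsize(output_file)
    free_bytes = shutil.disk_usage(staging_dir).free
    return fits_in_staging(file_size, free_bytes, headroom)
```

For example, `fits_in_staging(77 * 10**9, 64 * 10**9)` returns `False` for the 77GB BAM from this report against a hypothetical 64GB Temp disk, while the same file against a hypothetical 128GB disk returns `True`. `staging_has_room` would have to be run on the node itself, with paths that match the actual mount points.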