microsoft / CromwellOnAzure

Microsoft Genomics implementation of the Broad Institute's Cromwell workflow engine on Azure

Optimizing File Localization to Avoid Excess Downloads

superbsky opened this issue · comments

Problem:
I am exploring options to use a local file path on the storage account for task execution without the need to localize the input files. I attempted to place the input files into the /cromwell-executions path, which is mounted to the task VM. During execution, I noticed that the task uses a path within /cromwell-executions, but the download script still downloads all my input files for the task.
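For concreteness, by "place the input files" I mean pre-staging them directly in the storage container before submitting the workflow, along these lines (the account name and SAS token are placeholders, and azcopy is just one way to do the copy):

# Pre-stage local inputs into the cromwell-executions container
# (account and SAS are placeholders).
azcopy copy './fastq' \
  'https://<account>.blob.core.windows.net/cromwell-executions?<SAS>' \
  --recursive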

Solution:
Looking at BatchScheduler.cs, it appears to collect all input files, including additionalInputFiles, for download, even when a local path is available.

Describe alternatives you've considered
Please advise if it is possible to use the "streamable" or "localization_optional" flags for the input files to avoid excessive file downloading. I have seen discussions in the TES repository, but I'm unsure if CoA currently supports these flags.
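For context, this is the shape of the hint as Cromwell accepts it today (to my understanding, honored only by the Google backend, which is exactly what I'm asking about for CoA/TES); the task is a minimal sketch and the names are illustrative:

version 1.0

task peek_input {
  input {
    File fastq
  }
  parameter_meta {
    # Hint asking the backend to skip localizing this input; today
    # only Cromwell's Google (PAPI) backend honors it.
    fastq: { localization_optional: true }
  }
  command <<<
    # If localization were skipped, ~{fastq} would interpolate to the
    # remote URL rather than a local path, and the tool itself would
    # have to know how to read from that URL.
    echo "input resolved to: ~{fastq}"
  >>>
  runtime {
    docker: "ubuntu:20.04"
  }
}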

Additional context
In general, the goal is to use an Azure Storage account as a mount for the input files and avoid unnecessary file localization. I noticed that Cromwell recently added support for the Blob filesystem, but I am uncertain whether it would help resolve this issue.

Problem/Solution
Our TES does not run any tasks on the same machine Cromwell runs on, so the mounts available to Cromwell are not available to the tasks.

Our TES implementation does not mount entire storage containers on the compute nodes running the tasks (considered a security risk when TES is used by shared groups, one of our primary use cases), nor subpaths of containers (which would currently require installing drivers on every compute node). Further, the TES spec doesn't seem to have the concept of an execution directory that persists beyond task completion (for CoA, that concept comes from Cromwell), so supporting mounted storage would have to be a configurable opt-in (probably in the deployment configuration).

Alternatives
The download list collected by BatchScheduler.cs uses the path inside the executor docker container as its definition of "local path", so the download would be required regardless (without implementing mounting). Cromwell doesn't currently provide localization_optional to any backend other than GCE. If Cromwell were to support localization_optional on TES, it would do so by setting streamable, and we would have to implement support for that by skipping the download (indicating that the task knows how to, and will, access the content from the URL itself, which it would need to know, as described here).
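For reference, TES 1.1 carries that hint on the task's input object; a minimal fragment (the URL and path are illustrative) looks like the following, and an implementation honoring it would skip the download for that input:

{
  "inputs": [
    {
      "url": "https://<account>.blob.core.windows.net/inputs/sample.fastq.gz",
      "path": "/cromwell-executions/inputs/sample.fastq.gz",
      "type": "FILE",
      "streamable": true
    }
  ]
}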

Thank you for the clarification!

I noticed that the dockerRoot value is set in the cromwell-application.conf and assumed that it is actually mounted to the task VM.

backend {
  default = "TES"
  providers {
    TES {
      actor-factory = "cromwell.backend.impl.tes.TesBackendLifecycleActorFactory"
      config {
        filesystems {
          http { }
        }
        root = "/cromwell-executions"
        dockerRoot = "/cromwell-executions"
        ...

Also, I can see that the logs from my task during execution show paths under the /cromwell-executions directory where I placed my input files.

2023-06-04 18:25:24,576 INFO - TesAsyncBackendJobExecutionActor [UUID(29b94176)ExomeGermlineSingleSample.PairedFastQsToUnmappedBAM:NA:1]: `/gatk/gatk --java-options "-Xmx19g"
FastqToSam
--FASTQ /cromwell-executions/fastq/R1_001.fastq.gz
--FASTQ2 /cromwell-executions/fastq/R2_001.fastq.gz
...

These files were still downloaded from their URLs into the input directory, even though I assumed they were already available there via the mount.

total_bytes=0 && \
  echo $(date +%T) && \
  path='/cromwell-executions/fastq/R2_001.fastq.gz' && \
  url='https://coa.blob.core.windows.net/cromwell-executions/fastq/R2_001.fastq.gz?sv=SAS' && \
  blobxfer download \
    --storage-url "$url" \
    --local-path "$path" \
    --chunk-size-bytes 104857600 \
    --rename \
    --include 'fastq/R2_001.fastq.gz'
...

So, what would be the best approach for me to reduce the number of files being copied from the Storage Account to the VM executing tasks? The only solution I can think of is to combine the execution of tasks that use the same or similar inputs/outputs. However, this approach is labor-intensive and prone to errors.

That container is currently mounted to the VM Cromwell runs on, so Cromwell sees it as a local file system. This means that, if you pre-stage your inputs as you describe, everything Cromwell accesses directly requires no uploading or downloading until the entire workflow is complete and you collect your results. However, tasks that run through the backend (rather than inside Cromwell itself) still involve downloading and uploading today, which is also why intermediate files created during the tasks aren't found in the /cromwell-executions container.

We can certainly look into mounting (something I've personally wanted to test to see how it would affect overall costs in terms of both time and spend) but I'm not certain when we could start on it.

Thanks again. I'm looking forward to hearing more about this because the downloading/uploading process is taking almost the same amount of time as the actual computation itself.

I agree that combining tasks would not be the best idea. The VMs we use for compute nodes do appear to have blobfuse installed, but the tasks run in a container that is not given access to any of the FUSE drivers, so at this point you're stuck with the file transfers.

You might try selecting vm_sizes that have faster networking for your tasks.
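As a sketch of what I mean, assuming your Cromwell version passes unrecognized runtime attributes through to TES as backend parameters and that your CoA deployment honors a vm_size parameter (both are things to verify against your versions' docs, and the size below is only an example):

version 1.0

task fast_network_example {
  command <<<
    echo "placeholder workload"
  >>>
  runtime {
    docker: "ubuntu:20.04"
    # Assumption to verify: vm_size passed through as a TES backend
    # parameter, naming an Azure size with accelerated networking.
    vm_size: "Standard_D8ds_v4"
  }
}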

Is it possible to specify the mount configuration for the pools in the config file, specifically in src/deploy-cromwell-on-azure/scripts/env-04-settings.txt? That would save a lot of trouble. Additionally, the mount could be added to the Docker job image to streamline the process further, or simply mounted at $AZ_BATCH_TASK_WORKING_DIR/wd.

Right now there's no provision for the compute nodes to mount anything, unless that is done inside of the tasks themselves. I'm looking at what a solution might look like, but yes, some combination of configuration and/or task backend parameters will be involved.
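To be concrete about what "configuration" could mean, purely as a hypothetical sketch, an opt-in might be a couple of entries in that settings file; neither key below exists today:

# Hypothetical additions to env-04-settings.txt; nothing here is
# implemented yet.
MountStorageContainersOnComputeNodes=true
StorageContainersToMount=cromwell-executions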

We recently added the TES 1.1 "streamable" flag (which is what Cromwell's "localization_optional" flag would map to), but Cromwell hasn't implemented support for it in its TES backend, so it still won't prevent the downloads of your inputs. Ultimately, you will be waiting on them to add that support to their TES backend implementation. All we can do here is facilitate mounting your specified container.

Note that Azure (which supplies the blobfuse file system driver that Cromwell currently uses) recommends NOT sharing that blob container (not the file system per se, but the entire blob container) with any other agent (such as the tasks that run through CoA/TES) that makes changes to those blobs. I would recommend waiting for #694 (or something similar) to be implemented before moving forward with any effort to mount file systems on the compute nodes.