microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure

Use blobfuse2 for streamable TesInputs

MattMcL4475 opened this issue

Problem:
Customers need to perform random reads on large genomics reference files through a file system without downloading the entire file. Downloading the whole file costs more and puts pressure on the storage account.

Solution:

  • If any TesInput.Streamable is set to true, the TES runner should download and install blobfuse2.
  • It should aggregate the containers referenced by all streamable inputs and mount only the minimum required set with blobfuse2 mount.
  • It should ensure that the path specified in TesInput.path resolves to the mounted file.
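
A rough sketch of how the aggregation step might look, not the runner's actual implementation: collect the blob URLs of all streamable inputs, reduce them to unique account/container pairs, and mount each pair once. `plan_mounts`, `link_input`, the `/mnt/streamed` root, the per-container config file names, and the second and third example URLs are all hypothetical. The symlink is only one assumed way to make TesInput.path resolve; a bind mount would be another.

```bash
#!/bin/bash
# Sketch only: plan the minimal set of blobfuse2 mounts for streamable inputs.
# All function names and paths here are hypothetical, not part of the TES runner.

# plan_mounts MOUNT_ROOT: reads blob URLs on stdin (one per line) and prints
# one "account container mount_dir" line per unique container.
plan_mounts() {
  local mount_root="$1"
  # Drop the scheme and any SAS query string, then split host/container/blob.
  sed -E 's#^https?://##; s#\?.*$##' |
    awk -F/ -v root="$mount_root" 'NF >= 3 {
      split($1, host, ".")   # host[1] is the storage account name
      print host[1], $2, root "/" host[1] "/" $2
    }' |
    sort -u
}

# link_input MOUNTED_FILE TASK_PATH: make the path the task expects
# (TesInput.path) point at the blob inside its mount. Assumes a symlink is enough.
link_input() {
  local mounted_file="$1" task_path="$2"
  mkdir -p "$(dirname "$task_path")"
  ln -sfn "$mounted_file" "$task_path"
}

# Two inputs share a container, so only two mounts are planned for three inputs.
printf '%s\n' \
  "https://mattmcl.blob.core.windows.net/inputs/stLFR.split_read.1.fq.gz" \
  "https://mattmcl.blob.core.windows.net/inputs/stLFR.split_read.2.fq.gz" \
  "https://mattmcl.blob.core.windows.net/refs/genome.fa" |
  plan_mounts /mnt/streamed |
  while read -r account container mount_dir; do
    # Printed, not executed; one config file per container is an assumption.
    echo "blobfuse2 mount $mount_dir --config-file=./$account-$container.yaml"
  done
```

The mount commands are printed rather than run so the plan can be inspected first. After mounting, something like `link_input /mnt/streamed/mattmcl/inputs/stLFR.split_read.1.fq.gz <TesInput.path>` would expose the blob at the path the task expects.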

I confirmed that random reads in blobfuse2 work as expected:
blobfuse2 mount /ref --config-file=./b2.yaml
dd if=stLFR.split_read.1.fq.gz skip=50000000000 bs=1 count=128 iflag=skip_bytes 2>/dev/null | xxd
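
The contents of b2.yaml are not shown here. For anyone reproducing this, a config along the following lines should be close. This is an assumed equivalent, not the file actually used; the key names are as I recall them from the blobfuse2 sample configs and should be checked against the blobfuse2 documentation. `write_example_config` and `SAS_TOKEN` are hypothetical names. A block-based cache component is chosen on the assumption that the file cache would download whole files, which defeats the purpose.

```bash
#!/bin/bash
# Sketch only: write a read-only blobfuse2 config for one container.
# The author's actual b2.yaml is not shown; this is an assumed equivalent.

# write_example_config PATH ACCOUNT CONTAINER SAS_TOKEN
write_example_config() {
  local path="$1" account="$2" container="$3" sas="$4"
  cat > "$path" <<EOF
read-only: true
components:
  - libfuse
  - block_cache
  - attr_cache
  - azstorage
azstorage:
  type: block
  account-name: $account
  container: $container
  mode: sas
  sas: "$sas"
EOF
}

# Usage (SAS_TOKEN is a hypothetical environment variable):
#   write_example_config ./b2.example.yaml mattmcl inputs "$SAS_TOKEN"
#   blobfuse2 mount /ref --config-file=./b2.example.yaml
```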

#!/bin/bash

# Azure Blob URL - NOTE SAS has been removed
blob_url="https://mattmcl.blob.core.windows.net/inputs/stLFR.split_read.1.fq.gz" 

# Byte range to download: Example uses the range from 50000000000 to 50000000127
range_start=50000000000
range_end=50000000127

# Using curl to download the specified byte range
curl -s -o downloaded_bytes.bin -H "Range: bytes=$range_start-$range_end" "$blob_url"
echo "From REST:"
# Display downloaded bytes in hex format for comparison
xxd downloaded_bytes.bin
echo "From blobfuse:"
# Optional: read the same byte range through the blobfuse2 mount with dd for comparison
dd if=/ref/stLFR.split_read.1.fq.gz skip=$range_start bs=1 count=$((range_end - range_start + 1)) iflag=skip_bytes,count_bytes 2>/dev/null | xxd
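
The script above prints both hex dumps for comparison by eye. The check could also be automated. A small sketch, assuming GNU dd and cmp are available; `read_range` and `ranges_match` are hypothetical helper names:

```bash
#!/bin/bash
# Sketch only: compare a byte range read through the mount with the bytes
# downloaded over REST, instead of comparing hex dumps by eye.

# read_range FILE START END: write bytes START..END (inclusive, zero-based)
# of FILE to stdout. Uses the same dd flags as the script above (GNU dd).
read_range() {
  local file="$1" start="$2" end="$3"
  dd if="$file" skip="$start" bs=1 count=$((end - start + 1)) \
    iflag=skip_bytes,count_bytes 2>/dev/null
}

# ranges_match REFERENCE FILE START END: succeed only if REFERENCE holds
# exactly the bytes START..END of FILE.
ranges_match() {
  local reference="$1" file="$2" start="$3" end="$4"
  read_range "$file" "$start" "$end" | cmp -s "$reference" -
}

# Usage, with the files and range from the script above:
#   if ranges_match downloaded_bytes.bin /ref/stLFR.split_read.1.fq.gz 50000000000 50000000127; then
#     echo "blobfuse2 and REST agree"
#   else
#     echo "mismatch"
#   fi
```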