nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement

Home Page: https://clades.nextstrain.org


Feature Request: Dataset download all datasets within specified path

ammaraziz opened this issue · comments

Hi Nextclade Folks,

I'd like to run nextclade on all influenza genes for many samples. The first step is to run nextclade dataset get for each of the required datasets:

nextclade dataset get --name "nextstrain/flu/h3n2/pa" --output-dir outputdir/pa
nextclade dataset get --name "nextstrain/flu/h3n2/mp" --output-dir outputdir/mp
...

I would need to enter/code eight separate nextclade dataset get commands.

It would be amazing if I could just simply do:

nextclade dataset get --name "nextstrain/flu/h3n2/*" --output-dir outputdir/

Note the * to stop accidentally downloading all the datasets.

There are multiple ha datasets which are downloaded - this is okay in my opinion but not ideal. In such cases maybe download the flu_h3n2_ha_broad?

An alternative solution would be to create a new shortcut that downloads a set of predefined datasets.

Feel free to close this issue if you think it's frivolous.

We have been thinking about how to simplify making multiple runs with a single command, for example for different flu segments. And perhaps even combining results for different segments and processing them further. But we haven't figured anything out quite yet. Is this your use case?

In your proposal, how would you access the downloaded datasets afterwards? There is no way of knowing what's being downloaded, where to, or how many. And you'd still need to copy-paste nextclade run 8 times. Also, datasets come and go: today there might be 8, tomorrow there are 32.

Note the * to stop accidentally downloading all the datasets.

I did not understand how * should stop downloading all datasets. Usually in paths, * denotes a so-called wildcard, that is, any path under that particular path. Does nextclade currently download all datasets if you omit the *? (If so, it's a bug!) Can you clarify?

There are multiple ha datasets which are downloaded - this is okay in my opinion but not ideal. In such cases maybe download the flu_h3n2_ha_broad?

How do we pick flu_h3n2_ha_broad among others? And what if it's not a flu/*, but sc2/*? Nextclade features should make sense for all viruses (even the ones that haven't been added yet).


While wildcard downloads are not supported yet, in the meantime there are a couple of other approaches you might consider:

  • If you don't use the dataset files outside of nextclade run, you can avoid a separate dataset download entirely. The nextclade run command accepts a --dataset-name argument, which makes it download the dataset in memory (without writing to disk) and run with it immediately. This way you don't need dataset get calls at all:

    nextclade run --dataset-name="nextstrain/flu/h3n2/pa" --output-dir="results/" my.fasta.gz
  • If you insist on having dataset files on disk (perhaps you use them in your processing after nextclade), you can use a loop to avoid repetition. Here are a few examples in bash (but you can also set up Snakemake or another workflow framework to run multiple things in a loop, according to a set of parameters):

    $ for v in pa mp; do nextclade dataset get --name="nextstrain/flu/h3n2/$v" --output-dir="outputdir/$v"; done
  • Instead of hardcoding dataset names, you can use dataset list with the --search argument to find datasets using sub-string matching. This is probably the closest thing to the wildcard * syntax you've requested:

    $ nextclade dataset list --only-names --search=flu/h3
    nextstrain/flu/h3n2/ha/CY163680
    nextstrain/flu/h3n2/ha/EPI1857216
    nextstrain/flu/h3n2/na/EPI1857215
    nextstrain/flu/h3n2/pb1
    nextstrain/flu/h3n2/np
    nextstrain/flu/h3n2/ns
    nextstrain/flu/h3n2/mp
    nextstrain/flu/h3n2/pa
    nextstrain/flu/h3n2/pb2

    Then you can feed this list into a loop instead of a hardcoded list:

    for v in $( nextclade dataset list --only-names --search=flu/h3 ); do
      nextclade dataset get --name="$v" --output-dir="outputdir/$v";
    done

    Nothing stops you from plugging your entire processing into this loop - this way you always know what is being downloaded and where:

    for v in $( nextclade dataset list --only-names --search=flu/h3 ); do
      nextclade dataset get --name="$v" --output-dir="outputdir/$v";
      nextclade run --input-dataset="outputdir/$v" --output-dir="results/$v" "my_$v.fasta.gz";
      my_script.py --virus="$v" --nextclade-tsv="results/$v/nextclade.tsv";
    done

    You can use GNU Parallel to run different datasets concurrently (workflow frameworks also often have a way to do this automatically):

    function run_one() {
      v=$1
      nextclade dataset get --name="$v" --output-dir="outputdir/$v";
      nextclade run --input-dataset="outputdir/$v" --output-dir="results/$v" "my_$v.fasta.gz";
      my_script.py --virus="$v" --nextclade-tsv="results/$v/nextclade.tsv";
    }
    export -f run_one
    
    parallel --jobs=4 run_one ::: $( nextclade dataset list --only-names --search=flu/h3 )

These approaches allow you to avoid code duplication, and you always know what's being downloaded and where - exact names and paths. Loops, of course, complicate things a bit; that's a downside. And, as a sanity check, you likely want to verify that you've got all the datasets you want, to avoid omissions.
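For that sanity check, a small shell sketch can compare the segments you expect against the listing. This is illustrative only: the here-doc stands in for real nextclade dataset list output, and the segment names are examples.

```shell
# Sanity check sketch: verify every expected flu segment appears in the
# dataset listing before looping over it. The here-doc is illustrative;
# in real use you would instead do:
#   got=$(nextclade dataset list --only-names --search=flu/h3)
got=$(cat <<'EOF'
nextstrain/flu/h3n2/mp
nextstrain/flu/h3n2/pa
nextstrain/flu/h3n2/pb2
EOF
)

missing=""
for seg in pa mp np ns; do
  # match the segment as the final path component
  printf '%s\n' "$got" | grep -q "/$seg$" || missing="$missing $seg"
done

if [ -n "$missing" ]; then
  echo "Missing datasets:$missing"
else
  echo "All expected datasets found"
fi
```

With the mock listing above, this prints "Missing datasets: np ns", since those two segments are absent.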

If you need more control, you can filter datasets additionally by piping the list into grep or into a script:

$ nextclade dataset list --only-names --search=flu/ | grep -E '(mp|pa)' | sort
nextstrain/flu/h1n1pdm/mp
nextstrain/flu/h1n1pdm/pa
nextstrain/flu/h3n2/mp
nextstrain/flu/h3n2/pa

$ nextclade dataset list --only-names --search=flu/h3 | my_filter.py

and then feed the resulting list into the loop.
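If grep's substring match is too loose (a pattern like pa could, in principle, match other parts of a path), an awk one-liner can filter on the exact final path component instead. A sketch, with the input mocked via printf (in real use it would come from nextclade dataset list --only-names):

```shell
# Keep only datasets whose last path component is exactly "mp" or "pa".
# The printf lines mock the output of:
#   nextclade dataset list --only-names --search=flu/
printf '%s\n' \
  nextstrain/flu/h1n1pdm/mp \
  nextstrain/flu/h1n1pdm/pa \
  nextstrain/flu/h3n2/ha/CY163680 \
  nextstrain/flu/h3n2/mp \
  nextstrain/flu/h3n2/pa \
  | awk -F/ '$NF == "mp" || $NF == "pa"'
```

This keeps four of the five mock lines, dropping the ha dataset.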

For even more control, you can also add the --json flag to the dataset list command. This will print your search results in JSON format, which you can later feed to jq or to a script. This way you can also implement your own search/filtering - by dumping the JSON of all datasets, choosing a subset, and then downloading only that subset. Contrived example:

$ nextclade dataset list --json --search=flu/ | jq -r '.[] | select(.attributes.segment == "pa" and .attributes["reference name"] == "A/NewYork/392/2004") | .path'
nextstrain/flu/h3n2/pa

We have been thinking about how to simplify making multiple runs with a single command, for example for different flu segments. And perhaps even combining results for different segments and processing them further. But we haven't figured anything out quite yet. Is this your use case?

Yes, it's very close to the use case!

In your proposal, how would you access the downloaded datasets afterwards? There is no way of knowing what's being downloaded, where to, or how many. And you'd still need to copy-paste nextclade run 8 times. Also, datasets come and go: today there might be 8, tomorrow there are 32.

I had not considered this; it puts a big red ! on my request. But as you hinted above, the ability to make multiple runs with a single command falls within this idea.

I did not understand how * should stop downloading all datasets. Usually in paths, * denotes a so-called wildcard, that is, any path under that particular path. Does nextclade currently download all datasets if you omit the *? (If so, it's a bug!) Can you clarify?

To stop someone accidentally entering nextstrain/flu/, which would download all datasets for all species of flu (but, as you said, this doesn't generalise to other species supported by nextclade).

While wildcard downloads are not supported yet, in the meantime there are a couple of other approaches you might consider:
....

Thank you for the code and the explanation; this does achieve my task (or what instigated this feature request). You've perfectly captured what I was trying to do in the code, that is, not hard-coding the flu datasets.

Going back to this:

We have been thinking about how to simplify making multiple runs with a single command, for example for different flu segments.

This feature request would be part of this bigger picture of running multiple runs with a single command. Therefore, I'm closing this issue, as in hindsight it should have been a discussion.

Thanks again!

@ammaraziz, what might be useful for you is nextclade sort. We are currently refining some of the matching parameters, but what you can do, for example, is:

nextclade sort all_my_rsv_sequences.fasta --output-dir split_by_dataset --output-results-tsv table_with_matches.tsv

This will split your input sequences into files corresponding to datasets (and their prefixes).
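A follow-up loop over the split files might then look like the sketch below. The directory layout is mocked here so the loop has something to walk (the actual names are whatever nextclade sort writes; check table_with_matches.tsv), and echo stands in for a real nextclade run call:

```shell
# Mock the kind of layout `nextclade sort` produces (the names here are
# illustrative assumptions, not the tool's guaranteed output).
mkdir -p demo_split/flu_h3n2_mp demo_split/flu_h3n2_pa
touch demo_split/flu_h3n2_mp/sequences.fasta demo_split/flu_h3n2_pa/sequences.fasta

# Walk each split fasta; with real data, replace echo with a
# nextclade run invocation using the matching dataset.
find demo_split -name '*.fasta' | sort | while read -r f; do
  echo "process: $f"
done

rm -r demo_split
```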

Hi Richard,

That's actually the use case which triggered this request.

Thanks again :)