Incorporation of enterovirus dataset into Nextclade Docker container

Question

laura-bankers opened this issue a year ago

Hello,

We are developing a bioinformatics workflow for EV-D68 WGS for public health surveillance, to be run on Terra.bio. There appears to be an enterovirus Nextclade dataset in the GitHub repo; however, it is not available in the most recent Docker container. We would love to be able to use Nextclade for clade assignment. Would it be possible to get this dataset added to the container available on Dockstore?

Thanks,
Laura

Ivan Aksamentov · Answer 1 · Wed Aug 16 2023 05:46:33 GMT+0800 (China Standard Time)

Hi Laura @laura-bankers,

Thanks for your interest! We are very happy that people are reaching out and asking about new pathogens.

Do you mean these files?

https://github.com/nextstrain/nextclade/tree/master/data/enterovirus/d68

Sadly, these are only a genome annotation, a reference sequence and a few example sequences, so they are not enough to run Nextclade (which currently also requires a reference tree, a QC config and a virus properties config). Historically, these files are only there to provide some examples for running Nextalign (which is like Nextclade, but only does alignment and translation).

Or maybe you've seen other files somewhere else? Could you please send me a link?

I can't rule out that there are community-created datasets somewhere on the internet that we don't know about.
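If you do come across a candidate dataset directory, a rough way to check whether it has the pieces listed above is sketched below. The file names are an assumption based on the layout of current (v2) datasets and are not spelled out in this thread, so please compare them against a dataset you have actually downloaded.

```python
from pathlib import Path

# ASSUMPTION: these file names follow the v2 dataset layout as I recall it;
# they are not taken from this thread, so verify them against a real dataset.
REQUIRED_FILES = {
    "reference sequence": "reference.fasta",
    "genome annotation": "genemap.gff",
    "reference tree": "tree.json",
    "QC config": "qc.json",
    "virus properties config": "virus_properties.json",
}


def missing_dataset_files(dataset_dir):
    """Return a description of each required file that is absent from dataset_dir."""
    root = Path(dataset_dir)
    return [
        f"{role} ({name})"
        for role, name in REQUIRED_FILES.items()
        if not (root / name).is_file()
    ]
```

For a directory holding only a reference sequence and a genome annotation, as with the EV-D68 example files, this would report the tree, QC config and virus properties config as missing.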

A few notes which may help you in your work with Nextclade:

Dockstore containers are not something the Nextclade team is aware of. They are not an official source, and are probably a community effort, which we are happy to hear about but don't have the bandwidth to support officially.

Official Docker containers (on DockerHub) and all other official means of distributing Nextclade CLI (listed in the docs) deliberately don't contain datasets. Nextclade is pathogen-agnostic by design. It only reads an index.json file hosted on our servers, which contains a list of known datasets, and can then download a dataset from this list using the nextclade dataset get command. This is purely for convenience: you can also load any dataset you want from your computer. So, if you found a dataset you like, or created one, you can just pass it into Nextclade as you would an officially downloaded one.
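To make the two options concrete, here is a minimal sketch that builds the command lines for downloading an official dataset and for running Nextclade against any dataset directory, downloaded or your own. The flags are the v2 CLI options as I recall them, so please confirm with nextclade --help for your version. The dataset name "sars-cov-2" is only an example of an officially listed dataset, and all paths (including the EV-D68 directory) are hypothetical.

```python
import shlex


def dataset_get_cmd(name, output_dir):
    """Command that downloads an officially listed dataset into output_dir."""
    return ["nextclade", "dataset", "get", "--name", name, "--output-dir", output_dir]


def run_cmd(dataset_dir, fasta, output_dir):
    """Command that runs Nextclade on fasta with the dataset stored in dataset_dir."""
    return [
        "nextclade", "run",
        "--input-dataset", dataset_dir,  # downloaded or self-made, treated the same
        "--output-all", output_dir,
        fasta,
    ]


if __name__ == "__main__":
    # Only print the commands; hypothetical names and paths throughout.
    print(shlex.join(dataset_get_cmd("sars-cov-2", "data/sars-cov-2")))
    print(shlex.join(run_cmd("data/my-ev-d68", "sequences.fasta", "output")))
```

To actually execute one of these, pass the list to subprocess.run(cmd, check=True) in an environment where the nextclade binary is on the PATH.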

You can try to build your own dataset to support a new pathogen. It's quite a challenging adventure at the moment, but I gathered some information in response to this issue in the hope that it helps people: #1225

We are working on the next major version of Nextclade, version 3. In the new version there will be significant changes to datasets. Nextalign will be removed, and all dataset files previously required for Nextclade will become optional, so you will be able to build a dataset gradually, starting small and adding new features later as needed. We are also hoping to document the creation of new datasets better and to provide tools to make the process easier. It's all coming soon. Stay tuned!