nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement

Home Page: https://clades.nextstrain.org

Files required to configure a custom dataset when using nextclade for pathogens not provided by nextclade

gksruf0323 opened this issue

I would like to ask how to make better use of Nextclade.

I want to analyze new pathogens that do not correspond to the species provided by nextclade (covid, mpox, etc.) using nextclade. I understand that a custom dataset is required for this.

When creating a custom dataset, the following files are needed:

  1. reference tree
  2. root sequence
  3. quality control
  4. virus properties
  5. gene map
  6. PCR primers

From your explanation I understand that all six are required, but I couldn't find out how to create them. I would appreciate it if you could tell me how to make them.

Thanks!

Hi @gksruf0323, right now the only way is to look at the existing datasets and try to replicate them. You can find the source data here:
https://github.com/nextstrain/nextclade_data/tree/master/data/datasets

There is also a little bit of info in the docs:
https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html#creating-a-custom-dataset

Just copy the dataset folder of the pathogen that most closely resembles your new pathogen, and modify the files to fit your needs. Most files should be straightforward (the docs explain the meaning of each file). For things you don't need, you can leave empty/dummy files.
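
As a rough illustration of the copy-and-modify approach, here is a minimal Python sketch that copies an existing dataset folder and reports which of the six components are present in the copy. The file names in `EXPECTED_FILES` are assumptions based on the layout of the current (v2-era) datasets and the function name `copy_dataset` is made up for this example, so check both against the folder you actually copy.

```python
import shutil
from pathlib import Path

# Assumed file names for the six dataset components (not confirmed in this
# thread); compare with the dataset folder you are copying from.
EXPECTED_FILES = {
    "reference tree": "tree.json",
    "root sequence": "reference.fasta",
    "quality control": "qc.json",
    "virus properties": "virus_properties.json",
    "gene map": "genemap.gff",
    "PCR primers": "primers.csv",
}


def copy_dataset(template_dir, target_dir):
    """Copy a template dataset folder and report which components exist.

    Returns a dict mapping each component name to True if the
    corresponding file is present in the copied folder.
    """
    template_dir = Path(template_dir)
    target_dir = Path(target_dir)
    # copytree raises FileExistsError if target_dir already exists,
    # which protects an earlier edited copy from being overwritten.
    shutil.copytree(template_dir, target_dir)
    return {
        role: (target_dir / name).is_file()
        for role, name in EXPECTED_FILES.items()
    }
```

Calling `copy_dataset("path/to/existing_dataset", "my_pathogen")` (both paths are placeholders) gives a quick checklist of the files that still need to be edited or replaced with empty/dummy versions.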

In particular, building a good reference tree JSON might be challenging. The trees for the current datasets are prepared by Cornelius using Nextstrain Augur here:

https://github.com/neherlab/nextclade_data_workflows

Refer to Augur's docs and tutorials for more info.
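
Once you have a tree JSON, a quick sanity check can catch obvious problems before running Nextclade. The sketch below assumes the Auspice v2 JSON layout produced by `augur export v2` (a top-level `"tree"` object whose nodes carry `"children"` and `"node_attrs"`), and assumes that clades are stored under `node_attrs.clade_membership`, as in the existing datasets. The function names are made up for this example.

```python
import json


def iter_nodes(root):
    """Yield every node of a nested tree dict in pre-order, without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Reverse so that children are visited in their original order.
        stack.extend(reversed(node.get("children", [])))


def nodes_missing_clade(tree_json):
    """Return names of nodes lacking node_attrs.clade_membership.

    Assumes the Auspice v2 layout: {"meta": {...}, "tree": {...}}.
    """
    return [
        node.get("name", "<unnamed>")
        for node in iter_nodes(tree_json["tree"])
        if "clade_membership" not in node.get("node_attrs", {})
    ]


def check_tree_file(path):
    """Load a tree JSON file and list the nodes without a clade annotation."""
    with open(path, encoding="utf-8") as handle:
        return nodes_missing_clade(json.load(handle))
```

An empty list from `check_tree_file("tree.json")` only means every node has a clade annotation; it does not prove that the tree is otherwise suitable as a reference tree.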

You could also find some thoughts, tips and tricks in these PRs:

but they are very chaotic and not finished.

Here is a more substantial guide and template, but it is also still in the works:

https://github.com/neherlab/nextclade-dataset-template

Note that we are preparing the next major release of Nextclade, version 3, which will come with breaking changes, including a new dataset format. You can get a sneak peek at the v3 work happening here, but it is very early and things will still break a lot in the coming weeks:

One of the goals of v3 is to make dataset creation simpler and to document the process, as well as to allow the community to share new datasets more easily. Stay tuned!

This helped me a lot.

Thank you!