nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement

Home Page: https://clades.nextstrain.org

Custom datasets via URL param `dataset-url`

joverlee521 opened this issue

One question that came up in yesterday's Nextstrain office hours is how to provide the name of a Nextclade dataset when using the `dataset-url` param for custom datasets, i.e. how to replace "Untitled dataset" when loading a custom dataset:

[Screenshot: Nextclade Web showing "Untitled dataset" for a custom dataset]

Reading through the URL parameters docs, I thought this could be done with the `dataset-name` param, but that turned out not to work.
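For context, a custom dataset link of the kind discussed here can be assembled as in the minimal sketch below. It assumes Nextclade Web is served from the home page URL listed at the top of this page and reads `dataset-url` from the query string; the dataset location `https://example.com/my-dataset` and the helper name `custom_dataset_link` are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumption: Nextclade Web lives at the project's home page URL.
NEXTCLADE_WEB = "https://clades.nextstrain.org"


def custom_dataset_link(dataset_url: str) -> str:
    """Return a Nextclade Web link that loads a dataset from `dataset_url`."""
    # urlencode percent-escapes the nested URL so that it is passed
    # as a single query parameter value.
    return f"{NEXTCLADE_WEB}?{urlencode({'dataset-url': dataset_url})}"


# Hypothetical dataset location, for illustration only.
print(custom_dataset_link("https://example.com/my-dataset"))
# prints https://clades.nextstrain.org?dataset-url=https%3A%2F%2Fexample.com%2Fmy-dataset
```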

Then I looked through the source code and found that the fetchSingleDatasetFromUrl function does try to parse the tag.json file for information about the dataset. Is the only way to provide information about a dataset via the tag.json file?

@victorlin found the following in the Nextclade dataset docs:

> Dataset also includes a file tag.json which contains version tag and other properties of the dataset. This file is currently not used by Nextclade and serves only for informational purposes.

Am I understanding correctly that the tag.json file is not used by the Nextclade CLI, but it does provide dataset information that gets displayed on Nextclade Web?

I was able to update the text displayed for an example custom dataset:

[Screenshot: Nextclade Web showing the updated text for the example custom dataset]

I added the following to the tag.json file:

"attributes": {
    "name": {
      "value": "CUSTOM NAME",
      "valueFriendly": "CUSTOM FRIENDLY"
    },
    "reference": {
      "value": "CY121680",
      "valueFriendly": "A/California/07/2009(H1N1)"
    },
    "tag": {
      "value": "2023-04-28T00:00:00Z"
    }
}
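If the same change needs to be applied to several datasets, the edit above could be scripted. The following is only a sketch: the helper name `with_display_attributes` is hypothetical, and it assumes nothing about `tag.json` beyond the `attributes` block shown above, so any other keys already present in the file are passed through untouched.

```python
import json


def with_display_attributes(tag, name, name_friendly,
                            reference, reference_friendly, version_tag):
    """Return a copy of a parsed tag.json dict with the display attributes set."""
    updated = dict(tag)  # shallow copy, so the caller's dict is not modified
    attributes = dict(updated.get("attributes", {}))
    attributes.update({
        "name": {"value": name, "valueFriendly": name_friendly},
        "reference": {"value": reference, "valueFriendly": reference_friendly},
        "tag": {"value": version_tag},
    })
    updated["attributes"] = attributes
    return updated


# Values taken from the example above; the empty dict stands in for the
# parsed contents of an existing tag.json.
example = with_display_attributes(
    {},
    name="CUSTOM NAME",
    name_friendly="CUSTOM FRIENDLY",
    reference="CY121680",
    reference_friendly="A/California/07/2009(H1N1)",
    version_tag="2023-04-28T00:00:00Z",
)
print(json.dumps(example, indent=2))
```

In practice the input would come from `json.load` on the dataset's existing `tag.json`, and the result would be written back with `json.dump`.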

> One question that came up in yesterday's Nextstrain office hours is how to provide the name of a Nextclade dataset when using the `dataset-url` param for custom datasets

Whoa! This feature was developed mostly for internal use, for Richard and Cornelius to experiment when preparing datasets for new pathogens, but it seems it has also gained some traction in the community.

@joverlee521 Thanks for digging into this. I even forgot that I fetch tag.json there, so if you'd asked me yesterday, I'd have said there is no way to change the name :) This also means that the docs about tag.json are no longer accurate.

@emmahodcroft @corneliusroemer Looks like Jover has figured out how to remove that ugly "Untitled dataset" text we've been discussing recently :)


A few words/thoughts regarding the upcoming changes in Nextclade v3:

We are planning a pretty big revamp of dataset files for Nextclade v3 sometime in the next few months, and this is a good opportunity to also think about how custom single datasets and custom dataset collections could be improved. To become a user-facing API, this would need a more formal format spec and documentation for the single dataset description data (like tag.json currently, but better) and for indexing of dataset collections (like index_v2.json currently, but better).

We might also think of some more user-friendly ways of accessing custom datasets than URL parameters, perhaps a UI similar to Nextstrain community builds, with either a centralized list of third-party source URLs or with third-party source URLs stored in the browser's local storage, or both.

Creating datasets is probably still a very rare activity outside of the team (and even within the team), so I am not sure how high a priority that would be, but we keep hearing more and more from people who are interested in building datasets or are already trying to.

> Whoa! This feature was developed mostly for internal use, for Richard and Cornelius to experiment when preparing datasets for new pathogens, but it seems it has also gained some traction in the community.

Hah! We asked Richard to join one of the Nextstrain office hours to help with the custom Nextclade datasets questions and he's the one who pointed us to this feature!

> We might also think of some more user-friendly ways of accessing custom datasets than URL parameters, perhaps a UI similar to Nextstrain community builds, with either a centralized list of third-party source URLs or with third-party source URLs stored in the browser's local storage, or both.

This sounds great! Nextclade has been such a great tool for SC2, I definitely understand the want for datasets for other pathogens. With how much work it is to maintain a dataset (@corneliusroemer has been amazing with this 🌟), supporting more community datasets is definitely a good direction.

The Nextclade v3 release should have mostly solved this. We now have a dataset creator guide and community datasets. The new pathogen.json file stores the dataset name in a unified manner, which should make it easier to make custom datasets pretty and informative.

Comment or open a new issue if there are remaining problems or suggestions.