nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement

Home Page: https://clades.nextstrain.org

Updating certain parts of the JSON file output from Map to Array

mitochon opened this issue

We are using the JSON output file. Currently the JSON output looks like the excerpt below, where each gene is a key in a dictionary:

"privateAaMutations":
    {
        "HA1":
        {
            "privateSubstitutions":
            [
                {
                    "refAA": "I",
                    "codon": 0,
                    "queryAA": "H"
                },
...
        },
        "HA2":
        {
            "privateSubstitutions":  [],
            "totalPrivateSubstitutions": 0,
            "totalPrivateDeletions": 0,
            "totalReversionSubstitutions": 0
        },
...

We use Spark to ingest the JSON output. The inferred schema looks like this:

 |    |-- HA1: struct (nullable = true)
 |    |    |-- totalPrivateDeletions: long (nullable = true)
 |    |    |-- totalPrivateSubstitutions: long (nullable = true)
 |    |    |-- totalReversionSubstitutions: long (nullable = true)

In other words, whenever a new gene shows up in the output, it appears as another column in the schema. For our use case, we would ideally have 'gene' as a column, e.g.

 |    |-- element: struct (nullable = true)
 |    |    |-- gene: string (nullable = true)
 |    |    |-- totalPrivateDeletions: long (nullable = true)
 |    |    |-- totalPrivateSubstitutions: long (nullable = true)
 |    |    |-- totalReversionSubstitutions: long (nullable = true)

This would keep the schema stable and make it easier to run queries on the data frame.

I see the relevant code here, where the BTreeMap gets serialized as a dictionary. One potential change is to add 'gene' as a field of each privateAaMutations entry and to emit that output field as an array instead of a map.

Currently we're using jq to reshape the output JSON, but maybe there are other teams out there with similar use cases. We would like to get some thoughts from the Nextclade team on how, or whether, this fits into the larger picture.
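
For readers with the same need, the reshaping described above (a per-gene map turned into an array with a 'gene' field) can be sketched in a few lines of Python. This is only an illustration of the idea, not Nextclade code and not the jq filter used in this issue: the function name `gene_map_to_array` is hypothetical, and the totals shown for HA1 in the sample are made up because the excerpt above elides them.

```python
import json


def gene_map_to_array(by_gene, key_field="gene"):
    """Turn {"HA1": {...}, "HA2": {...}} into [{"gene": "HA1", ...}, ...].

    Hypothetical helper: genes are sorted by name so the output order is stable.
    """
    return [{key_field: gene, **fields} for gene, fields in sorted(by_gene.items())]


# Sample shaped like the excerpt in this issue. The HA1 totals are
# illustrative values only; the original excerpt elides them.
sample = json.loads("""
{
  "privateAaMutations": {
    "HA2": {"privateSubstitutions": [],
            "totalPrivateSubstitutions": 0,
            "totalPrivateDeletions": 0,
            "totalReversionSubstitutions": 0},
    "HA1": {"privateSubstitutions": [{"refAA": "I", "codon": 0, "queryAA": "H"}],
            "totalPrivateSubstitutions": 1,
            "totalPrivateDeletions": 0,
            "totalReversionSubstitutions": 0}
  }
}
""")

rows = gene_map_to_array(sample["privateAaMutations"])
print(json.dumps(rows, indent=2))
```

With this shape every record has the same set of fields regardless of which genes are present, so a schema inferred from it does not grow a new column per gene.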

The fact that you can use Nextclade JSON to seed your particular database at all is a small miracle, and I would not recommend relying on it going forward. As mentioned in the docs, the JSON output is unstable. Also, there will be massive breaking changes in the coming weeks in Nextclade v3.

The JSON format is used for internal communication between different parts of Nextclade and, as you've discovered, it is just a serialized internal struct. It naturally changes during routine development.

As a small research lab, we are focusing on science. We don't have time to commit to maintaining a stable external JSON format at this point, and we will not have the resources to adjust to the requirements of downstream projects. We experiment and break things a lot, and we reserve the right to change the JSON format at any time without warning.

So while you can submit a PR to change the format now (assuming there is no loss of functionality or correctness, we will likely accept it), I don't see it helping much in the long term.

One thing we have considered to make the JSON output easier to use is providing a JSON schema for the format, but this would not help much in your use case.

Perhaps writing a middleware tool to ingest the TSV output is a better solution for downstream projects? The TSV output is much more stable: it follows semantic versioning. You can then maintain a stable output format of your liking, and open-source the tool for those in the community who happen to use your particular toolset.

Also, Spark seems like massive overkill to me. Internally, our scientists use TSV with pandas/polars and it works decently well. Maybe this could also suit your project?
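
As a rough illustration of the TSV route suggested above, the sketch below reads tab-separated rows using only the Python standard library (pandas or polars would serve the same purpose). It is a minimal sketch under stated assumptions: the helper name `read_tsv_rows` is hypothetical, the inline sample is invented for the example, and the column names used here should be checked against the header of your own Nextclade TSV file.

```python
import csv
import io


def read_tsv_rows(handle, int_columns=()):
    """Read tab-separated rows into dicts, casting the named columns to int.

    Hypothetical helper: columns that are missing or empty are left untouched.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        for col in int_columns:
            if row.get(col) not in (None, ""):
                row[col] = int(row[col])
        rows.append(row)
    return rows


# Invented sample data; the column names are assumptions for illustration.
SAMPLE_TSV = (
    "seqName\tclade\ttotalSubstitutions\n"
    "sample_A\tcladeX\t12\n"
    "sample_B\tcladeY\t7\n"
)

rows = read_tsv_rows(io.StringIO(SAMPLE_TSV), int_columns=("totalSubstitutions",))
for row in rows:
    print(row["seqName"], row["clade"], row["totalSubstitutions"])
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` wrapper.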

If you have other ideas let us know.

Thanks for your comments and suggestions.
I discussed with a few of our team members and we will look into using the TSV output in lieu of JSON.
We do want to thank you for your work and making this tool available.
It has enabled us to do research and helped us make some contributions in the public health space.