Find and use better source for typical mutations of lineages

Question

lenaschimmel opened this issue 2 years ago · comments

See this comment by @AngieHinrichs which even contains an alternative.

Thanks a lot for your detailed explanation! I'm trying to move this over here so it's easier to find for me.

(Also, if the comment thread over at pango-designation gets locked down after too many "off topic" comments, I won't be able to comment there at all. That has already happened in other issues.)

Lena Schimmel · Answer 1 · Tue Mar 15 2022 09:58:17 GMT+0800 (China Standard Time)

And @SVN-PhD recommended that I take a look at outbreak.info for mutation prevalences.

Currently I have problems with it: neither the website nor the API seems to work properly at the moment, but I will check back later.

Federico Gueli · Answer 2 · Tue Mar 15 2022 15:54:23 GMT+0800 (China Standard Time)

I suggest looking at Cov-Spectrum too.
Maybe you could open an issue there asking them (@chaoran-chen) to add a tool for downloading the mutation list in a machine-readable format.
The advantage of Cov-Spectrum is that you can choose a country and period, restricting the mass of mutations to the ones actually circulating in that place and period.

Chaoran Chen · Answer 3 · Tue Mar 15 2022 15:59:14 GMT+0800 (China Standard Time)

Hi everyone. I was just reading this issue here. Do the following APIs look useful to you?

Mutations of BA.1 globally:
https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?pangoLineage=BA.1*&minProportion=0.01

Pango lineage with the C25708T mutation:
https://lapis.cov-spectrum.org/open/v1/sample/aggregated?nucMutations=C25708T&fields=pangoLineage

You can also further filter by location, dates (and much more). For example:
https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?pangoLineage=BA.1*&minProportion=0.01&dateFrom=2022-01-01&region=Europe

Here is the documentation:
https://lapis.cov-spectrum.org/

It uses data from GenBank (prepared and hosted by Nextstrain).
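
The example URLs above can also be built and queried from a script. The following is a minimal Python sketch, not part of any tool discussed here: the helper names `build_lapis_url` and `fetch_nuc_mutations` are made up for illustration, and the assumption that the JSON response wraps its records in a top-level `"data"` list should be checked against the LAPIS documentation linked above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base path taken from the example URLs in this thread.
LAPIS_BASE = "https://lapis.cov-spectrum.org/open/v1/sample"


def build_lapis_url(endpoint, **params):
    """Build a query URL; urlencode escapes the '*' in 'BA.1*' as '%2A'."""
    return f"{LAPIS_BASE}/{endpoint}?{urlencode(params)}"


def fetch_nuc_mutations(pango_lineage, min_proportion=0.01, **filters):
    """Fetch nucleotide mutations for a lineage (performs a network request).

    Extra keyword arguments such as dateFrom or region are passed through
    as query parameters, mirroring the filtered example URL above.
    """
    url = build_lapis_url(
        "nuc-mutations",
        pangoLineage=pango_lineage,
        minProportion=min_proportion,
        **filters,
    )
    with urlopen(url, timeout=60) as response:
        payload = json.load(response)
    # Assumption: the records sit in a top-level "data" list.
    return payload["data"]
```

For instance, `fetch_nuc_mutations("BA.1*", dateFrom="2022-01-01", region="Europe")` would request the same data as the third URL above, and `build_lapis_url("aggregated", nucMutations="C25708T", fields="pangoLineage")` reproduces the second one.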

Lena Schimmel · Answer 4 · Tue Mar 15 2022 16:18:05 GMT+0800 (China Standard Time)

Thanks a lot, that looks perfect!

Lena Schimmel · Answer 5 · Wed Mar 16 2022 19:58:47 GMT+0800 (China Standard Time)

I've been working on cov-spectrum integration yesterday. It's not yet finished, but looks promising!

Also, I've been ignoring deletions and insertions until now, because they are not present in virus_properties.json and are also ignored by some other tools. Looks like cov-spectrum handles deletions just like any other mutation, which I might do as well. @chaoran-chen is there a way to also get the insertions of a lineage from cov-spectrum?

And @corneliusroemer, I saw your comment there. My code (not yet pushed, will do it in the evening) can currently get up-to-date mutation lists from cov-spectrum and either generate a virus_properties.json with mostly the same syntax as your file:

        "21K": [
            "G21989-",
            "T13195C",
            ...

or it can also include the prevalence for each mutation:

        "21K": [
            {
                "mutation": "G21989-",
                "proportion": 0.9401410657729306,
                "count": 764958
            },
            {
                "mutation": "T13195C",
                "proportion": 0.9829978750416327,
                "count": 799829
            },

I don't have a use for the absolute count, so I could also break it down to:

        "21K": {
            "G21989-": 0.9401410657729306,
            "T13195C": 0.9829978750416327,
           ...

Does any of this seem useful for your work on Nextclade?
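
The three layouts above differ only in how much of each record they keep, so they can be derived from one another. Here is a minimal Python sketch of that conversion; it is not the code referred to in this comment, and the helper names `to_name_list` and `to_proportion_map` are hypothetical. The two records are the ones quoted above.

```python
import json

# Records in the "with prevalence" layout, copied from the example above.
records = [
    {"mutation": "G21989-", "proportion": 0.9401410657729306, "count": 764958},
    {"mutation": "T13195C", "proportion": 0.9829978750416327, "count": 799829},
]


def to_name_list(records):
    """Keep only the mutation names (the plain list layout)."""
    return [record["mutation"] for record in records]


def to_proportion_map(records):
    """Map each mutation name to its proportion, dropping the absolute count."""
    return {record["mutation"]: record["proportion"] for record in records}


# Print both compact layouts for clade 21K.
print(json.dumps({"21K": to_name_list(records)}, indent=4))
print(json.dumps({"21K": to_proportion_map(records)}, indent=4))
```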

Chaoran Chen · Answer 6 · Wed Mar 16 2022 20:27:44 GMT+0800 (China Standard Time)

@chaoran-chen is there a way to also get the insertions of a lineage from cov-spectrum?

Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it.

Angie Hinrichs · Answer 7 · Thu Mar 17 2022 00:22:37 GMT+0800 (China Standard Time)

Unfortunately, for SARS-CoV-2 sequences there are many genome assembly pipelines in use that do not do a good job with indels, so it may be just as well to skip them. I've seen cases where expected deletions are filled in with Ns, back-filled with reference sequence, or partially filled with read alignments that extend a bit into the deleted part of the reference genome, sometimes causing false "substitutions" in the deleted region. There is definitely enough information in the substitutions alone to distinguish between the Nextstrain clades. (Although if properly assembled sequences with reliable indels are available, I suppose including indels could provide a more precise estimation of the breakpoint.)

Lena Schimmel · Answer 8 · Thu Mar 17 2022 00:39:07 GMT+0800 (China Standard Time)

@chaoran-chen:

Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it.

Ok, just wanted to make sure that I'm not missing anything that's already there.

@AngieHinrichs: I agree. So I'll make it so that deletions are ignored by default, but can be enabled with a flag.

Lena Schimmel · Answer 9 · Thu Mar 17 2022 04:26:53 GMT+0800 (China Standard Time)

Support for LAPIS / cov-spectrum is now released! The repo contains a pre-built virus_properties.json which can be updated with --rebuild-examples.

Deletions are disabled / ignored by default, but can be enabled with --enable-deletions. I'm not perfectly happy with how it works right now, but I think it's a good start:

[Image: without deletions]

[Image: with deletions]

Thanks a lot for your input!
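
The "ignored by default, opt in with a flag" behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not the released implementation: the function names are hypothetical, and it assumes deletions are written with a trailing "-" as in the "G21989-" example earlier in this thread.

```python
def is_deletion(mutation):
    """Assumption: a deletion is written as <ref><position>-, e.g. 'G21989-'."""
    return mutation.endswith("-")


def filter_mutations(mutations, enable_deletions=False):
    """Drop deletions unless they are explicitly enabled."""
    if enable_deletions:
        return list(mutations)
    return [m for m in mutations if not is_deletion(m)]


mutations = ["G21989-", "T13195C"]
print(filter_mutations(mutations))                         # substitutions only
print(filter_mutations(mutations, enable_deletions=True))  # everything
```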

Cornelius Roemer · Answer 10 · Thu Mar 17 2022 08:57:50 GMT+0800 (China Standard Time)

@lenaschimmel

Does any of this seem useful for your work on Nextclade?

This is pretty much how I started creating the virus_properties.json, before switching to Nextclade data because covSpectrum doesn't have our clades.

Lena Schimmel · Answer 11 · Sat Mar 19 2022 08:08:29 GMT+0800 (China Standard Time)

I just pushed an update with the new --mutation-threshold parameter. See this comment for more details.

I think this finally addresses @AngieHinrichs' original suggestion.

Angie Hinrichs · Answer 12 · Tue Mar 22 2022 08:25:32 GMT+0800 (China Standard Time)

Thanks @lenaschimmel, --mutation-threshold should do the trick!

I think another tweak might be needed for --rebuild-examples, however. In the latest virus_properties.json, and after running --rebuild-examples, the lists for 21I and 21J are empty:

        "21I": [],
        "21J": [],

Is that perhaps because all of their defining mutations are now in 21A, given the new minimum of 0.05 when rebuilding?

21J grew much larger than 21I (almost 10x as many genomes per quick stats on the UCSC/UShER tree), so the allele frequencies in 21A are heavily skewed towards 21J.
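
The skew can be made concrete with a small calculation. The Python sketch below rests on idealized assumptions that are not measured values: 21A is treated as consisting only of 21I and 21J, 21J is taken to be 10 times the size of 21I (the "almost 10x" figure above), and a clade-specific mutation is assumed to be present in every genome of its own clade and in none of the other.

```python
def pooled_frequency(freq_by_clade, size_by_clade):
    """Allele frequency in the union of clades, weighted by clade size."""
    total = sum(size_by_clade.values())
    weighted = sum(freq_by_clade[c] * size_by_clade[c] for c in size_by_clade)
    return weighted / total


# Relative sizes only; assumption based on the "almost 10x" estimate.
sizes = {"21I": 1, "21J": 10}

# Hypothetical, idealized frequencies of a mutation specific to one sub-clade.
specific_to_21j = pooled_frequency({"21I": 0.0, "21J": 1.0}, sizes)
specific_to_21i = pooled_frequency({"21I": 1.0, "21J": 0.0}, sizes)

minimum = 0.05  # the minimum used when rebuilding, as mentioned above
print(f"21J-specific mutation within 21A: {specific_to_21j:.3f}")
print(f"21I-specific mutation within 21A: {specific_to_21i:.3f}")
print("both above minimum:", specific_to_21j > minimum and specific_to_21i > minimum)
```

Under these assumptions a 21J-specific mutation reaches roughly 0.91 within 21A and a 21I-specific one roughly 0.09, so both would clear a 0.05 minimum. That would be consistent with the guess that the defining mutations of both sub-clades end up attributed to 21A.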

When I run on the GenBank sequences from cov-lineages/pango-designation#471 (471.genbank.aligned.fa.gz), the label for Delta is "Delta (B.1.617.2 / 21A)" but the mutations are more like 21J because they include 4181T, 6402T, 7124T, 8986T, 9053G and so on. Since the proposed recombinant is from 21J (like most would be by chance since 21J was so much more common than 21I, probably especially by the time Omicron was around though I have not checked dates), the recombination picture comes out perfect except for the '21A' label:

I believe there are very few Delta sequences that are 21A but not 21I or 21J, so the quickest fix might be to simply skip 21A, although I'm not sure what that would mean for mutations shared by 21I and 21J.

It should be pretty straightforward to transform my file of not-masked-for-UShER Nextstrain clade mutations to the virus_properties.json format. I will give that a try.

Cornelius Roemer · Answer 13 · Wed Mar 23 2022 04:23:25 GMT+0800 (China Standard Time)

There are indeed not that many Deltas (lately) that are neither 21I nor 21J. They do exist, there are a few pango lineages, but for identifying current recombinants, one can drop 21A without having to worry too much.

Lena Schimmel · Answer 14 · Wed Mar 23 2022 06:05:51 GMT+0800 (China Standard Time)

There's an update on #10 which is also relevant to this issue. See my comment here.