lenaschimmel / sc2rf

SARS-Cov-2 Recombinant Finder for fasta sequences

Find and use better source for typical mutations of lineages

lenaschimmel opened this issue · comments

See this comment by @AngieHinrichs which even contains an alternative.

Thanks a lot for your detailed explanation! I'm trying to move this over here so it's easier for me to find.

(Also, if the comment thread over at pango-designation gets locked down after too many "off topic" comments, I won't be able to comment there at all. This has already happened in other issues.)

And @SVN-PhD recommended that I take a look at outbreak.info for mutation prevalences.

Currently I have problems with it: neither the website nor the API seems to work properly at the moment, but I will check back later.

I suggest looking at cov-spectrum too.
Maybe you could open an issue there asking them (@chaoran-chen) to add a tool for downloading mutation lists in a machine-readable format.
The advantage of cov-spectrum would be that you can choose a country and period, restricting the mass of mutations to the ones really circulating in that place and period.

Hi everyone. I was just reading this issue here. Do the following APIs look useful to you?

Mutations of BA.1 globally:
https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?pangoLineage=BA.1*&minProportion=0.01

Pango lineage with the C25708T mutation:
https://lapis.cov-spectrum.org/open/v1/sample/aggregated?nucMutations=C25708T&fields=pangoLineage

You can also further filter by location, dates (and much more). For example:
https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?pangoLineage=BA.1*&minProportion=0.01&dateFrom=2022-01-01&region=Europe

Here is the documentation:
https://lapis.cov-spectrum.org/

It uses data from GenBank (prepared and hosted by Nextstrain).
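
The example links above can also be assembled programmatically. Below is a minimal Python sketch that only builds the query URLs; it assumes nothing beyond the endpoints and parameters visible in those links. The helper name `lapis_url` is hypothetical, and the response format is not covered here, so check the documentation before parsing any results.

```python
from urllib.parse import urlencode

# Base URL taken from the example links above.
LAPIS_BASE = "https://lapis.cov-spectrum.org/open/v1/sample"


def lapis_url(endpoint, **params):
    """Build a LAPIS query URL (hypothetical helper, not part of LAPIS)."""
    # safe="*" keeps the "*" in "BA.1*" unescaped, so the result
    # matches the hand-written links.
    return f"{LAPIS_BASE}/{endpoint}?{urlencode(params, safe='*')}"


ba1_global = lapis_url("nuc-mutations", pangoLineage="BA.1*", minProportion=0.01)
ba1_europe = lapis_url(
    "nuc-mutations",
    pangoLineage="BA.1*",
    minProportion=0.01,
    dateFrom="2022-01-01",
    region="Europe",
)
lineages_with_c25708t = lapis_url(
    "aggregated", nucMutations="C25708T", fields="pangoLineage"
)
print(ba1_global)
```

Keeping the filters as keyword arguments makes it easy to add or drop a location or date restriction without editing the URL by hand.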

Thanks a lot, that looks perfect!

I've been working on cov-spectrum integration yesterday. It's not yet finished, but looks promising!

Also, I've been ignoring deletions and insertions until now, because they are not present in virus_properties.json and are also ignored by some other tools. Looks like cov-spectrum handles deletions just like any other mutation, which I might do as well. @chaoran-chen is there a way to also get the insertions of a lineage from cov-spectrum?

And @corneliusroemer, I saw your comment there. My code (not yet pushed, will do it in the evening) can currently get current mutation lists from cov-spectrum and either generate a virus_properties.json with mostly the same syntax as your file:

        "21K": [
            "G21989-",
            "T13195C",
            ...

or it can also include the prevalence for each mutation:

        "21K": [
            {
                "mutation": "G21989-",
                "proportion": 0.9401410657729306,
                "count": 764958
            },
            {
                "mutation": "T13195C",
                "proportion": 0.9829978750416327,
                "count": 799829
            },

I don't have a use for the absolute count, so I could also break it down to:

        "21K": {
            "G21989-": 0.9401410657729306,
            "T13195C": 0.9829978750416327,
            ...

Does any of this seem useful for your work on Nextclade?
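
To make the relationship between the three layouts concrete, here is a small Python sketch that derives the plain list and the mutation-to-proportion map from the detailed layout. It is only an illustration, not the code in sc2rf: the function names are hypothetical, and the sample data contains just the two 21K entries quoted above (the real lists are longer).

```python
import json

# Detailed layout: clade -> list of {"mutation", "proportion", "count"}.
# Only the two entries quoted above; the real list is longer.
detailed = {
    "21K": [
        {"mutation": "G21989-", "proportion": 0.9401410657729306, "count": 764958},
        {"mutation": "T13195C", "proportion": 0.9829978750416327, "count": 799829},
    ]
}


def to_mutation_lists(detailed):
    """Clade -> plain list of mutation strings (the first layout)."""
    return {
        clade: [entry["mutation"] for entry in entries]
        for clade, entries in detailed.items()
    }


def to_proportion_maps(detailed):
    """Clade -> {mutation: proportion}, dropping the count (the third layout)."""
    return {
        clade: {entry["mutation"]: entry["proportion"] for entry in entries}
        for clade, entries in detailed.items()
    }


print(json.dumps(to_proportion_maps(detailed), indent=4))
```

Both compact layouts can be derived from the detailed one, but not the other way round, so the detailed layout is the safest one to store if the count might be needed later.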

@chaoran-chen is there a way to also get the insertions of a lineage from cov-spectrum?

Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it.

Unfortunately, many of the genome assembly pipelines in use for SARS-CoV-2 sequences do not do a good job with indels, so it may be just as well to skip them. I've seen cases where expected deletions are filled in with Ns, back-filled with reference sequence, or partially filled with read alignments that extend a bit into the deleted part of the reference genome, sometimes causing false "substitutions" in the deleted region. There is definitely enough information in the substitutions alone to distinguish between the Nextstrain clades. (Although if properly assembled sequences with reliable indels are available, I suppose including indels could provide a more precise estimation of the breakpoint.)

@chaoran-chen:

Unfortunately not yet. It will be possible eventually, but I don't know yet when I will be able to implement it.

Ok, just wanted to make sure that I'm not missing anything that's already there.

@AngieHinrichs: I agree. So I'll make it so that deletions are ignored by default, but can be enabled with a flag.
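
An opt-in behaviour like this could be sketched as follows in Python. This is an illustration under assumptions, not the actual sc2rf code: the helper names are hypothetical, the flag name is the one used in the release announced further down, and a deletion is recognised purely by the trailing "-" seen in mutations such as "G21989-".

```python
import argparse


def is_deletion(mutation):
    """True for mutations written like "G21989-" (alternate allele is "-")."""
    return mutation.endswith("-")


def select_mutations(mutations, enable_deletions=False):
    """Drop deletions unless they were explicitly enabled."""
    if enable_deletions:
        return list(mutations)
    return [m for m in mutations if not is_deletion(m)]


parser = argparse.ArgumentParser()
# store_true makes the flag default to False, so deletions are ignored by default.
parser.add_argument("--enable-deletions", action="store_true")

args = parser.parse_args([])  # simulate a run without the flag
print(select_mutations(["G21989-", "T13195C"], args.enable_deletions))
```

Passing `["--enable-deletions"]` to `parse_args` instead keeps both mutations in the result.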

Support for LAPIS / cov-spectrum is now released! The repo contains a pre-built virus_properties.json which can be updated with --rebuild-examples.

Deletions are disabled / ignored by default, but can be enabled with --enable-deletions. I'm not perfectly happy with how it works right now, but I think it's a good start:

Without deletions

screenshot-no-deletions

With deletions

screenshot-with-deletions

Thanks a lot for your input!

@lenaschimmel

Does any of this seem useful for your work on Nextclade?

This is pretty much how I started creating the virus_properties.json, before switching to Nextclade data because cov-spectrum doesn't have our clades.

I just pushed an update with the new --mutation-threshold parameter. See this comment for more details.

I think this finally addresses @AngieHinrichs' original suggestion.

Thanks @lenaschimmel, --mutation-threshold should do the trick!

I think another tweak might be needed for --rebuild-examples, however. In the latest virus_properties.json, and after running --rebuild-examples, the lists for 21I and 21J are empty:

        "21I": [],
        "21J": [],

-- is that perhaps because all of their defining mutations are now in 21A because of the new minimum of 0.05 when rebuilding?

21J grew much larger than 21I (almost 10x as many genomes per quick stats on the UCSC/UShER tree), so the allele frequencies in 21A are heavily skewed towards 21J.
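
The skew can be illustrated with a few lines of arithmetic in Python. The numbers are assumptions chosen for illustration, not measured values: 21A is treated as nothing but the pool of 21I and 21J genomes, 21J is taken to be exactly 10x as large as 21I, and a clade-defining mutation is assumed to be present in every genome of its clade. This does not show how sc2rf actually builds its lists; it only shows why both sets of mutations could clear a 0.05 minimum within 21A.

```python
# Assumptions for illustration only (not measured values):
# - 21A is the pool of 21I and 21J genomes and nothing else;
# - 21J has 10x as many genomes as 21I ("almost 10x" in the comment above);
# - a clade-defining mutation occurs in every genome of its clade.
GENOMES_21I = 1
GENOMES_21J = 10
REBUILD_MINIMUM = 0.05  # the minimum mentioned above for rebuilding

total = GENOMES_21I + GENOMES_21J
freq_21i_mutation_in_21a = GENOMES_21I / total  # 1/11, about 0.09
freq_21j_mutation_in_21a = GENOMES_21J / total  # 10/11, about 0.91

for label, freq in [
    ("21I-defining", freq_21i_mutation_in_21a),
    ("21J-defining", freq_21j_mutation_in_21a),
]:
    passes = freq >= REBUILD_MINIMUM
    print(f"{label}: frequency in 21A = {freq:.3f}, passes minimum: {passes}")
```

Under these assumptions even a 21I-only mutation sits at roughly 9% within 21A, above the 0.05 minimum, while 21J-only mutations sit at roughly 91%, which is consistent with 21A looking like 21J.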

When I run on the GenBank sequences from cov-lineages/pango-designation#471 (471.genbank.aligned.fa.gz), the label for Delta is "Delta (B.1.617.2 / 21A)", but the mutations are more like 21J because they include 4181T, 6402T, 7124T, 8986T, 9053G and so on. The proposed recombinant is from 21J, as most would be by chance since 21J was so much more common than 21I (probably especially by the time Omicron was around, though I have not checked dates). So the recombination picture comes out perfect except for the '21A' label:
[image: recombination picture]

I believe there are very few Delta sequences that are 21A but not 21I or 21J, so the quickest fix might be to simply skip 21A, although I'm not sure what that would mean for mutations shared by 21I and 21J.

It should be pretty straightforward to transform my file of not-masked-for-UShER Nextstrain clade mutations to the virus_properties.json format. I will give that a try.

There are indeed not that many Deltas (lately) that are neither 21I nor 21J. They do exist, there are a few pango lineages, but for identifying current recombinants, one can drop 21A without having to worry too much.

There's an update on #10 which is also relevant to this issue. See my comment here.