zwdzwd / transvar

TransVar - multiway annotator for precision medicine

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Inconsistent position

bitsyamagu opened this issue · comments

When I submitted a mutation:
NM_001164277:c.925_928delinsTC

to Transvar
with these options:
Reverse annotation cDNA, GRCh37, RefSeq

I got "p.G309Sfs*3" and "chr11:g.118896734_118896737delinsGA"
for this mutation.

Although I checked this mutation on the UCSC Genome Browser,
the position chr11:g.118896734_118896737delinsGA is 308 of
NM_001164277(attached png).

Other similar services, Mutalyzer and VariantVaridator, returns "309"
, and it is consistent with the genome browser.

test

Thank you for your detailed explanation.
I've got the circumstance.

Now I'm trying to replace the original refGene by NCBI mapping
on a local copy of TransVar which I installed to an local server.
But it doesn't work yet.

Sorry for being silent until now. I have taken a look at this case. It seems this is brought about by the Patch 13 of GRCh37/hg19 (happened in 2018), which includes a deletion in the reference genome upstream the SLC37A4 locus. See the following for the update info and screenshot

image

image

https://genome.ucsc.edu/cgi-bin/hgc?hgsid=725242099_iR8o5gc4tCCAUvNxq78k2BYTpEwI&c=chr11&l=118898423&r=118898453&o=118849110&t=118957986&g=hg19Patch13Patches&i=jh159138.1

In fact if you look at the original RefSeq annotation gtf, there is a Note that seems to pointing to this discrepancy.

Screen Shot 2019-05-09 at 1 39 37 PM

The problem is that there is no specification of the alignment to the patch included in RefSeq gene. I would suggest you either switch to hg38 or use Ensembl annotation which treats the gap as an intron. I am not sure how from the transvar perspective whether we can completely solve this problem since the refseq gene doesn't include this particular gap and patches aren't meant to be applied (https://www.biostars.org/p/45788/). But certainly let me know if you have better suggestion.

In short, this is an extremely rare situation where the genome reference is patched to introduce a deletion in the exon sequence of a gene causing the gene to "frameshift". I have not previously encountered a situation like this before.

There are two errors that many software in this space gets wrong.

  • Packages incorrectly assume that exon genomic coordinates from NCBI, UCSC, and Ensembl are identical. NCBI aligns transcripts with splign, UCSC realigns with blat, Ensembl by direct genome annotation. Exon coordinates by these methods agree in most cases, but not all. When NCBI and UCSC don't agree, the most common case is the shuffling of a few nucleotides from one side of an intron to the other (thereby preserving the CDS), but I've also seen much larger moves. In 2014 at least, some paralogs were swapped. I summarized a bunch of these differences in 2014 here: https://www.slideshare.net/reecehart/hvp-2014-clinical-significance-of-transcript-alignment-discrepancies. There is good reason to believe that coordinates from splign are much better than those from blat. The problem is that these alignments are like optical lenses: If someone publishes a variant in one of these transcripts using blat alignments, then only blat alignments are guaranteed to project it back to the same genomic location.

  • Packages incorrectly assume that exons align perfectly or contain only substitution variants. In fact, there are many transcript alignments that contain indels due to polymorphisms (nearly all modulo 3) and sequencing errors (less frequently). Packages that fail to account for these indels when projecting variants from one sequence to another will result in locations that are off by a corresponding amount. See Garla V et al., Bioinformatics 2011. NCBI's tools, Alamut, GeneInsight, and the Python hgvs package deal with genome-transcript indels. I think the Allele Registry also handles such cases, but I've not verified. I know for certain that Mutalyzer (as of last October) and snpEff (as of several years ago) do NOT handle indels correctly. For details about issues with Mutalyzer, see Table 5 in our recent paper Wang, M. et al. Hum. Mutat. (2018). doi:10.1002/humu.23615.

You might be interested in UTA (Universal Transcript Archive), which contains transcripts from NCBI, UCSC, and Ensembl, and their alignments to multiple assemblies using coordinates from the source databases (I don't realign them myself). A public instance of UTA is available at uta.biocommons.org:5432 (postgresql) and as a docker image. Start here: https://hgvs.readthedocs.io/en/stable/installation.html#local-installation-of-uta. UTA is the only comprehensive source of this information that I know of (and it's about 9 months out of date :-( ).

For this specific case, UTA shows the following exon coordinate differences:

anonymous@uta/uta=> select * from tx_exon_aln_v where tx_ac='NM_001164277.1' and alt_ac='NC_000011.9' and alt_aln_method in ('blat','splign') order by ord;
┌─────────┬────────────────┬─────────────┬────────────────┬────────────┬─────┬────────────┬──────────┬─────────────┬───────────┬────────────────────────────────────────────────────────────────
│  hgnc   │     tx_ac      │   alt_ac    │ alt_aln_method │ alt_strand │ ord │ tx_start_i │ tx_end_i │ alt_start_i │ alt_end_i │                                                                
├─────────┼────────────────┼─────────────┼────────────────┼────────────┼─────┼────────────┼──────────┼─────────────┼───────────┼────────────────────────────────────────────────────────────────
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   0 │          0 │       58 │   118901558 │ 118901616 │ 58=                                                            
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   0 │          0 │       58 │   118901558 │ 118901616 │ 58=                                                            
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   1 │         58 │      562 │   118900941 │ 118901445 │ 504=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   1 │         58 │      562 │   118900941 │ 118901445 │ 504=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   2 │        562 │      905 │   118899931 │ 118900274 │ 343=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   2 │        562 │      905 │   118899931 │ 118900274 │ 343=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   3 │        905 │     1138 │   118898903 │ 118899136 │ 233=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   3 │        905 │     1138 │   118898903 │ 118899136 │ 233=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   4 │       1138 │     1382 │   118898337 │ 118898582 │ 145=1I99=                                                      
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   4 │       1138 │     1382 │   118898436 │ 118898582 │ 145=98D1=                                                      
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   5 │       1382 │     1541 │   118897646 │ 118897805 │ 159=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   5 │       1382 │     1541 │   118898337 │ 118898435 │ 6D2=1D1X2=2X1=3D1X1=1D1X3=1X1=5D2=4D4=3D2=2D2=6D5=1I5=2I2=2D1X4
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   6 │       1541 │     1627 │   118897312 │ 118897398 │ 86=                                                            
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   6 │       1541 │     1627 │   118897646 │ 118897805 │ 3D6=2I1=2I1X1=1X3=1X1=2I5=2I1=1X2=1X2=2I1=2X1=3I3=4I3=1D2=2X2=1
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   7 │       1627 │     1741 │   118896676 │ 118896790 │ 114=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   7 │       1627 │     1741 │   118897312 │ 118897398 │ 1=2X1=2D1=2D2=3D4=2X2=3D1X1=1D3=1X3=1D2X4=2X1=2X1=1X2=2X3=1D1X4
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   8 │       1741 │     1880 │   118895900 │ 118896039 │ 139=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   8 │       1741 │     1880 │   118896676 │ 118896790 │ 1D1X1=1X3=2D2=1D2=4D2X1=1X1=2D2=2D1X2=3X1=1D5=1I1X4=1D2=2X3=2D5
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ splign         │         -1 │   9 │       1880 │     2606 │   118895060 │ 118895786 │ 726=                                                           
│ SLC37A4 │ NM_001164277.1 │ NC_000011.9 │ blat           │         -1 │   9 │       1880 │     2606 │   118895900 │ 118896039 │ 4D1=5D5=1X1=7D3=3D1=7D1=8D2=3D4=1D1=3D2=1D2=5D2=10D5=1D1=13D1=3
└─────────┴────────────────┴─────────────┴────────────────┴────────────┴─────┴────────────┴──────────┴─────────────┴───────────┴────────────────────────────────────────────────────────────────

ord is the exon ordinal number, 0-based
coordinates are interbase (_i)
Coordinates shown are those directly from NCBI and UCSC (not realigned by me).

Note that starting in exon 5 (ord 4) that the alignment starts going belly-up. Variants up to n.1283 (c.526, I think) should map consistently (1283=1138+145).

When splign and BLAT disagree on exon coordinates, splign's alignment is nearly always more parsimonious with the genomic sequence than BLAT's and therefore more likely correct. Such is the case above: after the discrepancy in exon 5 above, splign's alignments are more plausible that BLAT's as judged by the cigar strings.

P.S. VariantValidator is a web interface to the hgvs package, which provides most of the algorithmic support. hgvs uses UTA for exon structures.

@maximilianh @reece I agree with you both. I was not aware there are many cases with indel in the alignments. It has just dawned on me that it makes a lot of sense when there are polymorphism in the dna that affects the structure of the transcript. I will take a closer look at VariantValidator/hgvs/UTA. It seems there is a lot that can be learned. Unfortunately with the current implementation of TransVar, the coordinates are inferred from the gtf. Hence if there are unannotated indels (such as in the case of RefSeq), there is yet a way to capture. But (correct me if wrong) Ensembl annotation doesn't seem to be affected by this issue since they explicitly annotate the indels?

Thanks for the insight from you both!

Thanks everyone for your replies and considerations.

I haven't been aware of these kind of cases, so I think our local repository which we are making also has similar problems, and should be checked and be improved.

The gff files should work fine as a source. The cigar string appears at the end of the line only for imperfect matches, like this:

NC_000001.10 RefSeq match /.../ ...;pct_identity_ungap=99.9308;Gap=M5230 D1 M3574

Ensembl isn't affected by this issue, but it has a different challenge: Because transcripts are derived from spans in the genomic sequence, it's harder to represent transcripts in which the genomic sequence contains a sequencing error. It's been a while since I looked at this, but I'm pretty sure that transcripts in NEFL and ALMS1 had this challenge. My information here is stale, so you should consult with Ensembl folks.

Thanks again for the insight @reece. Makes a lot of sense!