ohsu-comp-bio / g2p-aggregator

Associations of genomic features, drugs and diseases

Variant normalization: e.g. Assembly identifier (GRC notation, e.g. `GRCh37`)

bwalsh opened this issue · comments

I've been looking into adding a simple GA4GH beacon.
As part of that, I found a new area that needs harmonization: the reference genome.

References:
https://genome.ucsc.edu/FAQ/FAQreleases.html#release1

Issue: What standard harmonization do we need to apply?

@jgoecks @ahwagner : thoughts?

@bwalsh @ahwagner Two suggestions:

  1. Standardize on GRC naming conventions, namely GRCh37 and GRCh38
  2. For all variant coordinates (start, stop), provide them in GRCh37. Time/priority permitting, add coordinates for GRCh38.
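As a rough illustration of suggestion 1, the sketch below maps common assembly labels onto the two GRC names. The alias table and the function name `normalize_assembly` are assumptions made for this example, not code from g2p-aggregator, and treating UCSC names such as hg19 as interchangeable with GRCh37 is a simplification.

```python
# Hypothetical alias table: lower-cased labels seen in the wild -> GRC name.
# Treating hg19/hg38 as equivalent to GRCh37/GRCh38 is a simplification.
_ASSEMBLY_ALIASES = {
    "grch37": "GRCh37",
    "hg19": "GRCh37",
    "build37": "GRCh37",
    "37": "GRCh37",
    "grch38": "GRCh38",
    "hg38": "GRCh38",
    "build38": "GRCh38",
    "38": "GRCh38",
}


def normalize_assembly(label):
    """Return 'GRCh37' or 'GRCh38' for a known label, else raise ValueError."""
    key = str(label).strip().lower()
    if key not in _ASSEMBLY_ALIASES:
        raise ValueError("unrecognized assembly: %r" % (label,))
    return _ASSEMBLY_ALIASES[key]
```

Raising on unknown labels, rather than guessing, keeps unharmonized records visible instead of silently assigning them to a build.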

This is actually a big deal because GENIE called variants against GRCh37, so GRCh38 coordinates will either fail to match or yield spurious matches.

Yeah, I agree with @jgoecks re: GRC naming conventions. For queries to the web app / API, we could either:

  • infer the reference from submitted variants (hard)
  • enforce selection of a reference at query (easy, but requires additional work from users)

...I prefer the easy route, as we'll have other tough tasks ahead of us.
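The "easy route" could look roughly like the check below: a query is rejected unless it names a supported reference. This is only a sketch. The function name `validate_query` and the use of `referenceName` as a query parameter are assumptions for illustration, not the actual API.

```python
SUPPORTED_REFERENCES = ("GRCh37", "GRCh38")


def validate_query(params):
    """Return a copy of params if a supported reference is named, else raise.

    `params` is a plain dict of query parameters; `referenceName` is a
    hypothetical parameter name chosen for this sketch.
    """
    reference = params.get("referenceName")
    if reference is None:
        raise ValueError(
            "query must specify referenceName, one of: "
            + ", ".join(SUPPORTED_REFERENCES)
        )
    if reference not in SUPPORTED_REFERENCES:
        raise ValueError("unsupported referenceName: %r" % (reference,))
    return dict(params)
```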

Should we consider using the Allele Registry to normalize incoming variants?

Then use that to represent variants in various ways (build37 vs build38, genome level vs transcript level, RefSeq transcripts vs Ensembl vs LRG transcripts, etc.)?

http://reg.clinicalgenome.org/

Here is one example allele retrieved from the registry three ways (by CA ID, by GRCh37 HGVS, and by GRCh38 HGVS):
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/by_caid?caid=CA000301
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/allele?hgvs=NC_000017.10:g.7578212G%3EA
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/allele?hgvs=NC_000017.11:g.7674894G%3EA
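The HGVS lookups above percent-encode the `>` in the expression. A minimal sketch of building such a URL follows; the helper name `allele_lookup_url` is made up for this example, and the endpoint is simply the one shown in the links above, not a documented API contract.

```python
from urllib.parse import quote

# Endpoint copied from the example links in this thread.
ALLELE_URL = (
    "http://reg.clinicalgenome.org/redmine/projects/registry/"
    "genboree_registry/allele"
)


def allele_lookup_url(hgvs):
    """Build a lookup URL for an HGVS expression, encoding '>' as %3E."""
    # Keep ':' literal so the result matches the form used in the thread.
    return ALLELE_URL + "?hgvs=" + quote(hgvs, safe=":")
```

The same helper covers both builds, since only the accession version and position change between `NC_000017.10` (GRCh37) and `NC_000017.11` (GRCh38).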

commented

@jgoecks @ahwagner : refactored results

Note: cgi is missing; I'm following up with David T.

image

commented

Added cgi.
Switched brca to GRCh37.

image

commented

Here is the status for v0.7 (data freeze):

  • All sources are enriched with feature information based on a constructed HGVS genomic (g.) location.
  • Hit rates per source are documented below.

The enriched data is distilled into two fields, feature.links and feature.synonyms.

For example, for the civic variant EML4-ALK C1156Y, the links and synonyms attributes are sourced from allele_registry.

Of course, there is more information that could be merged. I propose waiting for specific use cases.

{
  "entrez_id": 238,
  "end": 29445258,
  "name": "EML4-ALK C1156Y",
  "links": [
    "http://myvariant.info/v1/variant/chr2:g.29222392C>T?assembly=hg38",
    "http://reg.genome.network/refseq/RS000050",
    "http://www.ncbi.nlm.nih.gov/clinvar/?term=363219[alleleid]",
    "http://reg.genome.network/refseq/RS000026",
    "http://reg.genome.network/allele/CA16602783",
    "http://reg.genome.network/refseq/RS000002",
    "http://www.ncbi.nlm.nih.gov/clinvar/variation/376340",
    "http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=99136",
    "http://myvariant.info/v1/variant/chr2:g.29445258C>T?assembly=hg19",
    "http://reg.genome.network/refseq/RS001597"
  ],
  "start": 29445258,
  "synonyms": [
    "NC_000002.10:g.29298762C>T",
    "chr2:g.29445258C>T",
    "LRG_488:g.704175G>A",
    "NC_000002.12:g.29222392C>T",
    "NG_009445.1:g.704175G>A",
    "CM000664.2:g.29222392C>T",
    "chr2:g.29222392C>T",
    "NC_000002.11:g.29445258C>T",
    "CM000664.1:g.29445258C>T",
    "COSM99136"
  ],
  "biomarker_type": "snp",
  "referenceName": "GRCh37",
  "geneSymbol": "ALK",
  "alt": "T",
  "ref": "C",
  "chromosome": "2"
}
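To make "constructed HGVS genomic location" concrete, here is a minimal sketch that builds a g. expression from the fields of a record like the one above. It handles single-base substitutions only, and the accession table lists just the chromosome 2 and 17 accessions that appear in this thread. The function name `hgvs_genomic` is an assumption, not the aggregator's actual code.

```python
# RefSeq chromosome accessions that appear in this thread; a real table
# would need an entry for every chromosome and build.
REFSEQ_ACCESSIONS = {
    ("GRCh37", "2"): "NC_000002.11",
    ("GRCh38", "2"): "NC_000002.12",
    ("GRCh37", "17"): "NC_000017.10",
    ("GRCh38", "17"): "NC_000017.11",
}


def hgvs_genomic(feature):
    """Build an HGVS g. string for a single-base substitution."""
    ref, alt = feature["ref"], feature["alt"]
    if len(ref) != 1 or len(alt) != 1 or feature["start"] != feature["end"]:
        raise ValueError("only single-base substitutions are handled here")
    key = (feature["referenceName"], str(feature["chromosome"]))
    accession = REFSEQ_ACCESSIONS[key]  # KeyError for builds/chromosomes not listed
    return "%s:g.%d%s>%s" % (accession, feature["start"], ref, alt)
```

Applied to the EML4-ALK C1156Y record, this yields `NC_000002.11:g.29445258C>T`, which is one of the listed synonyms.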

--//--

allele registry hit rates

| source | Total | filters | Count |
|---|---|---|---|
| molecularmatch_trials | 199069 | all features w/ location | 77765 |
| molecularmatch_trials | | allele_registry (hit) | 76368 |
| molecularmatch_trials | | allele_registry (miss) | 1397 |
| jax | 5754 | all features w/ location | 4894 |
| jax | | allele_registry (hit) | 4893 |
| jax | | allele_registry (miss) | 1 |
| brca | 5717 | all features w/ location | 5717 |
| brca | | allele_registry (hit) | 5055 |
| brca | | allele_registry (miss) | 662 |
| oncokb | 4048 | all features w/ location | 1796 |
| oncokb | | allele_registry (hit) | 1792 |
| oncokb | | allele_registry (miss) | 4 |
| civic | 3497 | all features w/ location | 3497 |
| civic | | allele_registry (hit) | 1704 |
| civic | | allele_registry (miss) | 1793 |
| molecularmatch | 2079 | all features w/ location | 1665 |
| molecularmatch | | allele_registry (hit) | 1421 |
| molecularmatch | | allele_registry (miss) | 244 |
| cgi | 1432 | all features w/ location | 1432 |
| cgi | | allele_registry (hit) | 589 |
| cgi | | allele_registry (miss) | 843 |
| jax_trials | 1173 | all features w/ location | 1094 |
| jax_trials | | allele_registry (hit) | 1094 |
| jax_trials | | allele_registry (miss) | 0 |
| pmkb | 609 | all features w/ location | 609 |
| pmkb | | allele_registry (hit) | 0 |
| pmkb | | allele_registry (miss) | 609 |
| sage | 69 | all features w/ location | 0 |
| sage | | allele_registry (hit) | 0 |
| sage | | allele_registry (miss) | 0 |
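
The counts above can be turned into per-source hit rates with a few lines. This is only a convenience sketch: the `(hit, miss)` pairs are copied from the table, and the helper name `hit_rate` is made up for this example.

```python
# (hit, miss) counts per source, copied from the table above.
COUNTS = {
    "molecularmatch_trials": (76368, 1397),
    "jax": (4893, 1),
    "brca": (5055, 662),
    "oncokb": (1792, 4),
    "civic": (1704, 1793),
    "molecularmatch": (1421, 244),
    "cgi": (589, 843),
    "jax_trials": (1094, 0),
    "pmkb": (0, 609),
    "sage": (0, 0),
}


def hit_rate(hits, misses):
    """Fraction of located features found in the registry, or None if there are none."""
    total = hits + misses
    return None if total == 0 else hits / total


if __name__ == "__main__":
    for source, (hits, misses) in COUNTS.items():
        rate = hit_rate(hits, misses)
        shown = "n/a" if rate is None else "%.1f%%" % (100 * rate)
        print("%-22s %s" % (source, shown))
```

Returning `None` for sage avoids a division by zero, since that source has no features with a location.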