Variant normalization: e.g. Assembly identifier (GRC notation, e.g. `GRCh37`)
bwalsh opened this issue
I've been looking into adding a simple GA4GH beacon.
As part of that, I discovered a new area for harmonization: the reference genome.
From ga4gh beacon:
https://github.com/ga4gh/beacon-team/blob/develop/src/main/resources/avro/beacon.avdl#L76
References:
https://genome.ucsc.edu/FAQ/FAQreleases.html#release1
Issue: What standard harmonization do we need to apply?
@bwalsh @ahwagner Two suggestions:
- Standardize on GRC naming conventions, namely GRCh37 and GRCh38
- For all variant coordinates (start, stop), provide them in GRCh37. Time/priority permitting, add coordinates for GRCh38.
This is actually a big deal because GENIE called variants against GRCh37, so GRCh38 variants won't match and/or will yield spurious matches.
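A minimal sketch of the naming standardization suggested above, assuming sources label assemblies with a mix of GRC and UCSC names. The alias table and the `normalize_assembly` helper are hypothetical and not existing project code. It treats `hg19` as `GRCh37` and `hg38` as `GRCh38`, which is the usual correspondence for the primary chromosomes.

```python
# Hypothetical alias table: labels that may appear in source data,
# mapped to the GRC name we would standardize on.
_ASSEMBLY_ALIASES = {
    "grch37": "GRCh37",
    "hg19": "GRCh37",
    "grch38": "GRCh38",
    "hg38": "GRCh38",
}


def normalize_assembly(label: str) -> str:
    """Return the GRC name for an assembly label, or raise ValueError."""
    key = label.strip().lower()
    try:
        return _ASSEMBLY_ALIASES[key]
    except KeyError:
        # Fail loudly rather than guess: a wrong assembly yields wrong matches.
        raise ValueError(f"unrecognized assembly label: {label!r}") from None
```

Rejecting unknown labels instead of defaulting to GRCh37 is a deliberate choice here, since a silent default is exactly what would produce the spurious matches described above.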
Yeah, I agree with @jgoecks re: GRC naming conventions. For queries to the web app / API, we may want to consider either:
- infer the reference from submitted variants (hard)
- enforce selection of a reference at query (easy, but requires additional work from users)
...I prefer the easy route, as we'll have other tough tasks ahead of us.
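The "easy" option could look roughly like this: the query carries an explicit assembly, and anything other than the two GRC names is rejected up front. `VariantQuery` and its field names are hypothetical, chosen only for illustration; they are not the web app's actual API.

```python
from dataclasses import dataclass

SUPPORTED_ASSEMBLIES = ("GRCh37", "GRCh38")


@dataclass(frozen=True)
class VariantQuery:
    """Hypothetical query object that requires an explicit reference."""

    assembly: str
    chromosome: str
    start: int
    ref: str
    alt: str

    def __post_init__(self):
        # No inference: the caller must state which build the coordinates use.
        if self.assembly not in SUPPORTED_ASSEMBLIES:
            raise ValueError(
                f"assembly must be one of {SUPPORTED_ASSEMBLIES}, "
                f"got {self.assembly!r}"
            )
```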
Should we consider using the Allele Registry to normalize incoming variants?
Then use that to represent variants in various ways (build37 vs build38, genome level vs transcript level, RefSeq transcripts vs Ensembl vs LRG transcripts, etc.)?
http://reg.clinicalgenome.org/
Here is an example variant allele returned by various approaches from the registry:
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/by_caid?caid=CA000301
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/allele?hgvs=NC_000017.10:g.7578212G%3EA
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/allele?hgvs=NC_000017.11:g.7674894G%3EA
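The two HGVS lookups above follow one URL pattern, which can be built from an HGVS string by percent-encoding the `>` character. This sketch only constructs the URL and does not call the service. `allele_registry_url` is a hypothetical helper name, and the endpoint is copied from the example links rather than from registry documentation.

```python
from urllib.parse import quote

# Endpoint as it appears in the example links in this thread.
REGISTRY_ALLELE_ENDPOINT = (
    "http://reg.clinicalgenome.org/redmine/projects/registry/"
    "genboree_registry/allele"
)


def allele_registry_url(hgvs: str) -> str:
    """Build an allele lookup URL for a genomic HGVS expression."""
    # Keep ':' literal to match the example URLs; '>' is encoded as %3E.
    return f"{REGISTRY_ALLELE_ENDPOINT}?hgvs={quote(hgvs, safe=':')}"
```

Calling `allele_registry_url("NC_000017.10:g.7578212G>A")` reproduces the GRCh37 link shown above.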
Here is the status for v0.7 (data freeze)
- all sources are enriched with feature information based on a constructed HGVS genomic (`g.`) location
- hit rates per source are documented below
The enriched data is distilled into two fields, `feature.links` and `feature.synonyms`.
For example, for the CIViC variant EML4-ALK C1156Y, the `links` and `synonyms` attributes are sourced from allele_registry.
Of course, there is more information that could be merged. Propose waiting for specific use cases.
{
"entrez_id": 238,
"end": 29445258,
"name": "EML4-ALK C1156Y",
"links": [
"http://myvariant.info/v1/variant/chr2:g.29222392C>T?assembly=hg38",
"http://reg.genome.network/refseq/RS000050",
"http://www.ncbi.nlm.nih.gov/clinvar/?term=363219[alleleid]",
"http://reg.genome.network/refseq/RS000026",
"http://reg.genome.network/allele/CA16602783",
"http://reg.genome.network/refseq/RS000002",
"http://www.ncbi.nlm.nih.gov/clinvar/variation/376340",
"http://cancer.sanger.ac.uk/cosmic/mutation/overview?id=99136",
"http://myvariant.info/v1/variant/chr2:g.29445258C>T?assembly=hg19",
"http://reg.genome.network/refseq/RS001597"
],
"start": 29445258,
"synonyms": [
"NC_000002.10:g.29298762C>T",
"chr2:g.29445258C>T",
"LRG_488:g.704175G>A",
"NC_000002.12:g.29222392C>T",
"NG_009445.1:g.704175G>A",
"CM000664.2:g.29222392C>T",
"chr2:g.29222392C>T",
"NC_000002.11:g.29445258C>T",
"CM000664.1:g.29445258C>T",
"COSM99136"
],
"biomarker_type": "snp",
"referenceName": "GRCh37",
"geneSymbol": "ALK",
"alt": "T",
"ref": "C",
"chromosome": "2"
}
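To illustrate what "a constructed hgvs genomic location" might mean for a feature like the one above, here is a rough sketch that builds a `g.` expression for a single-base substitution. It is not the pipeline's actual implementation. `hgvs_g_for_snv` is a hypothetical helper, and the accession table contains only the RefSeq identifiers quoted in this thread (chromosomes 2 and 17). Indels and other variant types need different HGVS syntax and are not handled.

```python
# RefSeq chromosome accessions, limited to those quoted in this thread.
REFSEQ_ACCESSIONS = {
    ("GRCh37", "2"): "NC_000002.11",
    ("GRCh38", "2"): "NC_000002.12",
    ("GRCh37", "17"): "NC_000017.10",
    ("GRCh38", "17"): "NC_000017.11",
}


def hgvs_g_for_snv(feature: dict) -> str:
    """Build an HGVS g. string for a single-nucleotide substitution."""
    key = (feature["referenceName"], feature["chromosome"])
    accession = REFSEQ_ACCESSIONS.get(key)
    if accession is None:
        raise ValueError(f"no accession known for {key}")
    if len(feature["ref"]) != 1 or len(feature["alt"]) != 1:
        raise ValueError("only single-base substitutions are supported")
    return f"{accession}:g.{feature['start']}{feature['ref']}>{feature['alt']}"


# Fields copied from the EML4-ALK C1156Y example above.
example = {
    "referenceName": "GRCh37",
    "chromosome": "2",
    "start": 29445258,
    "ref": "C",
    "alt": "T",
}
print(hgvs_g_for_snv(example))
```

For the example feature this yields `NC_000002.11:g.29445258C>T`, which is one of the entries in its `synonyms` list.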
--//--
allele registry hit rates
source | Total | filters | Count |
---|---|---|---|
molecularmatch_trials | 199069 | all features w/ location | 77765 |
molecularmatch_trials | | allele_registry (hit) | 76368 |
molecularmatch_trials | | allele_registry (miss) | 1397 |
jax | 5754 | all features w/ location | 4894 |
jax | | allele_registry (hit) | 4893 |
jax | | allele_registry (miss) | 1 |
brca | 5717 | all features w/ location | 5717 |
brca | | allele_registry (hit) | 5055 |
brca | | allele_registry (miss) | 662 |
oncokb | 4048 | all features w/ location | 1796 |
oncokb | | allele_registry (hit) | 1792 |
oncokb | | allele_registry (miss) | 4 |
civic | 3497 | all features w/ location | 3497 |
civic | | allele_registry (hit) | 1704 |
civic | | allele_registry (miss) | 1793 |
molecularmatch | 2079 | all features w/ location | 1665 |
molecularmatch | | allele_registry (hit) | 1421 |
molecularmatch | | allele_registry (miss) | 244 |
cgi | 1432 | all features w/ location | 1432 |
cgi | | allele_registry (hit) | 589 |
cgi | | allele_registry (miss) | 843 |
jax_trials | 1173 | all features w/ location | 1094 |
jax_trials | | allele_registry (hit) | 1094 |
jax_trials | | allele_registry (miss) | 0 |
pmkb | 609 | all features w/ location | 609 |
pmkb | | allele_registry (hit) | 0 |
pmkb | | allele_registry (miss) | 609 |
sage | 69 | all features w/ location | 0 |
sage | | allele_registry (hit) | 0 |
sage | | allele_registry (miss) | 0 |
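The counts above can be turned into per-source hit rates, taking the rate as hits divided by features with a location (hit + miss). This is a small sketch with the numbers copied from the table. `hit_rate` is a hypothetical helper, and a source with no located features (sage) is reported as `n/a` rather than as 0%.

```python
# (hit, miss) counts per source, copied from the table above.
COUNTS = {
    "molecularmatch_trials": (76368, 1397),
    "jax": (4893, 1),
    "brca": (5055, 662),
    "oncokb": (1792, 4),
    "civic": (1704, 1793),
    "molecularmatch": (1421, 244),
    "cgi": (589, 843),
    "jax_trials": (1094, 0),
    "pmkb": (0, 609),
    "sage": (0, 0),
}


def hit_rate(hit: int, miss: int):
    """Fraction of located features found in the registry, or None if none located."""
    located = hit + miss
    return hit / located if located else None


for source, (hit, miss) in COUNTS.items():
    rate = hit_rate(hit, miss)
    label = "n/a" if rate is None else f"{rate:.1%}"
    print(f"{source}: {label}")
```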