correctly harvest and normalize variants from Table 2
ahwagner opened this issue · comments
Ensure that we are correctly normalizing the variants from Table 2 of the manuscript.
Each of the records listed in Table 2 should normalize to the following Allele Registry amino-acid allele (NP_004439.2:p.Ala772_Met775dup):
http://reg.clinicalgenome.org/allele/PA123579
This allele is not linked to any genomic alleles, but is associated with two ClinVar records:
http://www.ncbi.nlm.nih.gov/clinvar/?term=28914[alleleid]
http://www.ncbi.nlm.nih.gov/clinvar/?term=54152[alleleid]
From there, we can trace through to Allele Registry records:
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/by_caid?caid=CA123573
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/by_caid?caid=CA135369
These ultimately lead to the following GRCh37 hgvs_g strings:
NC_000017.10:g.37880984_37880995dup
NC_000017.10:g.37880985_37880996dup
#130 currently adds hgvs_p_suffix to features, but adds only the 3-letter amino acid codes and does not unique them. This should be revised to include both 3-letter and 1-letter amino acid suffixes and remove duplicate entries.
> does not unique them.
I see where this can be corrected. https://github.com/ohsu-comp-bio/g2p-aggregator/blob/issue_129/harvester/location_normalizer.py#L353
becomes:
feature['hgvs_g_suffix'] = list(set(hgvs_g))
feature['hgvs_p_suffix'] = list(set(hgvs_p))
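A minimal sketch of the uniquing step, assuming the collected suffixes are plain strings that may repeat. The helper name unique_suffixes and the sample input are illustrative only; sorted() is used instead of list() purely so the result has a stable order.

```python
def unique_suffixes(suffixes):
    """Return the distinct suffixes in a deterministic (sorted) order."""
    return sorted(set(suffixes))


# Hypothetical input: the same genomic suffix harvested from two synonyms.
hgvs_g = ["g.37883691C>G", "g.39727438C>G", "g.37883691C>G"]

feature = {}
feature["hgvs_g_suffix"] = unique_suffixes(hgvs_g)
print(feature["hgvs_g_suffix"])
```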
> both 3-letter and 1-letter amino acid
I see data with the 3-letter form, e.g.
"XP_011532688.1:p.Asn458="
Is there an example of the one letter form?
ref
http://varnomen.hgvs.org/recommendations/protein/variant/substitution/
edited:
# find 3 char proteins
GET associations/_search
{
  "query": {
    "regexp": {
      "features.synonyms": {
        "value": ".*:p\\.[A-z][A-z][A-z][0-9]*.*"
      }
    }
  }
}
# 21779 hits
I'm not seeing any p.[A-Z][0-9]*[A-Z] in synonyms.
I do see aa_change of that form in http://myvariant.info/v1/query?q=KIT%20p.S501_A502dup that we are currently not collecting
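To make the two forms being searched for concrete, here is a rough sketch that classifies a synonym string by the shape of its protein suffix. The patterns and the classify helper are assumptions for illustration, not the project's code, and the "EXAMPLE:" accession is a placeholder. Note that a character class written as [A-z] also matches the punctuation between "Z" and "a" in ASCII, so the sketch uses explicit [A-Z] and [a-z] ranges.

```python
import re

# Three-letter form: a residue code such as "Asn" followed by a position.
THREE_LETTER = re.compile(r":p\.[A-Z][a-z]{2}\d+")
# One-letter form: a single uppercase residue followed directly by a position.
ONE_LETTER = re.compile(r":p\.[A-Z]\d+")


def classify(synonym):
    """Label a synonym as 'three-letter', 'one-letter', or 'other'."""
    if THREE_LETTER.search(synonym):
        return "three-letter"
    if ONE_LETTER.search(synonym):
        return "one-letter"
    return "other"


for s in ["XP_011532688.1:p.Asn458=", "EXAMPLE:p.S501_A502dup", "NC_000017.10:g.37883691C>G"]:
    print(s, "->", classify(s))
```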
👍 on uniquing solution.
HGVS recommends one- or three-letter forms with a preference for three-letter (which we currently collect). As a consequence of this preference, their examples are all three-letter codes. However, physicians (and publications) often refer to these strings in one-letter codes (e.g. BRAF V600E, as opposed to Val600Glu). We will need to translate from the aggregated 3-letter form back to 1-letter form as a secondary annotation, which is fairly straightforward (we already use BioPython, which has a built-in function for this). By keeping both one- and three-letter codes in the feature records, searches for one or the other will populate the result table correctly.
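The translation described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the aggregator's implementation: the helper names to_one_letter and both_forms are hypothetical, the lookup table covers only the 20 standard residues plus Ter, and which BioPython function the comment has in mind is not stated (Bio.SeqUtils.seq1 is one candidate).

```python
import re

# Standard three-letter to one-letter amino acid codes (plus Ter for stop).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}

# Case-sensitive, so lowercase HGVS keywords (dup, del, ins, fs) are untouched.
_CODE = re.compile("|".join(THREE_TO_ONE))


def to_one_letter(hgvs_p_suffix):
    """Replace every three-letter residue code in a p. suffix with one letter."""
    return _CODE.sub(lambda m: THREE_TO_ONE[m.group(0)], hgvs_p_suffix)


def both_forms(hgvs_p_suffix):
    """Return the distinct three- and one-letter spellings of a suffix."""
    return sorted({hgvs_p_suffix, to_one_letter(hgvs_p_suffix)})


print(both_forms("p.Val600Glu"))
```

Storing the result of both_forms in the feature record is what would let a search for either spelling hit the same record; suffixes with no residue codes, such as p.=, collapse to a single entry.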
Thank you.
Given:
synonyms = [
"ENST00000584601.5:c.3213C>G",
"ENST00000406381.6:c.3213C>G",
"NG_007503.1:g.44299C>G",
"NM_001289936.1:c.3258C>G",
"LRG_724t1:c.3213C>G",
"NR_110535.1:n.3627C>G",
"ENSP00000463427.1:p.=",
"CM000679.2:g.39727438C>G",
"NP_004439.2:p.Leu1101=",
"NM_004448.3:c.3303C>G",
"ENST00000584450.5:c.3160-251C>G",
"ENSP00000463714.1:p.=",
"ENSP00000446466.1:p.Leu1086=",
"ENST00000269571.9:c.3303C>G",
"NC_000017.11:g.39727438C>G",
"CM000679.1:g.37883691C>G",
"ENST00000445658.6:c.2475C>G",
"NC_000017.9:g.35137217C>G",
"NM_001289937.1:c.3160-251C>G",
"NP_001005862.1:p.Leu1071=",
"NP_001276865.1:p.Leu1086=",
"ENST00000541774.5:n.3258C>G",
"NM_001005862.2:c.3213C>G",
"LRG_724t2:c.3303C>G",
"ENSP00000385185.2:p.Leu1071=",
"NC_000017.10:g.37883691C>G",
"LRG_724t4:c.3258C>G",
"NP_001276866.1:p.=",
"chr17:g.37883691C>G",
"ENSP00000269571.4:p.Leu1101=",
"ENSP00000462438.1:p.Leu1071=",
"LRG_724:g.44299C>G",
"ENST00000578373.5:c.*3093C>G",
"ENSP00000404047.2:p.Leu825="
]
Does the following make sense?
import hgvs.parser

# Assumption: hp is a parser from the biocommons hgvs package.
hp = hgvs.parser.Parser()

hgvs_g = set()
hgvs_p = set()
for synonym in synonyms:
    hgvs_variant = hp.parse_hgvs_variant(synonym)
    if hgvs_variant.type == 'p':
        # collect both the 3-letter (default) and 1-letter protein suffixes
        hgvs_p.add(hgvs_variant.format().split(':')[1])
        hgvs_p.add(hgvs_variant.format(conf={"p_3_letter": False}).split(':')[1])
    if hgvs_variant.type == 'g':
        hgvs_g.add(hgvs_variant.format().split(':')[1])
print(list(hgvs_g))
print(list(hgvs_p))
>>>
[u'g.37883691C>G', u'g.39727438C>G', u'g.35137217C>G', u'g.44299C>G']
[u'p.=', u'p.Leu825=', u'p.L1086=', u'p.L1071=', u'p.L1101=', u'p.Leu1101=', u'p.Leu1071=', u'p.Leu1086=', u'p.L825=']
@ahwagner - I've finished re-harvesting and normalizing hgvs_g, hgvs_p. Before I deploy to staging, any comments on the snippet above?
@bwalsh, thanks for the tag. I must have missed the above comment in my email; I'll review within the hour and respond here.
@bwalsh, this looks great to me. I would do a sanity check on the dataset before pushing to staging, e.g. search for p.V600E, p.Y772_A775dup. We should see many records (>50) of the first, and at least a few (>5) of the second.
Okay. The latter is a critical point for the review; let's deploy what we have to staging, and I'll do some more refinement and testing on my own to finish up the PR.
Thanks @bwalsh.
I downloaded all.json from https://s3-us-west-2.amazonaws.com/g2p-0.8 and tried to import it into my ES instance with the command: curl -XPOST "localhost:9200/associations/_bulk?pretty&refresh" --data-binary "@all.json". It does not work: searches still return nothing afterward. Could you give me some hints about what I should do? Thank you.
@igodfinger - thanks so much for your interest.
- The site is now available at https://search.cancervariants.org.
- The data is available at https://s3-us-west-2.amazonaws.com/g2p-0.12/index.html (arn:aws:s3:::g2p-0.12)
- To upload the data into an elastic search instance, see https://github.com/ohsu-comp-bio/g2p-aggregator/blob/v0.12/util/elastic/put_index.py. The commands necessary to create the index would be:
# after cloning the g2p-aggregator repo
cd util/elastic
./index-setup.sh
cat all.json | python put_index.py --index associations
Let us know how it goes!
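For context on the earlier curl attempt: the Elasticsearch _bulk endpoint expects each document to be preceded by an action line, so posting a file of bare documents typically indexes nothing. The sketch below shows that pairing, assuming all.json holds one JSON document per line (not confirmed in this thread); the helper names are hypothetical, and depending on the Elasticsearch version a document type may also be required. It is not a substitute for put_index.py.

```python
import json


def to_bulk_lines(doc_lines, index="associations"):
    """Yield bulk-API lines: an action line, then the document source."""
    for line in doc_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)  # fail early on malformed input
        yield json.dumps({"index": {"_index": index}})
        yield json.dumps(doc)


def to_bulk_body(doc_lines, index="associations"):
    """Join the lines into a request body; the bulk API needs a final newline."""
    return "".join(l + "\n" for l in to_bulk_lines(doc_lines, index))


# Hypothetical two-document input.
body = to_bulk_body(['{"a": 1}', "", '{"b": 2}'])
print(body, end="")
```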
It works well, thank you very much. However, some search results differ from the ones on the website https://search.cancervariants.org, for example when searching for BRCA1. Thank you again.