ohsu-comp-bio / g2p-aggregator

Associations of genomic features, drugs and diseases

correctly harvest and normalize variants from Table 2

ahwagner opened this issue

Ensure that we are correctly normalizing the variants from Table 2 of the manuscript.

Each of the records listed in Table 2 should normalize to the following Allele Registry amino-acid allele (NP_004439.2:p.Ala772_Met775dup):
http://reg.clinicalgenome.org/allele/PA123579

This allele is not linked to any genomic alleles, but is associated with two ClinVar records:
http://www.ncbi.nlm.nih.gov/clinvar/?term=28914[alleleid]
http://www.ncbi.nlm.nih.gov/clinvar/?term=54152[alleleid]

From there, we can trace through to Allele Registry records:
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/by_caid?caid=CA123573
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/by_caid?caid=CA135369

Which ultimately leads to the following GRCh37 hgvs_g strings:
NC_000017.10:g.37880984_37880995dup
NC_000017.10:g.37880985_37880996dup

#130 currently adds hgvs_p_suffix to features, but adds the 3-letter amino acid codes and does not unique them. This should be revised to include both 3-letter and 1-letter amino acid suffixes and remove duplicate entries.
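
As a rough way to check the expectation above, the sketch below tests whether a harvested feature carries one of the two GRCh37 genomic suffixes. It assumes features keep accession-free strings in an `hgvs_g_suffix` list, mirroring the `hgvs_p_suffix` field mentioned above; `EXPECTED_G_SUFFIXES`, `suffix`, and `matches_table2_allele` are hypothetical names introduced only for illustration.

```python
# Genomic suffixes of the two GRCh37 strings listed above (accession stripped).
EXPECTED_G_SUFFIXES = {
    "g.37880984_37880995dup",
    "g.37880985_37880996dup",
}


def suffix(hgvs_string):
    """Return the part after the accession, e.g. 'g.123A>T'."""
    return hgvs_string.split(":", 1)[1]


def matches_table2_allele(feature):
    """True if any of the feature's genomic suffixes is an expected one.

    Assumes `feature` is a dict with an optional 'hgvs_g_suffix' list.
    """
    return bool(EXPECTED_G_SUFFIXES & set(feature.get("hgvs_g_suffix", [])))
```

A feature whose `hgvs_g_suffix` contains neither string would be a candidate for manual review rather than proof of a normalization error.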

commented

@ahwagner

does not unique them.

I see where this can be corrected. https://github.com/ohsu-comp-bio/g2p-aggregator/blob/issue_129/harvester/location_normalizer.py#L353
becomes :

        feature['hgvs_g_suffix'] = list(set(hgvs_g))
        feature['hgvs_p_suffix'] = list(set(hgvs_p))

both 3-letter and 1-letter amino acid

I see data with the 3 letter form e.g.

"XP_011532688.1:p.Asn458="

Is there an example of the one letter form?

ref
http://varnomen.hgvs.org/recommendations/protein/variant/substitution/

edited:

# find 3 char proteins
GET associations/_search
{
    "query": {
        "regexp":{
            "features.synonyms": {
                "value": ".*:p\\.[A-z][A-z][A-z][0-9]*.*"
            }
        }
    }
}

# 21779 hits

I'm not seeing any p.[A-Z][0-9]*[A-Z] in synonyms.
I do see aa_change values of that form in http://myvariant.info/v1/query?q=KIT%20p.S501_A502dup, which we are not currently collecting.
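
For checking a handful of strings locally, without going through Elasticsearch, a minimal sketch along these lines could tell the two protein notations apart. The patterns are Python `re` expressions written for this example, not the Lucene regexp used in the query above, and `protein_form` is a hypothetical helper name.

```python
import re

# Three-letter form: a capital plus two lowercase letters, then a position digit.
THREE_LETTER = re.compile(r"(?:^|[:\s])p\.[A-Z][a-z]{2}\d")
# One-letter form: a single capital (or * for a stop), then a position digit.
ONE_LETTER = re.compile(r"(?:^|[:\s])p\.[A-Z*]\d")


def protein_form(synonym):
    """Classify a synonym as 'three_letter', 'one_letter', or 'other'."""
    if THREE_LETTER.search(synonym):
        return "three_letter"
    if ONE_LETTER.search(synonym):
        return "one_letter"
    return "other"
```

Strings such as `p.=` carry no residue code at all, so they fall into `other` along with the `c.`, `g.`, and `n.` synonyms.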

👍 on uniquing solution.

HGVS recommends one- or three-letter forms with a preference for three-letter (which we currently collect). As a consequence of this preference, their examples are all three-letter codes. However, physicians (and publications) often refer to these strings in one-letter codes (e.g. BRAF V600E, as opposed to Val600Glu). We will need to translate from the aggregated 3-letter form back to 1-letter form as a secondary annotation, which is fairly straightforward (we already use BioPython, which has a built-in function for this). By keeping both one- and three-letter codes in the feature records, searches for one or the other will populate the result table correctly.
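
To make the translation concrete, here is a minimal sketch of turning a three-letter protein suffix into its one-letter form and keeping both. It deliberately avoids BioPython so that it stays dependency-free; the lookup table is written out by hand, and `to_one_letter` and `both_forms` are hypothetical names, not functions from the harvester.

```python
import re

# Standard three-letter to one-letter amino acid codes (Ter = stop codon).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Sec": "U", "Pyl": "O", "Ter": "*",
}

# Case-sensitive match, so lowercase tokens such as 'dup' or 'del' are untouched.
_AA3 = re.compile("|".join(THREE_TO_ONE))


def to_one_letter(hgvs_p_suffix):
    """Replace every three-letter residue code with its one-letter code."""
    return _AA3.sub(lambda m: THREE_TO_ONE[m.group(0)], hgvs_p_suffix)


def both_forms(hgvs_p_suffixes):
    """Return the de-duplicated, sorted union of three- and one-letter forms."""
    forms = set()
    for s in hgvs_p_suffixes:
        forms.add(s)
        forms.add(to_one_letter(s))
    return sorted(forms)
```

With this, `to_one_letter("p.Val600Glu")` gives `p.V600E`, and suffixes without residue codes (such as `p.=`) pass through unchanged.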

commented

Thank you.

Given:

synonyms = [
   "ENST00000584601.5:c.3213C>G",
   "ENST00000406381.6:c.3213C>G",
   "NG_007503.1:g.44299C>G",
   "NM_001289936.1:c.3258C>G",
   "LRG_724t1:c.3213C>G",
   "NR_110535.1:n.3627C>G",
   "ENSP00000463427.1:p.=",
   "CM000679.2:g.39727438C>G",
   "NP_004439.2:p.Leu1101=",
   "NM_004448.3:c.3303C>G",
   "ENST00000584450.5:c.3160-251C>G",
   "ENSP00000463714.1:p.=",
   "ENSP00000446466.1:p.Leu1086=",
   "ENST00000269571.9:c.3303C>G",
   "NC_000017.11:g.39727438C>G",
   "CM000679.1:g.37883691C>G",
   "ENST00000445658.6:c.2475C>G",
   "NC_000017.9:g.35137217C>G",
   "NM_001289937.1:c.3160-251C>G",
   "NP_001005862.1:p.Leu1071=",
   "NP_001276865.1:p.Leu1086=",
   "ENST00000541774.5:n.3258C>G",
   "NM_001005862.2:c.3213C>G",
   "LRG_724t2:c.3303C>G",
   "ENSP00000385185.2:p.Leu1071=",
   "NC_000017.10:g.37883691C>G",
   "LRG_724t4:c.3258C>G",
   "NP_001276866.1:p.=",
   "chr17:g.37883691C>G",
   "ENSP00000269571.4:p.Leu1101=",
   "ENSP00000462438.1:p.Leu1071=",
   "LRG_724:g.44299C>G",
   "ENST00000578373.5:c.*3093C>G",
   "ENSP00000404047.2:p.Leu825="
 ]

Does the following make sense?

# assumes hp is a parser from the biocommons hgvs package
import hgvs.parser

hp = hgvs.parser.Parser()

hgvs_g = set()
hgvs_p = set()
for synonym in synonyms:
    hgvs_variant = hp.parse_hgvs_variant(synonym)
    if hgvs_variant.type == 'p':
        # keep both the 3-letter (default) and 1-letter protein forms
        hgvs_p.add(hgvs_variant.format().split(':')[1])
        hgvs_p.add(hgvs_variant.format(conf={"p_3_letter": False}).split(':')[1])
    elif hgvs_variant.type == 'g':
        hgvs_g.add(hgvs_variant.format().split(':')[1])

print(list(hgvs_g))
print(list(hgvs_p))

>>> 
[u'g.37883691C>G', u'g.39727438C>G', u'g.35137217C>G', u'g.44299C>G']
[u'p.=', u'p.Leu825=', u'p.L1086=', u'p.L1071=', u'p.L1101=', u'p.Leu1101=', u'p.Leu1071=', u'p.Leu1086=', u'p.L825=']

commented

@ahwagner - I've finished re-harvesting and normalizing hgvs_g, hgvs_p. Before I deploy to staging, any comments on the snippet above?

@bwalsh, thanks for the tag. I must have missed the above comment in my email; I'll review within the hour and respond here.

@bwalsh, this looks great to me. I would do a sanity check on the dataset before pushing to staging, e.g. search for p.V600E, p.Y772_A775dup. We should see many records (>50) of the first, and at least a few (>5) of the second.
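
A sanity check like this could be sketched as a count over harvested records before they are pushed anywhere. The record layout is an assumption: each association is taken to be a dict with a `features` list whose entries carry an `hgvs_p_suffix` list. `count_with_suffix`, `sanity_check`, and `MINIMUM_HITS` are hypothetical names; the thresholds are the ones suggested above.

```python
def count_with_suffix(associations, suffix):
    """Count associations with at least one feature listing `suffix`."""
    return sum(
        any(suffix in f.get("hgvs_p_suffix", []) for f in a.get("features", []))
        for a in associations
    )


# 'More than N records' thresholds taken from the comment above.
MINIMUM_HITS = {"p.V600E": 50, "p.Y772_A775dup": 5}


def sanity_check(associations):
    """Map each suffix to (count, passed), where passed means count > threshold."""
    result = {}
    for suffix, minimum in MINIMUM_HITS.items():
        n = count_with_suffix(associations, suffix)
        result[suffix] = (n, n > minimum)
    return result
```

The same counts could instead be taken from the search index; the in-memory version is only meant to show the shape of the check.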

commented

Okay. The latter is a critical point for the review; let's deploy what we have to staging, and I'll do some more refinement and testing on my own to finish up the PR.

Thanks @bwalsh.

I downloaded all.json from https://s3-us-west-2.amazonaws.com/g2p-0.8 and tried to import it into my ES with the command: curl -XPOST "localhost:9200/associations/_bulk?pretty&refresh" --data-binary "@all.json". However, searches still return nothing after that. Could you give me some hints about what I should do? Thank you.

commented

@igodfinger - thanks so much for your interest.

# after cloning the g2p-aggregator repo
cd util/elastic
./index-setup.sh
cat all.json | python put_index.py --index associations

Let us know how it goes!
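
For background on why posting the raw file to `_bulk` may index nothing: that endpoint expects newline-delimited JSON in which every document is preceded by an action line. The sketch below shows that pairing under the assumption (not verified against the file) that all.json holds one JSON document per line. It is not the repo's put_index.py, and `to_bulk_lines` is a hypothetical name.

```python
import json


def to_bulk_lines(doc_lines, index="associations"):
    """Yield NDJSON lines for the _bulk API: an action line, then the document.

    Assumes one JSON document per input line. Older Elasticsearch releases
    may also need a "_type" entry in the action line.
    """
    action = json.dumps({"index": {"_index": index}})
    for line in doc_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        json.loads(line)  # fail early on a malformed document
        yield action
        yield line
```

The resulting lines, joined with newlines and ending in a final newline, form the request body the bulk endpoint accepts.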

It went well, thank you very much. However, some search results differ from those on the website https://search.cancervariants.org, for example when searching for BRCA1. Thank you again.
