normalize genes
ahwagner opened this issue
We should incorporate gene normalization into our source harvesting.
Per discussion in #100, we should use the HGNC gene symbol file (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/json/non_alt_loci_set.json) to add Entrez Gene ID to each gene concept. We should also add Ensembl Gene ID. This file will already be in use for the improvements specified in #100.
@bwalsh see https://github.com/ahwagner/g2p-aggregator/blob/v0.10/notebooks/paper_analysis/viccdb.py#L54-L96 for how I am currently handling this in my analysis.
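For readers following along, here is a minimal sketch of how the HGNC file could be turned into a symbol lookup. It is not the code from the linked notebook. It assumes the download keeps its records under `response` -> `docs` with `symbol`, `entrez_id`, and `ensembl_gene_id` fields; the function name `build_symbol_index` and the inline sample (including the made-up `HYPOTHETICAL1` entry) are illustrative only.

```python
import json

# Tiny inline stand-in for non_alt_loci_set.json. The "response" -> "docs"
# layout and the field names are assumptions about the HGNC download.
SAMPLE = json.loads("""
{"response": {"docs": [
  {"symbol": "ALK", "entrez_id": "238", "ensembl_gene_id": "ENSG00000171094"},
  {"symbol": "HYPOTHETICAL1", "entrez_id": "0"}
]}}
""")  # HYPOTHETICAL1 is a made-up record with no Ensembl ID


def build_symbol_index(hgnc):
    """Map each approved symbol to its identifiers (missing IDs become None)."""
    index = {}
    for doc in hgnc["response"]["docs"]:
        index[doc["symbol"]] = {
            "symbol": doc["symbol"],
            "entrez_id": doc.get("entrez_id"),
            "ensembl_gene_id": doc.get("ensembl_gene_id"),
        }
    return index


index = build_symbol_index(SAMPLE)
print(index["ALK"])
```

Using `dict.get` for the two ID fields keeps the sketch from failing on records that lack one of them; whether such records should be kept or dropped is a design choice left open here.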
@ahwagner
If agreeable, I'll create a PR and refresh the AWS buckets later today with an additional field, gene_identifiers, that contains the symbol, entrez_id, and ensembl_gene_id for each gene:
....
"genes": [
    "ALK"
],
"gene_identifiers": [
    {
        "symbol": "ALK",
        "entrez_id": "238",
        "ensembl_gene_id": "ENSG00000171094"
    }
],
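As a rough illustration of the proposed field, the sketch below attaches `gene_identifiers` to a record that already has a `genes` list. The helper name `add_gene_identifiers` and the one-entry `INDEX` are hypothetical, and silently skipping symbols that are not in the index is an assumption, not necessarily what the harvester does.

```python
# Hypothetical one-entry lookup; values taken from the ALK example in this thread.
INDEX = {
    "ALK": {
        "symbol": "ALK",
        "entrez_id": "238",
        "ensembl_gene_id": "ENSG00000171094",
    }
}


def add_gene_identifiers(record, index):
    """Return a copy of record with a gene_identifiers list added.

    Symbols missing from the index are skipped (an assumption of this sketch).
    """
    enriched = dict(record)
    enriched["gene_identifiers"] = [
        index[symbol] for symbol in record.get("genes", []) if symbol in index
    ]
    return enriched


record = {"genes": ["ALK"]}
enriched = add_gene_identifiers(record, INDEX)
print(enriched)
```

Keeping the original `genes` list untouched and adding a parallel field means existing consumers of `genes` keep working.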
Sounds good. My only remaining concern about the implementation is minor, and documented here. Ideally these changes could be incorporated in the new PR.
Thanks. I've added the following test:
import gene_enricher  # module under test; import path assumed


def test_ambiguous():
    """ "ABC1" can point to both "ABCA1" and "HEATR6". """
    genes = gene_enricher.get_genes('ABC1')
    assert genes == [
        {'symbol': u'ABCA1', 'entrez_id': u'19', 'ensembl_gene_id': u'ENSG00000165029'},
        {'symbol': u'HEATR6', 'entrez_id': u'63897', 'ensembl_gene_id': u'ENSG00000068097'}
    ]
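To show why a single name can come back with two genes, here is a small sketch of one way such a lookup might work. It is not the `gene_enricher` implementation. It assumes HGNC records list alternative names under `alias_symbol` and `prev_symbol`; which of those two fields carries "ABC1" for each gene is illustrative, and `build_name_index` and `lookup_genes` are hypothetical names.

```python
from collections import defaultdict

# Illustrative records; the identifiers come from the test above, while the
# placement of "ABC1" in prev_symbol vs alias_symbol is an assumption.
DOCS = [
    {"symbol": "ABCA1", "entrez_id": "19",
     "ensembl_gene_id": "ENSG00000165029", "prev_symbol": ["ABC1"]},
    {"symbol": "HEATR6", "entrez_id": "63897",
     "ensembl_gene_id": "ENSG00000068097", "alias_symbol": ["ABC1"]},
]


def build_name_index(docs):
    """Index every approved, alias, and previous symbol to its gene records."""
    by_name = defaultdict(list)
    for doc in docs:
        ids = {
            "symbol": doc["symbol"],
            "entrez_id": doc.get("entrez_id"),
            "ensembl_gene_id": doc.get("ensembl_gene_id"),
        }
        names = [doc["symbol"]] + doc.get("alias_symbol", []) + doc.get("prev_symbol", [])
        # dict.fromkeys drops duplicate names while preserving order
        for name in dict.fromkeys(names):
            by_name[name].append(ids)
    return by_name


def lookup_genes(name, by_name):
    """Return all gene records matching a name; an empty list if none match."""
    return by_name.get(name, [])


name_index = build_name_index(DOCS)
print([g["symbol"] for g in lookup_genes("ABC1", name_index)])
```

Returning a list rather than a single record leaves the decision about ambiguous names to the caller.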