ohsu-comp-bio / g2p-aggregator

Associations of genomic features, drugs and diseases

normalize genes

ahwagner opened this issue

We should incorporate gene normalization into our source harvesting.

Per discussion in #100, we should use the HGNC gene symbol file (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/json/non_alt_loci_set.json) to add the Entrez Gene ID to each gene concept. We should also add the Ensembl Gene ID. This file will already be in use for the improvements specified in #100.
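As a rough illustration of the idea, here is a minimal sketch of indexing the HGNC file by approved symbol. It assumes the JSON has a top-level `response.docs` list whose records carry `symbol`, `entrez_id`, and `ensembl_gene_id` keys; the function name `index_by_symbol` and the one-record sample are hypothetical stand-ins, not code from this repository.

```python
import json


def index_by_symbol(hgnc_json_text):
    """Map each approved HGNC symbol to its Entrez and Ensembl identifiers.

    Assumes the file layout {"response": {"docs": [...]}}.
    """
    docs = json.loads(hgnc_json_text)["response"]["docs"]
    index = {}
    for doc in docs:
        index[doc["symbol"]] = {
            "symbol": doc["symbol"],
            # Not every record is guaranteed to carry both identifiers,
            # so missing values become None instead of raising KeyError.
            "entrez_id": doc.get("entrez_id"),
            "ensembl_gene_id": doc.get("ensembl_gene_id"),
        }
    return index


# Trimmed, in-memory stand-in for non_alt_loci_set.json (illustration only).
SAMPLE = json.dumps({"response": {"docs": [
    {"symbol": "ALK", "entrez_id": "238",
     "ensembl_gene_id": "ENSG00000171094"},
]}})

index = index_by_symbol(SAMPLE)
print(index["ALK"])
```

In a harvester, the text would presumably come from a local copy of the downloaded file rather than an inline sample.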

commented

@ahwagner
If agreeable, I'll create a PR and refresh the AWS buckets later today with an additional field, gene_identifiers, that contains the symbol, entrez_id, and ensembl_gene_id for each gene:

    .... 
    "genes": [
      "ALK"
    ],
    "gene_identifiers": [
      {
        "symbol": "ALK",
        "entrez_id": "238",
        "ensembl_gene_id": "ENSG00000171094"
      }
    ],

Sounds good. My only remaining concern about the implementation is minor, and documented here. Ideally these changes could be incorporated in the new PR.

commented

Thanks. I've added the following test:

    # Assumes the harvester's gene_enricher module is importable.
    import gene_enricher


    def test_ambiguous():
        """ "ABC1" can point to both "ABCA1" and "HEATR6". """
        # An ambiguous name should return every gene it can refer to.
        genes = gene_enricher.get_genes('ABC1')
        assert genes == [
            {'symbol': u'ABCA1', 'entrez_id': u'19', 'ensembl_gene_id': u'ENSG00000165029'},
            {'symbol': u'HEATR6', 'entrez_id': u'63897', 'ensembl_gene_id': u'ENSG00000068097'}
        ]
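
For readers who want to see how a name can resolve to more than one gene, the following is a minimal sketch, not the actual gene_enricher implementation. It assumes HGNC records list alternative names under `alias_symbol` and `prev_symbol`; the helper names `build_indexes` and `lookup_genes` are hypothetical, and the placement of "ABC1" under a previous symbol for ABCA1 and an alias for HEATR6 is assumed for illustration.

```python
from collections import defaultdict


def build_indexes(docs):
    """Return (approved, aliases) lookups built from HGNC-style records."""
    approved = {}
    aliases = defaultdict(list)
    for doc in docs:
        ids = {
            "symbol": doc["symbol"],
            "entrez_id": doc.get("entrez_id"),
            "ensembl_gene_id": doc.get("ensembl_gene_id"),
        }
        approved[doc["symbol"]] = ids
        # One alternative name may be shared by several genes, so keep a list.
        for name in doc.get("alias_symbol", []) + doc.get("prev_symbol", []):
            aliases[name].append(ids)
    return approved, aliases


def lookup_genes(name, approved, aliases):
    """Prefer an exact approved symbol; otherwise return all alias matches."""
    if name in approved:
        return [approved[name]]
    return list(aliases.get(name, []))


# Illustrative records; identifiers are taken from the test above.
DOCS = [
    {"symbol": "ABCA1", "entrez_id": "19",
     "ensembl_gene_id": "ENSG00000165029", "prev_symbol": ["ABC1"]},
    {"symbol": "HEATR6", "entrez_id": "63897",
     "ensembl_gene_id": "ENSG00000068097", "alias_symbol": ["ABC1"]},
]

approved, aliases = build_indexes(DOCS)
matches = lookup_genes("ABC1", approved, aliases)
print([m["symbol"] for m in matches])
```

Returning a list in every case keeps the caller's handling uniform: an unambiguous symbol yields one entry, an ambiguous name yields several, and an unknown name yields an empty list.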