normalize genes

ahwagner opened this issue 7 years ago · comments

Alex H. Wagner, PhD commented 7 years ago

We should incorporate gene normalization into our source harvesting.

Per discussion in #100, we should use the HGNC gene symbol file (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/json/non_alt_loci_set.json) to add Entrez Gene ID to each gene concept. We should also add Ensembl Gene ID. This file will already be in use for the improvements specified in #100.
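
A minimal sketch of how that file could be indexed, assuming it follows the Solr-style layout used by HGNC's JSON downloads (records under `response.docs`, each with `symbol`, `entrez_id`, and `ensembl_gene_id` fields). The function name `index_by_symbol` and the inline sample are hypothetical; a real run would read the downloaded file instead.

```python
import json

# Hand-written stand-in for non_alt_loci_set.json (assumed layout: response.docs).
SAMPLE = json.loads("""
{"response": {"docs": [
  {"symbol": "ALK", "entrez_id": "238", "ensembl_gene_id": "ENSG00000171094"},
  {"symbol": "ABCA1", "entrez_id": "19", "ensembl_gene_id": "ENSG00000165029"}
]}}
""")


def index_by_symbol(hgnc):
    """Map each approved symbol to its Entrez and Ensembl identifiers."""
    index = {}
    for doc in hgnc["response"]["docs"]:
        index[doc["symbol"]] = {
            "symbol": doc["symbol"],
            # Not every record necessarily carries both IDs, so default to None.
            "entrez_id": doc.get("entrez_id"),
            "ensembl_gene_id": doc.get("ensembl_gene_id"),
        }
    return index


GENES = index_by_symbol(SAMPLE)
print(GENES["ALK"])
```

Keying on the approved symbol only is the simplest choice; aliases and previous symbols would need a separate index.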

Alex H. Wagner, PhD · Answer 1 · Thu Apr 05 2018 00:23:29 GMT+0800 (China Standard Time)

@bwalsh see https://github.com/ahwagner/g2p-aggregator/blob/v0.10/notebooks/paper_analysis/viccdb.py#L54-L96 for how I am currently handling this in my analysis.

Brian · Answer 2 · Wed Apr 11 2018 00:08:21 GMT+0800 (China Standard Time)

@ahwagner
If agreeable, I'll create a PR and refresh the AWS buckets later today with an additional field, gene_identifiers, that contains the symbol, entrez_id, and ensembl_gene_id for each gene:

    .... 
    "genes": [
      "ALK"
    ],
    "gene_identifiers": [
      {
        "symbol": "ALK",
        "entrez_id": "238",
        "ensembl_gene_id": "ENSG00000171094"
      }
    ],
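
One way such a field could be populated is sketched here. It assumes a lookup keyed by approved symbol is already available; `add_gene_identifiers` and `LOOKUP` are hypothetical names, not the harvester's actual code.

```python
def add_gene_identifiers(record, lookup):
    """Return a copy of `record` with a `gene_identifiers` list added.

    Symbols missing from `lookup` are skipped rather than guessed.
    """
    enriched = dict(record)
    enriched["gene_identifiers"] = [
        lookup[symbol] for symbol in record.get("genes", []) if symbol in lookup
    ]
    return enriched


# Hypothetical lookup keyed by approved symbol; values taken from the example above.
LOOKUP = {
    "ALK": {"symbol": "ALK", "entrez_id": "238", "ensembl_gene_id": "ENSG00000171094"},
}

record = {"genes": ["ALK"]}
enriched = add_gene_identifiers(record, LOOKUP)
print(enriched["gene_identifiers"])
```

Leaving the original `genes` list untouched keeps existing consumers working while the new field is rolled out.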

Alex H. Wagner, PhD · Answer 3 · Wed Apr 11 2018 03:09:56 GMT+0800 (China Standard Time)

Sounds good. My only remaining concern about the implementation is minor, and documented here. Ideally these changes could be incorporated in the new PR.

Brian · Answer 4 · Wed Apr 11 2018 03:47:56 GMT+0800 (China Standard Time)

Thanks. I've added the following test:

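    import gene_enricher  # assumes the harvester's gene_enricher module is importable


    def test_ambiguous():
        """'ABC1' is ambiguous: it can resolve to both 'ABCA1' and 'HEATR6'."""
        genes = gene_enricher.get_genes('ABC1')
        # Both candidate genes should be returned, each with its identifiers.
        assert genes == [
            {'symbol': u'ABCA1', 'entrez_id': u'19', 'ensembl_gene_id': u'ENSG00000165029'},
            {'symbol': u'HEATR6', 'entrez_id': u'63897', 'ensembl_gene_id': u'ENSG00000068097'}
        ]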
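
For illustration, an ambiguous lookup like this could be implemented roughly as follows. This is a sketch, not the project's `gene_enricher`: the two-argument `get_genes` here is hypothetical, the `alias_symbol` and `prev_symbol` field names are assumed from HGNC's JSON layout, and which gene lists "ABC1" as an alias versus a previous symbol is made up for the sample.

```python
from collections import defaultdict

# Hand-written sample records. Which gene lists "ABC1" as an alias versus a
# previous symbol is an assumption made for illustration only.
DOCS = [
    {"symbol": "ABCA1", "entrez_id": "19",
     "ensembl_gene_id": "ENSG00000165029", "prev_symbol": ["ABC1"]},
    {"symbol": "HEATR6", "entrez_id": "63897",
     "ensembl_gene_id": "ENSG00000068097", "alias_symbol": ["ABC1"]},
]


def build_alias_index(docs):
    """Map every approved, previous, and alias symbol to all matching genes."""
    index = defaultdict(list)
    for doc in docs:
        gene = {key: doc.get(key) for key in ("symbol", "entrez_id", "ensembl_gene_id")}
        names = [doc["symbol"]] + doc.get("prev_symbol", []) + doc.get("alias_symbol", [])
        for name in dict.fromkeys(names):  # de-duplicate while keeping order
            index[name].append(gene)
    return index


def get_genes(identifier, index):
    """Return every gene matching `identifier`, sorted by approved symbol."""
    return sorted(index.get(identifier, []), key=lambda gene: gene["symbol"])


INDEX = build_alias_index(DOCS)
matches = get_genes("ABC1", INDEX)
print([gene["symbol"] for gene in matches])
```

Returning a list for every query, even unambiguous ones, lets callers treat the one-match and many-match cases uniformly.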