ohsu-comp-bio / g2p-aggregator

Associations of genomic features, drugs and diseases

DO issues for CGI mapping

ahwagner opened this issue

I've given disease matching in CGI a very thorough review. 100% of CGI interpretations have assigned, valid DOIDs. Unfortunately, this has minimal impact on the overall reduction in matches once disease is taken into account. Ultimately, the reduction in matches for CGI boils down to 2 primary factors:

  1. The difference in specificity between patient and interpretation is too great (e.g. interpretations with the disease attribute of "cancer"). This accounts for a reduction of ~21% of exact positional matches. There's not much to be done here, but the difference is completely erased when allowing for broader disease matching (this is apparent in the revised figure, which I'll be putting up after some polish).
  2. Terms fail to match because of differences in terminology and ontology structure. This accounts for a reduction of ~54% of exact positional matches. Unfortunately, this highlights the additional work we need to do when trying to harmonize diseases.

To illustrate the problem with (2), consider patients with DOID:3008, invasive ductal carcinoma, who account for 11% of the unmatched diseases (or 6% of the overall reduction). 82% of these patients have a CGI match to DOID:3458, breast adenocarcinoma, which is defined by the Human Disease Ontology as:

"A breast carcinoma that originates in the milk ducts and/or lobules (glandular tissue) of the breast."

This would be fine if breast adenocarcinoma were a parent of breast lobular carcinoma and breast ductal carcinoma, but it's listed instead as a sibling of those terms in the ontology. Frustratingly, this means the CGI term doesn't match the GENIE patient disease (invasive ductal carcinoma is the sole descendant of breast ductal carcinoma), as we only match (for good reasons) on ancestral relationships (see Figure S3b for details).
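
To make the mismatch concrete, here is a minimal sketch of ancestor-only matching on a toy slice of the ontology. It is not the aggregator's actual code: the `PARENT` table and the `ancestors` and `matches` helpers are hypothetical, terms are keyed by name rather than DOID, and I'm assuming "breast carcinoma" is the shared parent of the three sibling terms (going by the definition quoted above) and that it rolls up to "cancer". The real Disease Ontology is a DAG with multiple parents; a single-parent table is enough to show the issue.

```python
# Toy is_a edges (child -> parent). Assumed structure, for illustration only.
PARENT = {
    "breast adenocarcinoma": "breast carcinoma",
    "breast ductal carcinoma": "breast carcinoma",
    "breast lobular carcinoma": "breast carcinoma",
    "invasive ductal carcinoma": "breast ductal carcinoma",
    "breast carcinoma": "cancer",  # intermediate terms omitted
}


def ancestors(term):
    """Return the term itself followed by its ancestors, nearest first."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(patient_disease, interpretation_disease):
    """True if the interpretation's disease is the patient's disease or an ancestor of it."""
    return interpretation_disease in ancestors(patient_disease)


patient = "invasive ductal carcinoma"
print(matches(patient, "breast ductal carcinoma"))  # parent term
print(matches(patient, "cancer"))                   # broad ancestor
print(matches(patient, "breast adenocarcinoma"))    # sibling of the parent term
```

The first two calls match because the interpretation term sits on the patient's ancestor chain. The third does not, because breast adenocarcinoma sits beside that chain rather than on it, which is the situation described above.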