biothings / biothings_explorer

TRAPI service for BioThings Explorer

Home Page: https://api.bte.ncats.io


test alternate ordering of templates for Drug - treats - Disease creative mode query

andrewsu opened this issue

@mbrush noted:

I have noted that every time I see a BTE-reasoned prediction, all of the support paths seem to be of the form Drug - treats -> Phenotype -phenotype_of-> Disease (or this with an additional subclass_of edge as a third hop). I haven't come across any more molecular/mechanistic paths behind BTE predictions.

For Drug - treats - Disease creative mode queries, the template list that BTE uses is defined in templateGroups.json:

[
  {
    "name": "Drug treats Disease",
    "subject": ["Drug", "SmallMolecule", 
                "ChemicalEntity", "ComplexMolecularMixture", "MolecularMixture"
               ],
    "predicate": ["treats", "ameliorates"],
    "object": ["Disease", "PhenotypicFeature",
               "DiseaseOrPhenotypicFeature"
              ],
    "templates": [
      "Chem-treats-DoP.json",
      "Chem-treats-PhenoOfDisease.json",
      "Chem-regulates,affects-Gene-biomarker,associated_condition-DoP.json"
    ]
  },
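To make the selection logic concrete, here is a minimal Python sketch of how a config like this could drive template selection: the first group whose subject/predicate/object categories all match the query supplies the ordered template list. The `select_templates` helper and the matching-by-membership logic are illustrative assumptions, not BTE's actual implementation.

```python
import json

def select_templates(group_defs, subject, predicate, obj):
    """Return the ordered template list of the first group whose
    subject/predicate/object categories all match the query
    (a simplification; not BTE's actual matching code)."""
    for group in group_defs:
        if (subject in group["subject"]
                and predicate in group["predicate"]
                and obj in group["object"]):
            return group["templates"]
    return []

# The "Drug treats Disease" group quoted above, as parseable JSON.
groups = json.loads("""
[
  {
    "name": "Drug treats Disease",
    "subject": ["Drug", "SmallMolecule", "ChemicalEntity",
                "ComplexMolecularMixture", "MolecularMixture"],
    "predicate": ["treats", "ameliorates"],
    "object": ["Disease", "PhenotypicFeature", "DiseaseOrPhenotypicFeature"],
    "templates": [
      "Chem-treats-DoP.json",
      "Chem-treats-PhenoOfDisease.json",
      "Chem-regulates,affects-Gene-biomarker,associated_condition-DoP.json"
    ]
  }
]
""")

# The pheno-based template sits 2nd, right after direct treats edges.
print(select_templates(groups, "Drug", "treats", "Disease"))
```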

So the phenotype-based template is the second template executed (after direct treats edges), and it appears that, very often, BTE fills its entire answer list with entries from this template, so the other templates in our template library are never used.

In this issue, I propose that we systematically test the performance of each template (and some subset of template combinations) using the Benchmarks tool. Some systematic testing will give us a more data-driven basis for the selection and ordering of templates that BTE uses.
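One way to structure such a systematic test is to score every permutation of the template list and compare. A minimal sketch, where `benchmark_orderings` and `score_fn` are hypothetical names and `score_fn` stands in for a run of the Benchmarks tool (returning, say, a count of known answers recovered):

```python
from itertools import permutations

TEMPLATES = [
    "Chem-treats-DoP.json",
    "Chem-treats-PhenoOfDisease.json",
    "Chem-regulates,affects-Gene-biomarker,associated_condition-DoP.json",
]

def benchmark_orderings(templates, score_fn):
    """Score every ordering of the template list and return the best one.
    score_fn is a stand-in for a Benchmarks-tool run on that ordering."""
    scored = {order: score_fn(order) for order in permutations(templates)}
    best = max(scored, key=scored.get)
    return best, scored
```

With three templates this is only six runs; adding a few two-template subsets keeps the test matrix small while still covering the combinations we care about.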

I do think that we want to showcase results based on ALL of the BTE templates. From my testing, the ONLY template I see being used is Chem-treats-PhenoOfDisease. The reason for this, as I understand it, is that the other templates BTE has created never get used: the phenotype-based one executes first and fills up (or times out) the results before the other templates can run. As a consequence, the only support paths we ever see from BTE in the UI are instances of this phenotype-based template.

Naively, I would propose the simplest solution: increase the limit on how many support paths can be returned, so that the query finds all the phenotype-based paths and is still able to execute the other templates and return support paths based on them. But I suspect there may be performance/timeout issues with this.

Alternatively, you could limit the number of results from the phenotype-based template in a way that leaves room for executing and returning paths based on the other templates. I suspect that you will find many fewer results from these other templates - based on what I know about knowledge sources serving the data needed for them, and what I've seen from other reasoners that employ similar templates/rules. But I do think these other templates represent convincing additional evidence that would be critical to surface when it is available.
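The per-template cap idea could look something like the sketch below. The function name, the cap value of 200, and the `execute` callable are all hypothetical; only the 500-result overall limit comes from this thread.

```python
RESULT_LIMIT = 500        # overall result limit mentioned in this thread
PER_TEMPLATE_CAP = 200    # hypothetical per-template cap

def run_templates(templates, execute,
                  result_limit=RESULT_LIMIT,
                  per_template_cap=PER_TEMPLATE_CAP):
    """Execute templates in order, capping each template's contribution
    so that later templates still have room to add results.
    `execute` stands in for running one template's subqueries."""
    results = []
    for template in templates:
        remaining = result_limit - len(results)
        if remaining <= 0:
            break
        take = min(remaining, per_template_cap)
        results.extend(execute(template)[:take])
    return results

# Toy run: each template alone could return 1000 results,
# but the cap leaves room for all three (200 + 200 + 100 = 500).
fake_execute = lambda t: [f"{t}:{i}" for i in range(1000)]
capped = run_templates(["treats", "pheno", "molecular"], fake_execute)
print(len(capped))  # 500
```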

I think the performance tests Andrew suggests would be a good place to start to assess the feasibility of these possible solutions, and/or to surface other approaches to address the issue.

Adding some context:

  • earlier this year, I tested and added some molecular/mechanistic templates that would run before the pheno-template (#461 (comment), commit). However, we dropped the use of those templates due to data-modeling concerns (#699 (comment), commit, Translator Slack link)
  • I did a few manual tests of template order at that time. I noticed that:
    • running the current templates in the current order (pheno 2nd, molecular 3rd) produced more known answers, which also scored higher, compared to switching their order
    • with the current ordering, the 2nd (pheno) template often filled the 500-result limit on its own