Pathfinder Prototype

tokebe opened this issue 4 months ago · comments

Jackson Callaghan commented 4 months ago

One priority for the current Translator sprint is a working Pathfinder prototype. This prototype must satisfy a specific input/output format, and should return adequate results for 4 example queries.

Problem Overview

Query format

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": [
            "some:CURIE"
          ]
        },
        "un": {
          "categories": [
            "biolink:NamedThing"
          ]
        },
        "n2": {
          "ids": [
            "some:CURIE"
          ]
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}

The result format roughly mirrors the query format: three primary edges connecting the two "pinned" query nodes and one intermediate node. Each edge is "artificial" and has an associated support graph, as in existing inferred-mode queries.

Example result

{
  "node_bindings": {
    "n0": [{"id": "n0_pinned_node"}],
    "un": [{"id": "some_intermediate_node"}],
    "n2": [{"id": "n2_pinned_node"}]
  },
  "analyses": [
    {
      "resource_id": "infores:biothings-explorer",
      "edge_bindings": {
        "e0": [{"id": "inferred-n0-related_to-un"}], // has support graph
        "e1": [{"id": "inferred-un-related_to-n2"}], // has support graph
        "e2": [{"id": "inferred-n0-related_to-n2"}] // has support graph
      },
      "score": 1
    }
  ]
}

Our 4 example queries are as follows:

How does imatinib affect asthma? (drug to disease)
How does resveratrol affect glyoxalase? (chemical to gene)
In some rare families, Crohn's disease and Parkinson's disease co-occur. Is there a possible genetic link between these two apparently distinct conditions? (disease to disease)
A GWAS signal suggests that a variant in the gene SLC6A20 is quite strongly associated with susceptibility to COVID19. What could be the molecular mechanism(s), based on the functions of SLC6A20, that explain(s) why a change of its biological activity affects vulnerability to the SARS-CoV-2 virus? (gene to disease)

The important differences are that:

There are 3 inferred edges each with support graphs, rather than 1 as in previous inferred-mode queries,
For every intermediate node in an "answer" to the query, BTE must generate a result.

Explaining further, for every intermediate node between the two pinned nodes, BTE must generate a result with that intermediate node as the unpinned node, and support graphs for edges on either side representing the rest of the path on either side of that node, as well as the "overall" edge having a support graph representing the full path.

This does mean that BTE will be generating many "redundant" results which bind essentially the same information (aside from the unpinned node) in different "view-frames".
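
To make the "one result per intermediate node" rule concrete, here is a minimal TypeScript sketch. The AnswerPath and PathfinderView shapes and the expandPath function are hypothetical illustrations, not BTE's actual data structures, and the sketch assumes each answer is a simple linear path.

```typescript
// Hypothetical shapes for illustration only (not BTE's internal types).
interface AnswerPath {
  nodes: string[]; // node IDs in order: pinned n0, intermediates..., pinned n2
  edges: string[]; // KG edge IDs; edges[i] connects nodes[i] to nodes[i + 1]
}

interface PathfinderView {
  un: string;          // the intermediate node bound to "un" in this result
  e0Support: string[]; // edges covering n0 -> un
  e1Support: string[]; // edges covering un -> n2
  e2Support: string[]; // edges covering the full path n0 -> n2
}

// One view per intermediate node; the pinned endpoints never become "un".
function expandPath(path: AnswerPath): PathfinderView[] {
  return path.nodes.slice(1, -1).map((un, idx) => {
    const split = idx + 1; // number of edges between n0 and this intermediate
    return {
      un,
      e0Support: path.edges.slice(0, split),
      e1Support: path.edges.slice(split),
      e2Support: [...path.edges], // identical for every view of the same path
    };
  });
}

const views = expandPath({
  nodes: ["n0_pinned_node", "geneA", "geneB", "n2_pinned_node"],
  edges: ["edge1", "edge2", "edge3"],
});
console.log(views.map((v) => v.un)); // [ 'geneA', 'geneB' ]
```

A two-intermediate path therefore yields two results that differ only in where the path is split, which is the "redundant view-frames" effect described above.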

Approach

In order to approach this problem within BTE's existing system, several steps must occur in query execution:

1. Recognize a Pathfinder query: recognize this specific query structure and enter a dedicated query execution mode/control-flow.
2. Select templates: select Pathfinder templates separately from the existing templates. This may be accomplished by registering templates in the templateGroups file with the flag "pathfinder": true and ensuring that flag is checked when obtaining Pathfinder templates.
3. Execute templates: fill out these templates and execute them in the normal inferred-mode way, resulting in a merged result set across all templates.
4. Special results formatting: iterate over the existing result support graphs to generate a new result set with proper bindings and structure.
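
The first two steps could be sketched roughly as follows. This is only an illustration: matchPathfinder, pathfinderGroups, and the TemplateGroup shape are hypothetical names and assumptions (only the "pathfinder": true flag comes from the description above), and the real handler may recognize the query differently.

```typescript
// Minimal TRAPI-like shapes; only the fields these checks read.
interface QNode { ids?: string[]; categories?: string[] }
interface QEdge { subject: string; object: string; predicates?: string[]; knowledge_type?: string }
interface QueryGraph { nodes: Record<string, QNode>; edges: Record<string, QEdge> }

interface PathfinderShape { start: string; unpinned: string; end: string }

// Hypothetical helper: returns the node keys if the graph has the Pathfinder shape, else null.
function matchPathfinder(qg: QueryGraph): PathfinderShape | null {
  const edges = Object.values(qg.edges);
  if (Object.keys(qg.nodes).length !== 3 || edges.length !== 3) return null;
  if (!edges.every((e) => e.knowledge_type === "inferred")) return null;

  // Split node keys into pinned (has ids) and unpinned (no ids).
  const pinned: string[] = [];
  const unpinnedKeys: string[] = [];
  for (const [key, node] of Object.entries(qg.nodes)) {
    (node.ids && node.ids.length > 0 ? pinned : unpinnedKeys).push(key);
  }
  if (pinned.length !== 2 || unpinnedKeys.length !== 1) return null;

  const [a, b] = pinned;
  const [unpinned] = unpinnedKeys;
  const hasEdge = (s: string, o: string) => edges.some((e) => e.subject === s && e.object === o);
  // Try both orientations: start -> unpinned -> end, plus the direct start -> end edge.
  for (const [start, end] of [[a, b], [b, a]]) {
    if (hasEdge(start, unpinned) && hasEdge(unpinned, end) && hasEdge(start, end)) {
      return { start, unpinned, end };
    }
  }
  return null;
}

// Assumed shape of a templateGroups entry; only the "pathfinder" flag is from the text above.
interface TemplateGroup { name: string; pathfinder?: boolean; templates: string[] }

function pathfinderGroups(groups: TemplateGroup[]): TemplateGroup[] {
  return groups.filter((g) => g.pathfinder === true);
}
```

For the query format shown earlier, matchPathfinder would return the keys n0, un, and n2; any other graph shape returns null and would fall through to normal execution.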

Important Considerations

These steps should be fairly straightforward to implement, with a few complications:

@colleenXu points out some issues BTE currently faces with multiple pinned nodes in inferred-mode execution.
The inferred-mode template results-merging code likely isn't able to properly handle a multi-hop inferred query. We can probably get around this by generating an artificial query graph which is represented as a single hop, and feeding that artificial query graph to the inferred mode handler, rather than the "true" query graph. This should cause BTE to re-bind template results into a single edge with a single support graph, which can then be easily traversed for Step 4.
If the above trick for the inferred handler works, it'll be returning 1 result with many support graphs (because both nodes are pinned). So, we'll have to iterate over each support graph, to generate the final results in a one->many relationship.
Because the inferred-mode handler will be generating a single merged result, we won't be able to automatically stop execution at 500 results. The inferred mode handler will have to be modified to count the number of support graphs, multiplying each by the number of intermediate nodes, and check that number against the limit when in Pathfinder mode.
Because the inferred-mode handler will be merging into a single result, the score would cease to be meaningful. The inferred mode handler will have to be modified to store scores separately before merging so that each support graph has an associated score. Each score will likely be mapped to multiple final results since each path will generate a number of results equal to the number of intermediate nodes it has. That much will be a limitation we'll have to accept.
We can save a little space in the response TRAPI by binding the same e2 edge (and associated support graph) for each result a given answer path generates.
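
The "artificial query graph" trick mentioned above might look something like the sketch below. toSingleHop is a hypothetical helper, not existing BTE code; it simply keeps the two pinned nodes and replaces the three edges with one inferred related_to edge.

```typescript
// Minimal TRAPI-like shapes; only the fields this sketch reads.
interface QNode { ids?: string[]; categories?: string[] }
interface QEdge { subject: string; object: string; predicates?: string[]; knowledge_type?: string }
interface QueryGraph { nodes: Record<string, QNode>; edges: Record<string, QEdge> }

// Hypothetical helper: builds the one-hop graph handed to the inferred-mode handler.
// The original query graph is left untouched so it can still drive final formatting.
function toSingleHop(qg: QueryGraph, start: string, end: string): QueryGraph {
  return {
    nodes: { [start]: { ...qg.nodes[start] }, [end]: { ...qg.nodes[end] } },
    edges: {
      e0: {
        subject: start,
        object: end,
        predicates: ["biolink:related_to"],
        knowledge_type: "inferred",
      },
    },
  };
}
```

Because both remaining nodes are pinned, the handler would be expected to merge every template result into one edge whose support graphs can then be traversed in Step 4.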

Jackson Callaghan · Answer 1 · Fri Mar 15 2024 03:44:24 GMT+0800 (China Standard Time)

@colleenXu Please review and let me know if this aligns with your understanding and covers all the bases. Additionally, let me know if this explanation is sufficient for you to work on an example template.

Jackson Callaghan · Answer 2 · Fri Mar 15 2024 03:45:00 GMT+0800 (China Standard Time)

Please note that I'm currently asking for clarification regarding the 3rd example question.

Colleen Xu · Answer 3 · Wed Mar 20 2024 05:48:09 GMT+0800 (China Standard Time)

I have some feedback on the "Important Considerations". I think it'll be helpful for @tokebe and I to discuss...

(1)

I'm unsure of the assumption that the inferred-mode-handler will produce 1 mega-result with many support-graphs after running the templates.

A template (n0 -> inter_1 -> n2) can return > 1 result if the intermediate nodes aren't set to is_set: true - which is what I plan to do. In this situation, there'll be separate results for each unique set of intermediate nodes, which is kinda partway to the desired output…

Then I'm not sure if the inferred-mode-handler logic will continue to keep those results separate vs merge them into 1 mega-result…

(2)

I'm confused on how the number of support-graphs relates to the number of final-formatted results, ex: point 3's "count the number of support graphs, multiplying each by the number of intermediate nodes"

I assumed that there'd be 1 final-formatted result per unique intermediate node…so if that intermediate node was in multiple template results (aka diff support-graphs), those would all be put together into that final-formatted result.

So then it would make more sense to count the number of intermediate nodes (aka total number of unique nodes - 2) after each template and then stop at >=500?
And this would affect the scoring of the final-formatted results? (maybe adding the scores from the template results together)

(3)

In this situation + our current subclassing code, the subclasses of n0/n2 entity IDs will count as intermediate nodes. Is this a problem / an issue to ask Translator about? I'm not sure if the other teams have implemented this subclassing feature and will encounter this…

(4)

I'm not sure on the last point, because I thought the e2 support-graph would still differ between each final-formatted result. It sounds like the e2 support-graph is basically the union of the edge sets in the e0 and e1 subgraphs. AKA it's still a subgraph containing n0, 1 specific intermediate, and n2.

Colleen Xu · Answer 4 · Wed Mar 20 2024 05:50:05 GMT+0800 (China Standard Time)

Regarding point 1 of the "Important Considerations", I'm going to review the problem again to see if it's still relevant...

Jackson Callaghan · Answer 5 · Wed Mar 20 2024 23:44:19 GMT+0800 (China Standard Time)

Responding to @colleenXu's feedback:

(1) That's not an assumption; it's a summary of how the inferred-mode handler (currently) works. The inferred-mode handler doesn't maintain template results; it completely mutates them when merging the template response. Template results (of which there are many in conventional inferred mode as well) are mapped back to the one-hop inferred query (which is why I expect we'll need to generate an "artificial" one-hop query to work with the handler) based on its two nodes. Since every result will map to those two nodes, and they're both pinned in the query, every template result will be merged regardless of pathway, with the pathways being represented as support graphs. This is how the handler has worked since the support graph refactor.
(2) You're correct, what I wrote was a messy heuristic. I see where more complicated score merging might come from, since multiple support graphs could have overlapping intermediate nodes, which nominally would show up in some sort of combined state in the final result format. That said, I think for this prototype we should keep it simple and assume that support graphs don't overlap, even if they do. That simplifies the traversal code significantly (and thus the time to implement) at the cost of maybe a few result topologies, which should be absolutely fine for a rough prototype.
(3) I don't think we need to ask about how to deal with this. The UI can handle nested support graphs, so we just need to make sure we're handling subclass support graphs specially. We can either leave them as support graphs on edges within the result support graph, or we can flatten them out to the same level. @rjawesome I leave it to you to decide whichever is easiest to implement.
(4) Well, yes, it's a union of the e0 and e1 support graphs. Which... means it'll have every intermediate node? I'm not sure why you think it'd have 1 specific intermediate and not every intermediate, since they'd all be contained within the support graphs being union'd... Regardless, the intent we've been told for that edge is that it's the entire pathway, nothing cut. So it'd be the same pathway for every result that uses two subsets of that pathway.

Colleen Xu · Answer 6 · Thu Mar 21 2024 08:42:37 GMT+0800 (China Standard Time)

Update

Jackson @tokebe: here are some slides based on our discussions of this pathfinder prototype so far. They're editable, so you should be able to adjust things. This should be useful for discussions, including with @rjawesome.

On point (2):

the heuristic actually refers to the "number of intermediate nodes in a template", right? I was confused because it sounds like "the total number of intermediate nodes in the KG".
so one way to restate it is: for each template run, record (number of results) * (number of intermediate nodes in that template's QGraph, will probably be 1 or 2). Sum these together as templates are run. Stop when this number >= 500.
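
As a rough sketch of that restated heuristic (PathfinderResultEstimate is a hypothetical name, and where exactly this check would live in the handler is left open):

```typescript
const RESULT_LIMIT = 500;

// Hypothetical accumulator for the Pathfinder stop condition.
class PathfinderResultEstimate {
  private total = 0;

  // Record one finished template run; returns true when execution should stop.
  record(resultCount: number, intermediateNodes: number): boolean {
    this.total += resultCount * intermediateNodes;
    return this.total >= RESULT_LIMIT;
  }

  // Estimated number of final-formatted results so far.
  current(): number {
    return this.total;
  }
}

const estimate = new PathfinderResultEstimate();
console.log(estimate.record(84, 1));   // false (84 estimated results so far)
console.log(estimate.record(4472, 2)); // true  (84 + 8944 = 9028 >= 500)
```

The estimate over-counts whenever the same intermediate node appears in several template results, which is acceptable if the limit is only used as a stop signal.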

Here's some stuff that came out of our 1-on-1 discussion today:

(1) We're now on the same page: at the end of running templates/"normal inferred-mode execution", there'll be 1 result with 1 "mega-edge" between the pinned nodes n0 and n2. This "mega-edge" will have tons of support-graphs. 1 of these support-graphs = 1 result from a template.
(2) See question above. I think we're basically on the same page, but I want to clarify the wording.
(3) @rjawesome We basically agreed that the nested subgraphs for subclassing stuff should be basically ignored for this formatting. It'd be more work to unpack them and it doesn't feel quite right to treat the subclass nodes as "intermediate node answers".
(4) We're on the same page regarding e2 now. When iterating through each support-graph in the "mega-edge", this support-graph can be kept and used for e2. If there are multiple intermediate nodes that get split into different "new results", they can use the same e2/subgraph-ref.

Colleen Xu · Answer 7 · Thu Mar 21 2024 14:31:06 GMT+0800 (China Standard Time)

@tokebe @rjawesome

I'm putting the pathfinder template-groups and templates here: https://github.com/biothings/bte_trapi_query_graph_handler/tree/pathfinder-templates/data

[EDIT: The notes below aren't using the potential answers Sui posted]

Notes on Case A: how does imatinib affect asthma? (drug - disease)

assuming the input curies are PUBCHEM.COMPOUND:5291 (imatinib), MONDO:0004979 (asthma)
- but I'm seeing some discussion of "allergic asthma". Which may be MONDO:0004784.
1st template (saved response): gene intermediate. Runs in 1 min 30s, 684 results, top result is KIT
2nd template (saved response): gene + cell intermediates. Runs in 51s, 380 results, top result is KIT + mast cell
3rd template (saved response): gene + PhysiologicalProcess/pathway intermediates. Runs in 5 min 33s, 1876 results. Results include previously interesting PhysiologicalProcess intermediate nodes like "immune response", "bronchoconstriction", "cytokine production"

Notes on Case B: how does resveratrol affect glyoxalase? (chemical - gene)

assuming the input curies are PUBCHEM.COMPOUND:445154 (Resveratrol), NCBIGene:2739 (glyoxalase, GLO1)
1st template (saved response): 1 gene intermediate. Runs in 28 s, 84 results, result 3 is NFE2L2 (NCBIGene:4780). Didn't see any MAPK.
2nd template (response too large to attach): 2 gene intermediates. Runs in 1 min 1 s, 4472 results. Result 10 includes NFE2L2. Result 116 (or around there) is a possible answer (Resveratrol ➡️ MAPK1 (ERK2, NCBIGene:5594) ➡️ NFE2L2 ➡️ GLO1).

Possible answer notes

From ref (Translator Slack): Resveratrol ➡️ ERK ➡️ Nrf2 ➡️ genetic element "antioxidant response element" ➡️ glyoxalase
- ERK (extracellular signal-regulated kinase) genes or proteins:
  - MAPK3 (NCBIGene:5595) aka ERK1
  - MAPK1 (NCBIGene:5594) aka ERK2
- NFE2L2 gene NCBIGene:4780 == Nrf2 protein (nuclear factor erythroid 2-related factor 2).
  - BUT there'll be text-mining related confusion >.<. There's another Nrf2 protein (nuclear respiratory factor 2 == GABPA gene) that's also a transcription factor. (ref, was linked by this paper section 2.1)
- genetic feature "antioxidant response element": SRI NameResolver says there's a MESH ChemicalEntity term for this.
From Translator Slack:
- RELA gene (NCBIGene:5970) == Transcription factor p65 also known as nuclear factor NF-kappa-B p65 subunit protein
Negative controls?
- Look for MAPK8-14. Ref said it wasn't using p38 (MAPK11-14) or c-Jun N-terminal kinase (JNK) pathways (MAPK8-10)

Notes on Case C: is there a possible genetic link between Crohn disease and Parkinson disease? (disease - disease)

assuming the input curies are MONDO:0005011 (Crohn disease) and MONDO:0005180 (Parkinson disease)
possible answers:
- LRRK2
- but also, it's not really clear if there's a genetic link
1st template (saved response): SequenceVariant intermediate: Runs in 16 s, 1 result rs2066842 in gene NOD2
2nd template (saved response): Gene intermediate. Runs in 1 min 55 s, 392 results, 1st result is LRRK2.

Notes on Case D: What "molecular mechanisms" could explain the link between SLC6A20 and susceptibility to COVID19? (gene - disease)

assuming the input curies are NCBIGene:54716 (SLC6A20), MONDO:0100096 (COVID19)
- SLC6A20 is the gene name, VS the protein product has multiple names (SIT1, sodium-dependent Imino Transporter 1 / System IMINO transporter, XTRP3)
possible answers:
- ACE2 (gene) - 2nd paragraph of this paper
- glycine (amino acid, chemical entity)
1st template (saved response): 1 unconstrained intermediate. Runs in 2 min 26s, 38 results, top result is ACE2. Glycine is also there, as last result.
2nd template (saved response): chemical + gene/protein intermediates. Runs in 1 min 30s, 186 results. 6th result is the possible answer SLC6A20 ➡️ glycine ➡️ ACE2 ➡️ COVID19. Other results include glycine (1st, 3rd) or ACE2 (11th).

Template guidelines:

MUST have >= 1 true intermediate node (no direct-edge lookup since QGraph/result format depends on filling out the unpinned node…)
keep query-edge direction n0 ➡️ un ➡️ n2
don't use is_set: true: generate separate results per unique node-collection
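
These guidelines could be checked mechanically with something like the sketch below. checkTemplate is a hypothetical helper; it assumes templates are single linear chains and that the pinned nodes use the keys n0 and n2, neither of which is confirmed for the actual template files.

```typescript
// Minimal template query-graph shapes; only the fields this check reads.
interface TNode { ids?: string[]; categories?: string[]; is_set?: boolean }
interface TEdge { subject: string; object: string }
interface TemplateGraph { nodes: Record<string, TNode>; edges: Record<string, TEdge> }

// Hypothetical lint: returns a list of guideline violations (empty when the template passes).
function checkTemplate(t: TemplateGraph, start = "n0", end = "n2"): string[] {
  const problems: string[] = [];

  const intermediates = Object.entries(t.nodes).filter(([key]) => key !== start && key !== end);
  if (intermediates.length === 0) problems.push("needs at least one intermediate node");
  for (const [key, node] of intermediates) {
    if (node.is_set === true) problems.push(`${key} must not use is_set: true`);
  }

  // Follow subject -> object from the start node; every edge must be used exactly once.
  const edges = Object.values(t.edges);
  let current = start;
  let steps = 0;
  while (current !== end && steps < edges.length) {
    const next = edges.find((e) => e.subject === current);
    if (next === undefined) break;
    current = next.object;
    steps++;
  }
  if (current !== end || steps !== edges.length) {
    problems.push(`edges must form one chain directed from ${start} to ${end}`);
  }
  return problems;
}
```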

Jackson Callaghan · Answer 8 · Fri Mar 22 2024 04:20:49 GMT+0800 (China Standard Time)

"so one way to restate it is: for each template run, record (number of results) * (number of intermediate nodes in that template's QGraph, will probably be 1 or 2). Sum these together as templates are run. Stop when this number >= 500."

@colleenXu Agreed, this is a good heuristic.

@rjawesome Please note that we've finalized our expectation for how "final" results should be generated when iterating over template results (as inferred-handler result support graphs). This is detailed on slides 9-13 in the above linked slides.

Colleen Xu · Answer 9 · Fri Mar 22 2024 15:25:36 GMT+0800 (China Standard Time)

@rjawesome @tokebe

I have an update on point 1 of the "Important Considerations": I can't recreate the buggy behavior, so maybe things are fine?

The previous buggy behavior was: I set up a Pathfinder TRAPI query with two starting IDs/nodes that we shouldn't find any results for - but instead results were returned that connected only to the first starting ID/node. (ref: lab Slack convo starting here)

But I wasn't able to recreate this behavior using the current pathfinder-templates branch. Here's what I did:

1. Check out the query-handler pathfinder-templates branch, which also has the modified check on QNode IDs (biothings/bte_trapi_query_graph_handler@8deeada, a commit originally from @tokebe's inferred-explain branch). Be on the main branches for all other modules.
2. Adjust the query-handler data/templateGroups.json array so it only contains the "Pathfinder: Drug-Disease" template-group object.
3. Set up BTE in "CI" mode (pnpm build, API_OVERRIDE=true INSTANCE_ENV=ci pnpm run smartapi_sync, then I use INSTANCE_ENV=ci USE_THREADING=false pnpm start).
4. Run the query below: this is basically what the "artificial query graph" that goes into the inferred-mode-handler will look like = 1 QEdge with predicate related_to and knowledge_type inferred + both QNodes set to IDs.
5. BTE finds NO results, which is the expected behavior.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:5291"],
                    "categories":["biolink:ChemicalEntity"],
                    "name": "imatinib"
                },
                "n1": {
                    "ids":["MONDO:0011821"],
                    "categories":["biolink:DiseaseOrPhenotypicFeature"],
                    "name": "Meckel syndrome, type 3"
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:related_to"],
                    "knowledge_type": "inferred"
                }
            }
        }
    }
}

But I did hit another bug, which didn't halt execution. I'll open another issue for it.

Rohan Juneja · Answer 10 · Sat Mar 30 2024 08:20:13 GMT+0800 (China Standard Time)

Functionality should be finished in the pathfinder branch of bte_query_graph_handler repo.
I did want to add in a few tests / check over the code a bit more before making a PR. However, it would be a good idea to make sure that the functionality is implemented correctly.

Current test query that I have been using

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "PUBCHEM.COMPOUND:5291"
                    ],
                    "categories": ["biolink:Drug"]
                },
                "un": {
                    "categories": [
                        "biolink:NamedThing"
                    ]
                },
                "n2": {
                    "ids": [
                        "MONDO:0004979"
                    ]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "un",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "knowledge_type": "inferred"
                },
                "e1": {
                    "subject": "un",
                    "object": "n2",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "knowledge_type": "inferred"
                },
                "e2": {
                    "subject": "n0",
                    "object": "n2",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "knowledge_type": "inferred"
                }
            }
        }
    }
}

Jackson Callaghan · Answer 11 · Wed Apr 03 2024 03:41:58 GMT+0800 (China Standard Time)

@colleenXu @rjawesome I've done a brief code review of the branch, the execution looks pretty straightforward and good to me, so it's on to testing.

There are a couple of notes, which might be better discussed in a draft PR:

What's the purpose of the scores attribute defined on interface CreativePathfinderResponse? I don't see additional references to it in the branch, and IIRC that wouldn't be proper TRAPI.
This is a purely code-style note, but a good thing to watch for code readability: There are a few places (particularly in the main parse loop) where you've indented quite far due to nested if/else branches. The readability can be improved using condition guards to avoid nesting where practical.
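
To illustrate the guard-clause suggestion, here is a generic before/after sketch. It is a made-up example (the SupportGraph shape and both function names are hypothetical), not code taken from the pathfinder branch.

```typescript
interface SupportGraph { edges: string[] }

// Nested version: the useful work ends up two levels deep.
function countUsableNested(graphs: Record<string, SupportGraph | undefined>, ids: string[]): number {
  let count = 0;
  for (const id of ids) {
    const graph = graphs[id];
    if (graph) {
      if (graph.edges.length > 0) {
        count++;
      }
    }
  }
  return count;
}

// Guarded version: bail out early on each uninteresting case, keep the main path flat.
function countUsable(graphs: Record<string, SupportGraph | undefined>, ids: string[]): number {
  let count = 0;
  for (const id of ids) {
    const graph = graphs[id];
    if (!graph) continue;                    // unknown support graph ID
    if (graph.edges.length === 0) continue;  // nothing to traverse
    count++;
  }
  return count;
}
```

Both functions behave identically; the second just reads top to bottom without tracking which branch is currently open.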

Rohan Juneja · Answer 12 · Wed Apr 03 2024 07:51:59 GMT+0800 (China Standard Time)

I think I added that type when I was working on it earlier but it is no longer needed. I removed it.
Should be addressed in the latest commits.

Colleen Xu · Answer 13 · Wed Apr 03 2024 10:19:35 GMT+0800 (China Standard Time)

@rjawesome @tokebe

I checked out the pathfinder branch and I can't successfully build. Perhaps the issue is that this branch isn't merged with the latest main?

Here's the error

@biothings-explorer/query_graph_handler:build: > @biothings-explorer/query_graph_handler@1.18.0 build /Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler
@biothings-explorer/query_graph_handler:build: > tsc -b
@biothings-explorer/query_graph_handler:build: 
@biothings-explorer/query_graph_handler:build: src/batch_edge_query.ts:1:20 - error TS2614: Module '"@biothings-explorer/call-apis"' has no exported member 'RedisClient'. Did you mean to use 'import RedisClient from "@biothings-explorer/call-apis"' instead?
@biothings-explorer/query_graph_handler:build: 
@biothings-explorer/query_graph_handler:build: 1 import call_api, { RedisClient } from '@biothings-explorer/call-apis';
@biothings-explorer/query_graph_handler:build:                      ~~~~~~~~~~~
@biothings-explorer/query_graph_handler:build: 
@biothings-explorer/query_graph_handler:build: 
@biothings-explorer/query_graph_handler:build: Found 1 error.
@biothings-explorer/query_graph_handler:build: 
@biothings-explorer/query_graph_handler:build:  ELIFECYCLE  Command failed with exit code 1.
@biothings-explorer/query_graph_handler:build: ERROR: command finished with error: command (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler) /Users/colleenxu/Library/pnpm/pnpm run build exited (1)
@biothings-explorer/query_graph_handler#build: command (/Users/colleenxu/Desktop/biothings_explorer/packages/query_graph_handler) /Users/colleenxu/Library/pnpm/pnpm run build exited (1)

 Tasks:    9 successful, 10 total
Cached:    9 cached, 10 total
  Time:    7.4s 
Failed:    @biothings-explorer/query_graph_handler#build

 ERROR  run failed: command  exited (1)
 ELIFECYCLE  Command failed with exit code 1.

Colleen Xu · Answer 14 · Wed Apr 03 2024 14:38:38 GMT+0800 (China Standard Time)

@rjawesome @tokebe

I've added support for example/cases 2 (chem - gene) and 3 (disease - disease):

pathfinder-templates branch now has templates/filled-out templateGroups for those cases
updated my notes in the comment above

That branch also has an update to one of the earlier templates (commit) AND is merged with the latest main.

Jackson Callaghan · Answer 15 · Thu Apr 04 2024 03:52:56 GMT+0800 (China Standard Time)

@rjawesome You'll have to pull in the latest from the main branch and fix any merge conflicts

Rohan Juneja · Answer 16 · Thu Apr 04 2024 09:04:43 GMT+0800 (China Standard Time)

@colleenXu The main branch should be merged now, which fixes the RedisClient error. Also, the new templates from pathfinder-templates have been merged into pathfinder

Jackson Callaghan · Answer 17 · Fri Apr 05 2024 05:09:15 GMT+0800 (China Standard Time)

@rjawesome: @colleenXu and I ran some testing on the imatinib-asthma example, and we're seeing some odd behavior:

Truncation and pruning code seems to be failing to remove nodes, edges, and pathfinder support graphs from results that have been truncated. Pruning behavior will probably have to be updated to more intelligently check whether a node/edge/support graph is being used in a result.
Every result e0 appears to be actually in the proper format for e2, while e0 shouldn't contain the hop un->n2. Every e1 appears to have a single, empty aux graph (where it should instead contain anything supporting the hop from un->n2). Meanwhile, every e2 is the same edge with hundreds of aux graphs, where it should be exactly as e0 currently appears (a unique e2 per result, with one aux graph).
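
For the pruning problem, one possible approach is a reachability pass: start from what the surviving results bind, follow support graphs (including nested ones), and delete everything else. The sketch below uses simplified TRAPI-like shapes and a hypothetical prune function; it is an illustration of the idea, not the branch's implementation.

```typescript
interface Attribute { attribute_type_id: string; value: unknown }
interface KGEdge { subject: string; object: string; attributes?: Attribute[] }
interface AuxGraph { edges: string[] }
interface Binding { id: string }
interface Result {
  node_bindings: Record<string, Binding[]>;
  analyses: { edge_bindings: Record<string, Binding[]> }[];
}
interface Message {
  results: Result[];
  knowledge_graph: { nodes: Record<string, unknown>; edges: Record<string, KGEdge> };
  auxiliary_graphs: Record<string, AuxGraph>;
}

// Support-graph IDs referenced by an edge's biolink:support_graphs attributes.
function supportGraphIds(edge: KGEdge): string[] {
  const ids: string[] = [];
  for (const attr of edge.attributes ?? []) {
    if (attr.attribute_type_id !== "biolink:support_graphs") continue;
    if (Array.isArray(attr.value)) ids.push(...(attr.value as string[]));
  }
  return ids;
}

// Hypothetical pruning pass: mutates the message, keeping only what results can reach.
function prune(message: Message): void {
  const keepNodes = new Set<string>();
  const keepEdges = new Set<string>();
  const keepAux = new Set<string>();
  const queue: string[] = [];

  for (const result of message.results) {
    for (const bindings of Object.values(result.node_bindings)) {
      bindings.forEach((b) => keepNodes.add(b.id));
    }
    for (const analysis of result.analyses) {
      for (const bindings of Object.values(analysis.edge_bindings)) {
        bindings.forEach((b) => queue.push(b.id));
      }
    }
  }

  // Walk edge -> support graph -> edge so nested support graphs are kept too.
  while (queue.length > 0) {
    const edgeId = queue.pop() as string;
    const edge: KGEdge | undefined = message.knowledge_graph.edges[edgeId];
    if (!edge || keepEdges.has(edgeId)) continue;
    keepEdges.add(edgeId);
    keepNodes.add(edge.subject);
    keepNodes.add(edge.object);
    for (const auxId of supportGraphIds(edge)) {
      if (keepAux.has(auxId)) continue;
      keepAux.add(auxId);
      queue.push(...(message.auxiliary_graphs[auxId]?.edges ?? []));
    }
  }

  const kg = message.knowledge_graph;
  for (const id of Object.keys(kg.nodes)) if (!keepNodes.has(id)) delete kg.nodes[id];
  for (const id of Object.keys(kg.edges)) if (!keepEdges.has(id)) delete kg.edges[id];
  for (const id of Object.keys(message.auxiliary_graphs)) {
    if (!keepAux.has(id)) delete message.auxiliary_graphs[id];
  }
}
```

Run after truncation, a pass like this would drop a truncated result's intermediate node along with any edges and support graphs that no remaining result still uses.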

Colleen Xu · Answer 18 · Fri Apr 05 2024 05:26:14 GMT+0800 (China Standard Time)

@rjawesome

This goes with Jackson's comment above. I think it's easiest to understand visually w/ screenshots. I'm comparing the pathfinder run to running just the template that it's using. Here's the full response jsons for both, which I viewed in a json-viewer and in ARAX-UI (import -> response):

(Thankfully, this example query is pretty simple: 1 template ran, this template provides unique, single intermediate nodes in each result. So there's a 1-to-1 match between final pathfinder results and the template's results)

Point 1 example: Everything related to this intermediate node should have been pruned, but it's all still there

This is the bottom result for the template. This intermediate node (FBLN5, NCBIGene:10516) should be removed from the KG, as well as the stuff associated with it (edges + aux-graphs that are unique to this intermediate node's pathfinder result, both the original template stuff and the pathfinder-constructed stuff).

But they're still there in pathfinder-response:

A KG Node

A normal edge (from template)

Pathfinder edges and aux-graphs

Point 2 example: pathfinder support-graph issues

This is showing the first template's result, with the intermediate node KIT (NCBIGene:3815). We expect e0's support-graph to include all the edges from imatinib (n0) to KIT (un), e1 to include the edge from KIT to asthma (n2), and e2 to include all the edges in this result.

So then we look at the first pathfinder result...

e0 has all the edges in the result (which is what we wanted for e2)

e1 has no edges (empty array)

e2 has a ton of support graphs

Rohan Juneja · Answer 19 · Fri Apr 05 2024 08:57:26 GMT+0800 (China Standard Time)

Pruning has been added to pathfinder. Intermediate edges (e0/e1 from Jackson's test) and main edge (e2 from Jackson's test) have been updated so their auxiliary graphs should be correct now.

Colleen Xu · Answer 20 · Fri Apr 05 2024 15:58:42 GMT+0800 (China Standard Time)

@rjawesome @tokebe

I've added support for the last example 4/D (gene - disease):

commit in pathfinder-templates branch
updated my notes in the comment above

Should I make template adjustments directly in the pathfinder branch from now on?

Jackson Callaghan · Answer 21 · Fri Apr 05 2024 22:31:17 GMT+0800 (China Standard Time)

"Should I make template adjustments directly in the pathfinder branch from now on?"

@colleenXu Yes, I think that makes sense. It shouldn't cause any merge issues with any work done to code in the branch.

Jackson Callaghan · Answer 22 · Fri Apr 05 2024 23:01:51 GMT+0800 (China Standard Time)

@rjawesome I've reviewed your changes and each result edge looks nearly correct now. I see only one remaining problem -- the now correctly-aux-graph'd e2 has its subject and object as n0->un when it should be n0->n2 (even though the edge is bound correctly in the result).

Rohan Juneja · Answer 23 · Sat Apr 06 2024 10:19:25 GMT+0800 (China Standard Time)

e2's aux graph has now been fixed. I've also added some more tests around this behavior.

Colleen Xu · Answer 24 · Sat Apr 06 2024 14:58:02 GMT+0800 (China Standard Time)

@tokebe @rjawesome

I think we are preserving the support-graph info for subclass-edges correctly.

However, it's not showing up properly in the ARAX-UI. This is happening both for our "normal" creative-mode and our pathfinder responses. It's odd because I recall this stuff showing up properly in the past.

Example from normal creative-mode

Saved response from running "treats"-creative mode for MONDO:0007035 (Acanthosis nigricans).

The 4th result has a top-level creative-support-graph.

When I go into that support-graph and then look at the pheno edges, all should have support-graphs based on their IDs. Instead, no info is shown - not even source info.

Example from pathfinder Case A (imatinib-asthma)

The 5th result in the template run is PDGFRA. When you look at that template's run in ARAX-UI, you can see the support-graph/source info for one of the PDGFRA->asthma edges.

But if you look at 5th pathfinder result in ARAX-UI (saved response), that same edge now doesn't show any info.

When I dig into the pathfinder json, all the info for this subclass-edge/its linked support-graph seems to exist and be properly formatted.
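
A manual check like this can also be scripted. The sketch below walks every edge's biolink:support_graphs attribute and reports references that don't resolve; findBrokenSupportRefs is a hypothetical helper, and the sample data is trimmed down from the excerpts shown in this comment.

```typescript
interface Attribute { attribute_type_id: string; value: unknown }
interface KGEdge { subject: string; object: string; attributes?: Attribute[] }
interface AuxGraph { edges: string[] }

// Hypothetical check: lists support-graph references that point at nothing.
function findBrokenSupportRefs(
  kgEdges: Record<string, KGEdge>,
  auxGraphs: Record<string, AuxGraph>
): string[] {
  const has = (obj: object, key: string) => Object.prototype.hasOwnProperty.call(obj, key);
  const problems: string[] = [];
  for (const [edgeId, edge] of Object.entries(kgEdges)) {
    for (const attr of edge.attributes ?? []) {
      if (attr.attribute_type_id !== "biolink:support_graphs" || !Array.isArray(attr.value)) continue;
      for (const graphId of attr.value as string[]) {
        if (!has(auxGraphs, graphId)) {
          problems.push(`${edgeId}: missing support graph ${graphId}`);
          continue;
        }
        for (const member of auxGraphs[graphId].edges) {
          if (!has(kgEdges, member)) problems.push(`${graphId}: missing edge ${member}`);
        }
      }
    }
  }
  return problems;
}

const subclassEdge = "NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass";
const edges: Record<string, KGEdge> = {
  [subclassEdge]: {
    subject: "NCBIGene:5156",
    object: "MONDO:0004979",
    attributes: [{ attribute_type_id: "biolink:support_graphs", value: [`support0-${subclassEdge}`] }],
  },
  "13aa493dafd322cb77c438173de6abd4": { subject: "NCBIGene:5156", object: "MONDO:0005405" },
  "expanded-MONDO:0005405-subclass_of-MONDO:0004979": { subject: "MONDO:0005405", object: "MONDO:0004979" },
};
const aux: Record<string, AuxGraph> = {
  [`support0-${subclassEdge}`]: {
    edges: ["13aa493dafd322cb77c438173de6abd4", "expanded-MONDO:0005405-subclass_of-MONDO:0004979"],
  },
};
console.log(findBrokenSupportRefs(edges, aux)); // []
```

An empty list only shows the references are internally consistent; it says nothing about why a UI might fail to render them.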

The subclass edge

                "NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass": {
                    "predicate": "biolink:gene_associated_with_condition",
                    "subject": "NCBIGene:5156",
                    "object": "MONDO:0004979",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:support_graphs",
                            "value": [
                                "support0-NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass"
                            ]
                        }
                    ],
                    "sources": [
                        {
                            "resource_id": "infores:biothings-explorer",
                            "resource_role": "primary_knowledge_source"
                        }
                    ]
                },

the subclass support-graph

            "support0-NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass": {
                "edges": [
                    "13aa493dafd322cb77c438173de6abd4",
                    "expanded-MONDO:0005405-subclass_of-MONDO:0004979"
                ]
            },

The support-graph's edges + subclass-disease node

Gene to subclass-disease

                "13aa493dafd322cb77c438173de6abd4": {
                    "predicate": "biolink:gene_associated_with_condition",
                    "subject": "NCBIGene:5156",
                    "object": "MONDO:0005405",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:16804324"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        }
                    ],
                    "sources": [
                        {
                            "resource_id": "infores:disgenet",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:mydisease-info",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:disgenet"
                            ]
                        },
                        {
                            "resource_id": "infores:biothings-explorer",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:mydisease-info"
                            ]
                        }
                    ]
                },

Subclass-disease to main-disease

                "expanded-MONDO:0005405-subclass_of-MONDO:0004979": {
                    "predicate": "biolink:subclass_of",
                    "subject": "MONDO:0005405",
                    "object": "MONDO:0004979",
                    "attributes": [],
                    "sources": [
                        {
                            "resource_id": "infores:mondo",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:biothings-explorer",
                            "resource_role": "aggregator_knowledge_source"
                        }
                    ]
                },

subclass-disease node exists as well

                "MONDO:0005405": {
                    "categories": [
                        "biolink:Disease"
                    ],
                    "name": "childhood onset asthma",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:xref",
                            "value": [
                                "MONDO:0005405",
                                "DOID:0080815",
                                "UMLS:C0264408",
                                "MEDDRA:10081274",
                                "SNOMEDCT:233678006"
                            ]
                        },
                        {
                            "attribute_type_id": "biolink:synonym",
                            "value": [
                                "childhood onset asthma",
                                "childhood-onset asthma",
                                "Childhood asthma"
                            ]
                        }
                    ]
                },
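
The manual walk above (subclass edge, then its support graph, then that graph's edges and their nodes) can be sketched as a small check. This is only an illustration under stated assumptions: `find_dangling_references` is a hypothetical helper, not part of BTE or ARAX-UI, and the two node bodies marked as placeholders are not shown in this thread.

```python
from typing import Any

SUPPORT_ATTR = "biolink:support_graphs"


def find_dangling_references(
    edge_id: str,
    kg_nodes: dict[str, Any],
    kg_edges: dict[str, Any],
    aux_graphs: dict[str, Any],
) -> list[str]:
    """List every ID reached through edge_id's support graphs that fails to resolve."""
    edge = kg_edges.get(edge_id)
    if edge is None:
        return [edge_id]
    dangling: list[str] = []
    for attr in edge.get("attributes", []):
        if attr.get("attribute_type_id") != SUPPORT_ATTR:
            continue
        for graph_id in attr.get("value", []):
            graph = aux_graphs.get(graph_id)
            if graph is None:
                dangling.append(graph_id)
                continue
            for support_edge_id in graph.get("edges", []):
                support_edge = kg_edges.get(support_edge_id)
                if support_edge is None:
                    dangling.append(support_edge_id)
                    continue
                # Both endpoints of a support edge must exist as KG nodes.
                for end in ("subject", "object"):
                    if support_edge[end] not in kg_nodes:
                        dangling.append(support_edge[end])
    return dangling


# Trimmed copies of the records quoted above.
SUBCLASS_EDGE = "NCBIGene:5156-gene_associated_with_condition-MONDO:0004979-via_subclass"
SUPPORT_GRAPH = "support0-" + SUBCLASS_EDGE
GENE_EDGE = "13aa493dafd322cb77c438173de6abd4"
EXPANDED_EDGE = "expanded-MONDO:0005405-subclass_of-MONDO:0004979"

kg_nodes = {
    "NCBIGene:5156": {},  # placeholder: node body not shown in this thread
    "MONDO:0004979": {},  # placeholder: node body not shown in this thread
    "MONDO:0005405": {"name": "childhood onset asthma"},
}
kg_edges = {
    SUBCLASS_EDGE: {
        "subject": "NCBIGene:5156",
        "object": "MONDO:0004979",
        "attributes": [{"attribute_type_id": SUPPORT_ATTR, "value": [SUPPORT_GRAPH]}],
    },
    GENE_EDGE: {"subject": "NCBIGene:5156", "object": "MONDO:0005405"},
    EXPANDED_EDGE: {"subject": "MONDO:0005405", "object": "MONDO:0004979"},
}
aux_graphs = {SUPPORT_GRAPH: {"edges": [GENE_EDGE, EXPANDED_EDGE]}}

print(find_dangling_references(SUBCLASS_EDGE, kg_nodes, kg_edges, aux_graphs))
```

An empty list here is consistent with the observation that the JSON is complete, so the missing info would be on the display side rather than in the response.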

Colleen Xu · Answer 25 · Sat Apr 06 2024 15:23:36 GMT+0800 (China Standard Time)

This post will be recording what tests I'm running, the response-jsons, basic response stats, and other notes. I'll raise errors/problems in separate comments.

Basic tests


Different starting query topologies (does it correctly throw error or continue execution):

Only two edges (correct error)
Different edge directions (correct error)
Different node/edge labels (correctly continues execution)
Don't include categories on starting ID nodes (correctly continues execution)
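
For reference, the behavior these four tests expect can be written down as a shape check. This is a minimal sketch under my own assumptions about what "correct topology" means (two nodes with `ids`, one without, and edges running start -> middle, middle -> end, start -> end); `check_pathfinder_topology` is a hypothetical name, not BTE's actual validation code.

```python
from typing import Optional


def check_pathfinder_topology(query_graph: dict) -> Optional[str]:
    """Return None if the query graph has the expected shape, else a reason string."""
    nodes = query_graph.get("nodes", {})
    edges = query_graph.get("edges", {})
    if len(nodes) != 3 or len(edges) != 3:
        return "expected exactly 3 nodes and 3 edges"
    # Node/edge labels are never inspected, only whether a node carries ids.
    pinned = sorted(key for key, node in nodes.items() if node.get("ids"))
    unpinned = [key for key in nodes if key not in pinned]
    if len(pinned) != 2 or len(unpinned) != 1:
        return "expected 2 nodes with ids and 1 without"
    middle = unpinned[0]
    directed = {(edge["subject"], edge["object"]) for edge in edges.values()}
    # Either pinned node may be the start, as long as all 3 edges agree on it.
    for start, end in (pinned, pinned[::-1]):
        if directed == {(start, middle), (middle, end), (start, end)}:
            return None
    return "edges must run start -> middle, middle -> end, and start -> end"


# Different labels and no categories on the ID nodes: should pass.
relabelled = {
    "nodes": {
        "src": {"ids": ["PUBCHEM.COMPOUND:5291"]},
        "mid": {"categories": ["biolink:NamedThing"]},
        "dst": {"ids": ["MONDO:0004979"]},
    },
    "edges": {
        "a": {"subject": "src", "object": "mid"},
        "b": {"subject": "mid", "object": "dst"},
        "c": {"subject": "src", "object": "dst"},
    },
}
# Only two edges: should be rejected.
two_edges = {
    "nodes": relabelled["nodes"],
    "edges": {key: relabelled["edges"][key] for key in ("a", "b")},
}
# One edge direction flipped: should be rejected.
flipped = {
    "nodes": relabelled["nodes"],
    "edges": {**relabelled["edges"], "b": {"subject": "dst", "object": "mid"}},
}

for name, graph in [("relabelled", relabelled), ("two_edges", two_edges), ("flipped", flipped)]:
    print(name, "->", check_pathfinder_topology(graph) or "ok")
```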

imatinib -> Meckel syndrome, type 3 (MONDO:0011821): (chem - disease) NEGATIVE CONTROL from previous comment

runs in 29s
0 results! after running all 3 templates.

Cases

Noting my possible answers and Sui's possible answers.

Case A (asthma) is an example of truncating the 1st template's results to get a 500 result set.

Case A (allergic asthma) and D have results/intermediate nodes that were found in multiple templates (showing that the merging code worked as-intended).

2 Case A (chem - disease) examples

imatinib (PUBCHEM.COMPOUND:5291) -> asthma (MONDO:0004979) (saved response):

runs in 2 min 11s
500 results
Only runs 1st template and prunes extra template results
found Sui's possible answers
- KIT: top result
- SCF (aka KITLG, KIT ligand): 359th result

imatinib -> allergic asthma (MONDO:0004784) (saved response):

runs in 1 min 38s
419 results
Runs all 3 templates, results only from the 1st and 3rd. Doesn't prune any template results.
found Sui's possible answers
- KIT: 15th result
- SCF (aka KITLG, KIT ligand): 279th result
found my possible answers
- immune response: 3rd result

Case B (chemical - gene) - currently running only 1 template

Resveratrol (PUBCHEM.COMPOUND:445154) -> glyoxalase, GLO1 (NCBIGene:2739) (saved response)

runs in 32s
84 results
Runs only 1 simple template
found 1 of Sui's possible answers
- NFE2L2: 3rd result

Case C (disease - disease)

Crohn Disease (MONDO:0005011) -> Parkinson Disease (MONDO:0005180) (saved response)

runs in 2 min 22s
393 results
Runs both templates, results from both. Doesn't prune any template results.
found all Sui's possible answers? (I'm not sure if Sui meant MOD2 gene or NOD2 gene. We have NOD2 variant rs2066842 as top result + NOD2 gene as 7th result)
- LRRK2: 2nd result
- PARK7: 3rd result

Case D (gene - disease)

SLC6A20 (NCBIGene:54716) -> COVID19 (MONDO:0100096) (saved response)

runs in 4 min 7s
116 results
Runs both templates, results from both. Doesn't prune any template results.
found Sui's possible answers
- ACE2: top result (graph includes glycine)
- CXCL8: result 9
found my possible answers
- glycine: 3rd result

Colleen Xu · Answer 26 · Sat Apr 06 2024 15:33:20 GMT+0800 (China Standard Time)

@rjawesome @tokebe

A problem: pathfinder doesn't find templates for Case B (chem - gene). I'm not sure what's going on.

Query I'm using

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["PUBCHEM.COMPOUND:445154"],
          "categories":["biolink:ChemicalEntity"],
          "name": "Resveratrol"
        },
        "un": {
          "categories": ["biolink:NamedThing"]
        },
        "n2": {
          "ids": ["NCBIGene:2739"],
          "categories":["biolink:Gene"],
          "name": "glyoxalase, GLO1"
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}

Jackson Callaghan · Answer 27 · Tue Apr 09 2024 23:00:33 GMT+0800 (China Standard Time)

@colleenXu I'll be working on the pathfinder prototype this week as Rohan is unavailable.

Regarding ARAX UI problems, that might be worth reporting to them -- otherwise it's a good note that we should trust our own JSON analysis first.

I'll take a look into the Case B issue.

Colleen Xu · Answer 28 · Wed Apr 10 2024 04:06:46 GMT+0800 (China Standard Time)

Reported in Translator architecture channel (link here)

Colleen Xu · Answer 29 · Thu Apr 11 2024 14:53:59 GMT+0800 (China Standard Time)

@tokebe Whoops I didn't set the pathfinder flag on the Case B template group. Added this in a recent commit. Haven't analyzed the behavior yet though.

Colleen Xu · Answer 30 · Fri Apr 12 2024 12:07:26 GMT+0800 (China Standard Time)

There's still a problem running Case B. The 2nd template runs quickly (1 min 1s), but returns a lot of results (4472). Inferred-mode then seems to get stuck on "merging" all of the results into 1 mega-result/creative-edge - it may take ~ 1 hour? And then Pathfinder also seems to get stuck finding the intermediate nodes (I didn't wait for it to complete).

I was thinking of Case B as testing multiple things that don't happen with the other cases:

run multiple templates and truncate/prune because the later template has a lot of template results - does the truncation/prune process work as-intended?
after that, how did the merging go? Some results/intermediate nodes should have been found in multiple templates.

@tokebe For tomorrow's deployments, I've made a branch pathfinder-simpleCaseB that doesn't use the 2nd chem-gene template. BTE will then successfully run the chem-gene example (CaseB) - but it won't find much.

Rohan Juneja · Answer 31 · Fri Apr 19 2024 08:45:51 GMT+0800 (China Standard Time)

Case B should be fixed. An unnecessary while loop in the inferred-mode handler was causing the issues. For the intermediate nodes, the "paths" involved were getting too long because they combined many edges from different template results. I changed it so each "path" only uses edges from one template result (i.e., each path only includes one pair of intermediate genes), while each intermediate node merges all the "paths" that include it.
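
A rough sketch of the grouping described above, to make the idea concrete. The data shape (each template result reduced to one ordered tuple of node IDs) and the name `group_paths_by_intermediate` are assumptions for illustration only; the actual handler works on TRAPI results and builds support graphs.

```python
from collections import defaultdict


def group_paths_by_intermediate(template_paths):
    """Map each intermediate node to the distinct paths that pass through it."""
    grouped = defaultdict(list)
    for path in template_paths:
        # One path per template result: edges from different results are never chained.
        for node in path[1:-1]:
            if path not in grouped[node]:
                grouped[node].append(path)
    return dict(grouped)


# Hypothetical template results for a chem -> gene query (placeholder IDs).
paths = [
    ("chem", "geneA", "geneB", "target"),
    ("chem", "geneA", "geneC", "target"),
    ("chem", "geneD", "target"),
]
grouped = group_paths_by_intermediate(paths)
for node, node_paths in grouped.items():
    print(node, len(node_paths))
```

Here `geneA` collects both paths that include it, while every individual path stays as short as the template result it came from.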

Colleen Xu · Answer 32 · Tue Apr 30 2024 14:00:56 GMT+0800 (China Standard Time)

Note:

In the Translator Architecture 4/23 call, the UI team said they'll handle "4-hop paths" (aka 4 edges long).

I think we'll stay at/under that limit with our current Pathfinder templates. All are 2-3 QEdges long.

There's 1 potential case where BTE would generate 5-edge paths: if it ran the 2nd/3rd "Chem-Disease" templates (3 QEdges) and results involved descendants of both the chemical and the disease starting-ID (+2 subclass_of edges). However, I think it's relatively rare for us to do subclass-expansion on chemical starting-IDs.
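
The arithmetic behind that note, as a tiny sketch. It assumes each subclass-expanded starting ID adds exactly one `subclass_of` edge to a path; the names are hypothetical.

```python
UI_MAX_EDGES = 4  # limit the UI team said they would handle (4/23 call)


def longest_path_edges(template_qedges: int, subclass_expanded_ends: int) -> int:
    """Edges in a result path: template QEdges plus one subclass_of edge per expanded end."""
    return template_qedges + subclass_expanded_ends


# 3-QEdge template with subclass expansion on one end stays within the limit.
print(longest_path_edges(3, 1) <= UI_MAX_EDGES)
# 3-QEdge template with expansion on both chemical and disease exceeds it.
print(longest_path_edges(3, 2) <= UI_MAX_EDGES)
```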

Jackson Callaghan · Answer 33 · Thu May 02 2024 01:29:57 GMT+0800 (China Standard Time)

@rjawesome Does your optimization change the output at all?

Rohan Juneja · Answer 34 · Thu May 02 2024 01:34:34 GMT+0800 (China Standard Time)

It basically just limits the length of result "paths," so it doesn't compute graphs that have more hops than what is specified in the template (excluding subclass hops).

Jackson Callaghan · Answer 35 · Thu May 02 2024 02:31:24 GMT+0800 (China Standard Time)

So, if I'm understanding correctly, you've changed the implementation to be more like that specified in the slides (building the new aux graphs by iterating over each template result), whereas before you were merging multiple template results and then performing a DFS on them?

If not, can you briefly describe the steps in your current implementation, comparing them to the approach in the slides?

Rohan Juneja · Answer 36 · Thu May 02 2024 03:23:43 GMT+0800 (China Standard Time)

So, if I'm understanding correctly, you've changed the implementation to be more like that specified in the slides (building the new aux graphs by iterating over each template result), whereas before you were merging multiple template results and then performing a DFS on them?

Yes.

Colleen Xu · Answer 37 · Thu May 02 2024 14:14:16 GMT+0800 (China Standard Time)

@tokebe @rjawesome

I think there's a problem! The new code is giving different output with fewer results, missing KG edges, and different scores.

I saw this with Case A allergic asthma:

previous response json: 419 results
current response json: 344 results. (It uses different API registrations for TRAPI 1.5, but don't worry about that)

Actual Pathfinder TRAPI query

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["PUBCHEM.COMPOUND:5291"],
          "categories":["biolink:ChemicalEntity"],
          "name": "imatinib"
        },
        "un": {
          "categories": ["biolink:NamedThing"]
        },
        "n2": {
          "ids": ["MONDO:0004784"],
          "categories":["biolink:DiseaseOrPhenotypicFeature"],
          "name": "allergic asthma"
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": ["biolink:related_to"],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}

Here's what I found when digging in:

If you compare the logs, the template-runs appear identical. So I don't think this is coming from different records/data. Templates 1 + 3 have the same number of nodes / edges / results / queries returning records / API list. Template 2 terminates early for both because no records were found for a QEdge.
However, I see console logs in the current run that show that nodes are being pruned. My previous note says no template results were pruned in the previous run, so I don't think nodes were pruned.

Console logs from the current run:

  bte:biothings-explorer-trapi:inferred-mode pruning creative combinedResponse nodes/edges... +0ms
  bte:biothings-explorer-trapi:inferred-mode pruned 75 nodes, 246 edges, 0 auxGraphs from combinedResponse. +4ms
  bte:biothings-explorer-trapi:pathfinder [Pathfinder]: Performing search for intermediate nodes. +2m
  bte:biothings-explorer-trapi:pathfinder [Pathfinder]: Pathfinder found 344 intermediate nodes and created 1032 support graphs. +28ms
  bte:biothings-explorer-trapi:inferred-mode pruning creative combinedResponse nodes/edges... +32ms
  bte:biothings-explorer-trapi:inferred-mode pruned 0 nodes, 1 edges, 386 auxGraphs from combinedResponse. +7ms

Example of a missing result: CSF1R (NCBIGene:1436). It's the top result in the previous run, but it's completely missing from the current run.
Example of result with missing edges/different score: AKT1 (NCBIGene:207). See the screenshots below. I think the scores differ because of the one missing edge which has a PMID publication.

Screenshots of AKT1 result showing missing edge/diff score

Previous run:

Current run:

Rohan Juneja · Answer 38 · Fri May 03 2024 08:28:09 GMT+0800 (China Standard Time)

I accidentally introduced a bug when speeding up the while loop that assigned support graph suffixes in the inferred mode handler. Should be fixed now.

Colleen Xu · Answer 39 · Wed May 08 2024 06:25:59 GMT+0800 (China Standard Time)

It looks good!

EDIT: First, I reran all the "working" cases (not Case B).

For Case A allergic asthma (new saved response), I now see the same number of results (419), KG nodes and edges, and aux-graphs as before. And for all the cases, the interesting results from before are still present.

I see some differences between the runs now and the previous runs, but I think these are okay:

Some results have lower scores now, and it seems to be from moving some publication urls to source_record_urls
Case A asthma (new saved response): I see slightly fewer edges (-1) and aux-graphs (-4) compared to before. I suspect different results were pruned (same scores, but not in the same order).
Case C (new saved response): now has 500 results, vs 393 previously! But it looks reasonable
- Based on the logs, some KP APIs returned more records - probably because they were updated. And COHD API was down, so we didn't get records from it.
- The second template got >800 results (saved response of just this template). I double-checked and found that the lowest-scoring template-results aren't present in the Pathfinder response (ex: "Genetic Loci" UMLS:C0678933 or NPY NCBIGene:4852), so I think the truncation happened correctly.
Case D (new saved response): now has 114 results, vs 116 previously. But it looks reasonable
- Based on the logs, slightly fewer edges were returned. This could be because COHD was down or other updates to KP APIs.

Some cases ran faster than before:

Case A asthma (1 min 47 s vs 2 min 11s before)

Other cases ran slower than before:

Case A allergic asthma (1 min 53 s vs 1 min 38s before)
Case C (2 min 43 s vs 2 min 22 s before) - but there were more results returned
Case D (4 min 21s vs 4 min 7s before) - w/ slightly fewer results returned

Colleen Xu · Answer 40 · Fri May 10 2024 14:59:56 GMT+0800 (China Standard Time)

@tokebe @rjawesome

Something else is going on with Pathfinder and Case B, and I can't tell if it's okay or a sign of a truncation problem.

The good news is that it now ran both templates in 2 min 16 s (much better than running forever!). As a reminder, the second template returns >4000 results (>1000 nodes and >7000 edges) that need truncating.

Here's a Google Drive folder w/ my Case B Pathfinder run and an old run of the 2nd template I'm comparing it to (it's not an exact match to the Pathfinder's 2nd template run, but I think it's close enough for what I want to demonstrate).

What I'm seeing: while there's only 500 results in the Pathfinder run...

in the Pathfinder KG nodes and edges, there are intermediate nodes that aren't featured/bound to "un" in a result. They're coming from very low-scoring template 2 results. It seems odd that these intermediate nodes and their edges still exist in the Pathfinder response... but maybe they're connected to high-scoring/result intermediate nodes and are being kept as part of those intermediate nodes' result KGs? I'm not sure. Examples:
- NCBIGene:821 CANX from the 4001st result in the old template 2 run
- NCBIGene:7266 DNAJC7 from the 4005th result in the old template 2 run
while there are only 500 results, there are > 1000 KG nodes and > 9000 KG edges. That seems like a lot, if the Pathfinder response's results only have node-bindings to the top 500 intermediate nodes. It doesn't seem like many nodes/edges from template 2 were truncated from the Pathfinder response.

I didn't notice any truncation issues for Case A asthma and Case C (see my previous notes).

Jackson Callaghan · Answer 41 · Sat May 11 2024 04:37:46 GMT+0800 (China Standard Time)

There should definitely be a large number of nodes and edges that aren't bound to a result directly -- we'd expect a lot of nodes that are exclusively bound to an edge used in a support graph for an edge bound to a result, which could leave a lot of extra nodes and edges that don't have an immediately obvious reason for existing.

It could still be the case that there are nodes and edges that aren't properly truncated. I think the only way we can meaningfully check this is by writing a script that parses a response and checks that every node/edge somehow links (directly or indirectly) to a result. It would have to start with the results and work its way out, building lists of bound edges/nodes/support graph IDs, and then check those lists against the actual KG and support graph set. @rjawesome could you put together such a script? We'd probably want to adapt it to an integration test later, so it would see use beyond just checking this one time.
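
A rough sketch of what such a check could look like, starting from the results and working outward. It is only an illustration, not the test added later in this thread: `find_unreachable` is a hypothetical name, and it assumes support graphs are referenced either from an edge's `biolink:support_graphs` attribute or from an analysis-level `support_graphs` list.

```python
SUPPORT_ATTR = "biolink:support_graphs"


def find_unreachable(message: dict) -> dict:
    """Return KG nodes, KG edges and aux graphs that no result links to, directly or indirectly."""
    kg = message["knowledge_graph"]
    aux_graphs = message.get("auxiliary_graphs") or {}
    used_nodes, used_edges, used_aux = set(), set(), set()
    pending_edges = []

    def visit_aux(graph_id):
        # Queue a support graph's edges the first time the graph is seen.
        if graph_id not in used_aux:
            used_aux.add(graph_id)
            pending_edges.extend(aux_graphs.get(graph_id, {}).get("edges", []))

    for result in message.get("results", []):
        for bindings in result["node_bindings"].values():
            used_nodes.update(binding["id"] for binding in bindings)
        for analysis in result.get("analyses", []):
            for bindings in analysis["edge_bindings"].values():
                pending_edges.extend(binding["id"] for binding in bindings)
            for graph_id in analysis.get("support_graphs") or []:
                visit_aux(graph_id)

    while pending_edges:
        edge_id = pending_edges.pop()
        if edge_id in used_edges or edge_id not in kg["edges"]:
            continue
        used_edges.add(edge_id)
        edge = kg["edges"][edge_id]
        used_nodes.update((edge["subject"], edge["object"]))
        for attr in edge.get("attributes") or []:
            if attr.get("attribute_type_id") == SUPPORT_ATTR:
                for graph_id in attr.get("value", []):
                    visit_aux(graph_id)

    return {
        "nodes": set(kg["nodes"]) - used_nodes,
        "edges": set(kg["edges"]) - used_edges,
        "auxiliary_graphs": set(aux_graphs) - used_aux,
    }


# Toy response with one deliberately orphaned node, edge and support graph.
message = {
    "knowledge_graph": {
        "nodes": {"A": {}, "B": {}, "C": {}, "orphan": {}},
        "edges": {
            "inferred-A-C": {
                "subject": "A",
                "object": "C",
                "attributes": [{"attribute_type_id": SUPPORT_ATTR, "value": ["support0"]}],
            },
            "A-B": {"subject": "A", "object": "B"},
            "B-C": {"subject": "B", "object": "C"},
            "stray": {"subject": "orphan", "object": "C"},
        },
    },
    "auxiliary_graphs": {
        "support0": {"edges": ["A-B", "B-C"]},
        "unused": {"edges": ["stray"]},
    },
    "results": [
        {
            "node_bindings": {"n0": [{"id": "A"}], "n2": [{"id": "C"}]},
            "analyses": [{"edge_bindings": {"e2": [{"id": "inferred-A-C"}]}}],
        }
    ],
}
leftovers = find_unreachable(message)
print(leftovers)
```

Node `B` is not bound to any result but is still reachable through `support0`, which is the situation described above; only `orphan`, `stray` and `unused` would indicate a pruning problem.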

Rohan Juneja · Answer 42 · Sat May 11 2024 10:26:26 GMT+0800 (China Standard Time)

I added a test here for pathfinder in particular: https://github.com/biothings/bte_trapi_query_graph_handler/blob/894bbb0e53148035ab73cd44ca4f22e3af5e6fb1/__test__/unittest/pathfinder.test.ts#L103-L146
If pfResponse were read from a file, then this could function as a "script" to check any given TRAPI response.

Jackson Callaghan · Answer 43 · Wed May 15 2024 03:37:43 GMT+0800 (China Standard Time)

Did some messing around with @rjawesome's test to make a script and was able to confirm that yes, pruning is working as expected. Case B just creates huge support graphs, which results in many, many edges.