biothings / biothings_explorer

TRAPI service for BioThings Explorer

Home Page:https://api.bte.ncats.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

TRAPI 1.5: set_interpretation/MCQ

colleenXu opened this issue · comments

See notes from 2024-03-27 group meeting.

I have a lot of concerns / questions, and we're going to ask the Translator group for clarification. We'll start with asking about the basic requirements / deadlines.

This is all of my initial understanding. There seems to be a LOT more going on, which we'll track in the opening post of this issue.

[EDIT: PAUSE FOR DISCUSSION. NEEDS MORE INVESTIGATION ON WHAT THE SPEC IS, based on Eric's reply in Translator Slack]

Previously, we implemented QNode is_set behavior (original issue, behavior with ID/node-expansion).

  • default behavior (property missing or is_set: false): each result has 1 KG node bound to each QNode
  • but if a QNode has is_set: true, a result can have >= 1 KG node bound to that QNode. AKA there's a merging/consolidation.

Feature

In TRAPI 1.5, is_set is replaced with set_interpretation, which has more explicit rules for results-assembly (PR, lines 881-896). It's an optional property with string values (enum).

  • default behavior (property missing or null) == "BATCH": same as before, each result has 1 KG node bound to each QNode
  • "MANY": same as previous is_set: true behavior.
    • Note: This new specification only seems to cover when QNodes have multiple starting IDs. But I'd like to keep our current use of is_set:true / set_interpretation: MANY on QNodes with no starting IDs to merge/consolidate results.
  • "ALL": new behavior. This should only be set on QNodes that have multiple starting IDs/entities. Similar to the "MANY" behavior, but only keep results that contain all starting IDs/entities.
    • AKA if only some of the starting IDs/entities are in the consolidated result, it should be thrown out (and any KG nodes/edges unique to it should be pruned).

Examples

All use the same basic query, just setting set_interpretation to different values. I used HP IDs with no descendants (because ID/node-expansion triggers an automatic use of is_set: true, see #555 (comment))

set to BATCH (default)

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "BATCH"
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 39 results: current_default.json. This was generated with current default (not setting is_set)

set to MANY

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "MANY"
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 37 results (rather than 39): current_is_set.json. This was generated with current is_set: true.

set to ALL

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "ALL"
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 2 results (rather than 39 or 37). Only 2 disease entities are connected to both starting entities. See the first two results of current_is_set.json. This was generated with current is_set: true.

  • MONDO:0009003 (achromatopsia 2)
  • MONDO:0013560 (Hermansky-Pudlak syndrome 8)

Complications that need discussion

(1) ID/node expansion

Currently, if we find descendants of a starting ID (ID/node expansion), we set that starting ID's QNode to is_set: true #555 (comment). Can we remove this behavior?

  • the current behavior has unintended consequences, like being completely unable to do is_set: false / set_interpretation: BATCH for some queries
  • but I'm not sure if we depend on this behavior (we want to keep the behavior of "subclass_of edges using different descendant IDs are kept in the same result")
  • Context: we implemented this with an old version of representing subclass info (comment). Now we use "constructed edges" + aux-graphs (issue)
Example

I expect 11 results for following query (not setting is_set), but end up with 10 results which is the same as setting is_set: true.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0007800", "HP:0025586"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

This happens because ID/node-expansion finds a descendant for 1 of the starting IDs. Console logs:

  bte:biothings-explorer-trapi:main Expanded ids for node n0: (2 ids -> 3 ids) +0ms
  bte:biothings-explorer-trapi:main Added is_set:true to node n0 +1ms

Note to self with another example

This query has 130 results whether is_set: true or not, when it should have >= 134 results when not. It also has some subclass_of edges/aux-graphs, but I'm not sure if it's a good test for seeing if ID/node-expansion becomes wonky after the changes.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0003259", "HP:0000110"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

(2) Unclear what the KG Node is_set property is for

Asked in Translator Slack:
The PR for set_interpretation also adds an is_set property to KG Nodes (lines 1011-1017). I'm not sure if this is meant to be used, and how (merging KG Nodes??).

Eric's reply in Translator Slack - needs more investigation.

(3) Clarifying an edge case

Asked in Translator Slack:
If set_interpretation is set on a QNode with multiple starting IDs, but these IDs all map to the same entity (using NodeNorm), then there isn't any set behavior to do. Is that fine? Does there need to be any log noting this?