TRAPI 1.5: set_interpretation/MCQ

Question

TRAPI 1.5: set_interpretation/MCQ

colleenXu opened this issue 4 months ago · comments

See notes from 2024-03-27 group meeting.

I have a lot of concerns / questions, and we're going to ask the Translator group for clarification. We'll start with asking about the basic requirements / deadlines.

Colleen Xu · Answer 1 · Tue Mar 26 2024 11:07:59 GMT+0800 (China Standard Time)

This is all of my initial understanding. There seems to be a LOT more going on, which we'll track in the opening post of this issue.

[EDIT: PAUSE FOR DISCUSSION. NEEDS MORE INVESTIGATION ON WHAT THE SPEC IS, based on Eric's reply in Translator Slack]

Previously, we implemented QNode is_set behavior (original issue, behavior with ID/node-expansion).

default behavior (property missing or is_set: false): each result has 1 KG node bound to each QNode
but if a QNode has is_set: true, a result can have >= 1 KG node bound to that QNode. AKA there's a merging/consolidation.

Feature

In TRAPI 1.5, is_set is replaced with set_interpretation, which has more explicit rules for results-assembly (PR, lines 881-896). It's an optional property with string values (enum).

default behavior (property missing or null) == "BATCH": same as before, each result has 1 KG node bound to each QNode
"MANY": same as previous is_set: true behavior.
- Note: This new specification only seems to cover when QNodes have multiple starting IDs. But I'd like to keep our current use of is_set:true / set_interpretation: MANY on QNodes with no starting IDs to merge/consolidate results.
"ALL": new behavior. This should only be set on QNodes that have multiple starting IDs/entities. Similar to the "MANY" behavior, but only keep results that contain all starting IDs/entities.
- AKA if only some of the starting IDs/entities are in the consolidated result, it should be thrown out (and any KG nodes/edges unique to it should be pruned).

Examples

All use the same basic query, just setting set_interpretation to different values. I used HP IDs with no descendants (because ID/node-expansion triggers an automatic use of is_set: true, see #555 (comment))

set to BATCH (default)

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "BATCH"
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 39 results: current_default.json. This was generated with current default (not setting is_set)

set to MANY

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "MANY"
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 37 results (rather than 39): current_is_set.json. This was generated with current is_set: true.

set to ALL

Query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0500041", "HP:0007750"],
                    "set_interpretation": "ALL"
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

Response should have 2 results (rather than 39 or 37). Only 2 disease entities are connected to both starting entities. See the first two results of current_is_set.json. This was generated with current is_set: true.

MONDO:0009003 (achromatopsia 2)
MONDO:0013560 (Hermansky-Pudlak syndrome 8)

Complications that need discussion

(1) ID/node expansion

Currently, if we find descendants of a starting ID (ID/node expansion), we set that starting ID's QNode to is_set: true #555 (comment). Can we remove this behavior?

the current behavior has unintended consequences, like being completely unable to do is_set: false / set_interpretation: BATCH for some queries
but I'm not sure if we depend on this behavior (we want to keep the behavior of "subclass_of edges using different descendant IDs are kept in the same result")
Context: we implemented this with an old version of representing subclass info (comment). Now we use "constructed edges" + aux-graphs (issue)

Example

I expect 11 results for following query (not setting is_set), but end up with 10 results which is the same as setting is_set: true.

response to the query below, not setting is_set: 10_not_setting_is_set.json
VS response setting is_set 10_is_set_true.json

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0007800", "HP:0025586"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

This happens because ID/node-expansion finds a descendant for 1 of the starting IDs. Console logs:

  bte:biothings-explorer-trapi:main Expanded ids for node n0: (2 ids -> 3 ids) +0ms
  bte:biothings-explorer-trapi:main Added is_set:true to node n0 +1ms

Note to self with another example

This query has 130 results whether is_set: true or not, when it should have >= 134 results when not. It also has some subclass_of edges/aux-graphs, but I'm not sure if it's a good test for seeing if ID/node-expansion becomes wonky after the changes.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories":["biolink:PhenotypicFeature"],
                    "ids":["HP:0003259", "HP:0000110"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               }
            },
            "edges": {
                "eA": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:phenotype_of"]
                }
            }
        }
    }
}

(2) Unclear what the KG Node `is_set` property is for

Asked in Translator Slack:
The PR for set_interpretation also adds an is_set property to KG Nodes (lines 1011-1017). I'm not sure if this is meant to be used, and how (merging KG Nodes??).

Eric's reply in Translator Slack - needs more investigation.

(3) Clarifying an edge case

Asked in Translator Slack:
If set_interpretation is set on a QNode with multiple starting IDs, but these IDs all map to the same entity (using NodeNorm), then there isn't any set behavior to do. Is that fine? Does there need to be any log noting this?

TRAPI 1.5: set_interpretation/MCQ

Feature

Examples

Complications that need discussion

(1) ID/node expansion

(2) Unclear what the KG Node is_set property is for

(3) Clarifying an edge case

(2) Unclear what the KG Node `is_set` property is for