implement edge attribute constraints
andrewsu opened this issue · comments
We originally proposed edge attribute constraints in the context of TRAPI 1.3 and #482, but I'm breaking this out into its own ticket.
We have a solid use case for edge constraints proposed in this query template for the CQS:
https://github.com/TranslatorSRI/CQS/blob/main/templates/mvp1-templates/mvp1-template4-bte-aeolus/mvp1-template5-service-provider-aeolus.json
The key bit is here, attempting to apply a minimum threshold on the biolink:evidence_count from AEOLUS.
"message": {
  "query_graph": {
    "edges": {
      "e0": {
        "predicates": [
          "biolink:applied_to_treat"
        ],
        "subject": "n0",
        "object": "n1",
        "attribute_constraints": [
          {
            "id": "biolink:evidence_count",
            "operator": ">",
            "value": 20
          }
        ]
      }
    },
    ...
There are (at least) two things that need to be done/checked:
- Filtering of API responses: I'm assuming that it will be easier to do the edge attribute filtering after the subquery, rather than trying to adjust the subquery itself.
- Aggregation of multiple values: In this Slack message, @colleenXu pointed out a case where evidence_count is provided as a multi-element array (example below). In this case, I think it is reasonable to apply the constraint to the sum of the evidence_counts.
{
  "edges": {
    "1feea171db6394cfd9bcb20deae0ad9a": {
      "predicate": "biolink:applied_to_treat",
      "subject": "PUBCHEM.COMPOUND:3386",
      "object": "MONDO:0002050",
      "attributes": [
        {
          "attribute_type_id": "biolink:evidence_count",
          "value": [
            733,
            42
          ]
        }
      ],
      "sources": [
        {
          "resource_id": "infores:aeolus",
          "resource_role": "primary_knowledge_source"
        },
        {
          "resource_id": "infores:mychem-info",
          "resource_role": "aggregator_knowledge_source",
          "upstream_resource_ids": [
            "infores:aeolus"
          ]
        },
        {
          "resource_id": "infores:service-provider-trapi",
          "resource_role": "aggregator_knowledge_source",
          "upstream_resource_ids": [
            "infores:mychem-info"
          ]
        }
      ]
    }
  }
}
From my perspective, there are 3 issues at play here:
1: edge-attribute value type
BTE is currently returning this edge-attribute in Dev/CI instances (ref: commit).
However, the value type is currently an array of ints.
The examples below are from this example BTE response, show-edge-attribute-issue.json, which runs the POST version of this query to MyChem.
Example of a 1-element array:
"dd9daae5b03bcad0698ff6669090f36b": {
  "predicate": "biolink:applied_to_treat",
  "subject": "PUBCHEM.COMPOUND:3386",
  "object": "MEDDRA:10070592",
  "attributes": [
    {
      "attribute_type_id": "biolink:evidence_count",
      "value": [
        875
      ]
    }
  ],
Example of a multi-element array: the 733 count from "Depression" and the 42 count from "Depressed mood" were put in the same edge/edge-attribute, since both MedDRA IDs mapped to the "MONDO:0002050 (depressive disorder)" entity.
"1feea171db6394cfd9bcb20deae0ad9a": {
  "predicate": "biolink:applied_to_treat",
  "subject": "PUBCHEM.COMPOUND:3386",
  "object": "MONDO:0002050",
  "attributes": [
    {
      "attribute_type_id": "biolink:evidence_count",
      "value": [
        733,
        42
      ]
    }
  ],
I suggest flattening these into ints, because the array will probably cause validation issues (biolink-model says attribute values should be int) and it'll make the edge-attribute constraint easier to implement.
But we'll need to decide what to do with the multi-element arrays. These happen because MyChem has separate MedDRA indication IDs, but BTE/NodeNorm maps them to the same entity. BTE then merges those records into the same edge and concatenates the counts in the edge-attribute value. I think we could either:
- add the counts from separate records together
- create separate edges for different counts (add to hash?)
(Note: I'm not sure about flattening all 1-element arrays in edge-attributes. biolink:publications may be one example where we always want it to be an array, but we'd need to check with TRAPI folks first...)
2: what to do with the previous effort - a default, hard-coded count limit
EDIT: SEE UPDATE BELOW - we've implemented this.
We've been trying to add a hard-coded count limit of 20 to our MyChem queries #727 (comment), similar to what we do with SEMMEDDB.
I was able to add it to the aeolusTreats operation (chem -> disease, commit), which all instances are using.
Old notes on reverse operation
But this hasn't been done for the reverse operation aeolusTreats-rev (disease -> chem), which is what creative-mode uses. In discussions last week (three Slack links), we finally reached consensus on next steps:
- by adjusting the x-bte annotation, I can get partway there. See this commit (special-reverses branch)
- next is writing/implementing the BTE JQ-post-processing to remove the hits when the aeolus.indications field is empty. While this should be quick, I'm unsure of the logic to use and would need to discuss with Jackson...
  - something super-specific, that only works on responses from this operation?
  - generic-ish: "remove hits if this is a BioThings API AND supportBatch is false AND the scopes field specified in the request body (aeolus.indications.meddra in this case) isn't in the hit".
    - the "BioThings API only" and "supportBatch is false" should match "special reverses" - which are the only cases where we'd need this logic
    - I don't have other current x-bte annotation examples where this would be useful. I suspect that it may be useful for writing reverse x-bte operations for MyChem chembl drug-mechanism and drugcentral bioactivity
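To make the "generic-ish" rule concrete, here is a rough sketch in Python. It is for illustration only: per the notes above, the real post-processing would be written in JQ inside BTE, and the names `has_path` and `filter_hits`, the `is_biothings`/`support_batch` arguments, and the shape of the hit objects are all assumptions rather than BTE code.

```python
from typing import Any


def has_path(hit: Any, dotted_path: str) -> bool:
    """Return True if the dotted field path resolves to a non-empty value.

    Lists are traversed, so "aeolus.indications.meddra" matches when any
    element of the indications list carries a non-empty meddra field.
    """
    def walk(node: Any, keys: list) -> bool:
        if isinstance(node, list):
            return any(walk(item, keys) for item in node)
        if not keys:
            return node not in (None, "", [], {})
        if isinstance(node, dict) and keys[0] in node:
            return walk(node[keys[0]], keys[1:])
        return False

    return walk(hit, dotted_path.split("."))


def filter_hits(hits: list, scopes_field: str,
                is_biothings: bool, support_batch: bool) -> list:
    """Drop hits lacking the scopes field, but only for "special reverses"."""
    if not is_biothings or support_batch:
        return hits  # rule does not apply; leave the response untouched
    return [hit for hit in hits if has_path(hit, scopes_field)]
```

With this rule, a hit that has no `aeolus.indications` entries at all would be removed for the reverse operation, while responses from batch-capable operations would pass through unchanged.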
But if we want to implement TRAPI-query edge-attribute constraints, it's not clear if we want to go forward with this. An edge-attribute constraint < 20 would conflict with this hard-coded limit.
3: how to implement this issue's ask: TRAPI-query edge-attribute constraints
This is still up for discussion:
- how generic/general we want our approach to be
- how quickly we can do this
- do we still want a default, hard-coded count limit for these MyChem aeolus indication operations (ex: when an edge-attribute constraint isn't specified)?
Idea: if an edge-attribute constraint is specified...
- after running all sub-queries/building records, filter the records down to those that have the constrained edge-attributes and whose values meet the criteria (need to double-check that this matches the TRAPI spec).
  - pros: this seems to be the easiest to do (conceptually easy, less chance of bugs)
  - cons: wasted effort getting records we'll later throw out
- for BioThings APIs, transform the constraint into part of the query
I had another idea of transforming the constraint into part of the BioThings API query using the x-bte annotation templating and info in the response-mapping, but this would be complex and more effort to think through and implement.
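As a rough illustration of the first idea (filtering after the sub-queries), here is a minimal Python sketch. It assumes the attribute values have already been flattened to scalars, handles only the `>`, `<` and `==` operators, and ignores anything else the TRAPI spec may define for attribute constraints. The function names are made up for this example and are not BTE code.

```python
import operator

# Subset of comparison operators handled by this sketch (assumption:
# the full operator list should be taken from the TRAPI spec).
OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def edge_meets_constraint(edge: dict, constraint: dict) -> bool:
    """True if the edge has the constrained attribute and its value passes."""
    compare = OPERATORS[constraint["operator"]]  # KeyError if unsupported
    return any(
        attr.get("attribute_type_id") == constraint["id"]
        and compare(attr.get("value"), constraint["value"])
        for attr in edge.get("attributes", [])
    )


def filter_edges(edges: dict, constraints: list) -> dict:
    """Keep only edges that satisfy every constraint on the query edge."""
    return {
        edge_id: edge
        for edge_id, edge in edges.items()
        if all(edge_meets_constraint(edge, c) for c in constraints)
    }
```

Note that in this sketch an edge without the constrained attribute is dropped, which matches the idea above but should be checked against the spec.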
During today's group meeting, we made decisions on issues (1) and (3) above:
3: how to implement TRAPI-query edge-attribute constraints
- agreed to do this after retrieving the sub-query response (vs transforming the constraint into part of the sub-query API call)
- there were two ways to do this: we picked "at/after edge-merging" because we want to keep all the counts that are getting "merged" and add them together
- the alternative was applying the constraint at record-creation time, which would throw out individual records where the count < limit. Its benefit would be having fewer records to work with in later steps
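A small sketch of why the order matters, in Python. Records are simplified here to hypothetical `(edge_key, count)` pairs; the 733 and 42 counts come from the example earlier in this thread, while the 15 and 12 counts and both edge keys are invented for illustration.

```python
from collections import defaultdict


def merge_then_filter(records, limit):
    """Sum counts per merged edge first, then apply the constraint."""
    totals = defaultdict(int)
    for edge_key, count in records:
        totals[edge_key] += count
    return {key: total for key, total in totals.items() if total > limit}


def filter_then_merge(records, limit):
    """Apply the constraint per record first, then sum what is left."""
    totals = defaultdict(int)
    for edge_key, count in records:
        if count > limit:
            totals[edge_key] += count
    return dict(totals)


records = [
    ("edge-A", 733), ("edge-A", 42),  # counts from the example above
    ("edge-B", 15), ("edge-B", 12),   # invented counts, each below the limit
]
print(merge_then_filter(records, 20))  # edge-B survives with a sum of 27
print(filter_then_merge(records, 20))  # edge-B is dropped entirely
```

Filtering at record-creation time would lose edge-B even though its combined count passes the threshold, which is the case the "at/after edge-merging" choice is meant to preserve.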
1: how to flatten multi-element arrays
- agreed to add the counts from multi-element arrays together to create a single int, without doing any steps to remove/throw out counts beforehand. This makes the most conceptual sense.
- use a config list of attribute-type-ids to control when we do this. For now, that list will just include this attribute-type-id (biolink:evidence_count)
- then apply the constraint to those sums
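The flattening step agreed above might look roughly like this in Python. The config name `SUMMABLE_ATTRIBUTE_TYPE_IDS` and the function `flatten_attributes` are hypothetical; only the attribute-type-id in the list comes from this discussion.

```python
# Hypothetical config list: attribute types whose array values get summed.
SUMMABLE_ATTRIBUTE_TYPE_IDS = {"biolink:evidence_count"}


def flatten_attributes(attributes: list) -> list:
    """Return a copy of the attributes with configured array values summed."""
    flattened = []
    for attr in attributes:
        value = attr.get("value")
        if (attr.get("attribute_type_id") in SUMMABLE_ATTRIBUTE_TYPE_IDS
                and isinstance(value, list)):
            # e.g. [733, 42] becomes 775; a 1-element array becomes its element
            attr = {**attr, "value": sum(value)}
        flattened.append(attr)
    return flattened
```

Attribute types that are not in the config list, such as biolink:publications, keep their array values, which leaves the open question about always-array attributes untouched.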
Jackson @tokebe estimated that this would take ~2 days of work. But as part of this effort, they'll review the TRAPI yaml/spec docs for QEdge.attribute_constraints expectations and requirements - and they'll decide how full/robust of an implementation to do for this issue.
Note that I've made a PR to ask about this template TranslatorSRI/CQS#9:
- tool queried (BTE vs Service-Provider-TRAPI)
- attribute-constraint's value type - right now it's a string, which is confusing
- the attribute-type-id
Update! The template PR TranslatorSRI/CQS#9 has been merged. So the current template is https://github.com/TranslatorSRI/CQS/blob/main/templates/mvp1-templates/mvp1-template4-bte-aeolus/mvp1-template5-service-provider-aeolus.json
Changes:
- attribute-constraint value type is now an int
- using Service-Provider-TRAPI, rather than BTE, then using a separate scoring service. Going to try this out.
Update on issue 2 above
The hard-coded/default/MyChem-query-level limit is now live in Dev/CI! See the details in #727 (comment)
And noting that Issue 1 was also addressed last month as part of #727 (comment)
Leaving just Issue 3 - the edge attribute constraint implementation itself (first post, later decision)