biothings / biothings_explorer

TRAPI service for BioThings Explorer

Home Page: https://api.bte.ncats.io

implement edge attribute constraints

andrewsu opened this issue

We originally proposed edge attribute constraints in the context of TRAPI 1.3 and #482, but I'm breaking this out into its own ticket.

We have a solid use case for edge constraints proposed in this query template for the CQS:
https://github.com/TranslatorSRI/CQS/blob/main/templates/mvp1-templates/mvp1-template4-bte-aeolus/mvp1-template5-service-provider-aeolus.json

The key bit is here, attempting to apply a minimum threshold on the biolink:evidence_count from AEOLUS.

    "message": {
        "query_graph": {
            "edges": {
                "e0": {
                    "predicates": [
                        "biolink:applied_to_treat"
                    ],
                    "subject": "n0",
                    "object": "n1",
                    "attribute_constraints": [
                        {
                            "id": "biolink:evidence_count",
                            "operator": ">",
                            "value": 20
                        }
                    ]
                }
            },
...

There are (at least) two items that need to be done or checked:

  • Filtering of API responses: I'm assuming that it will be easier to do the edge attribute filtering after the subquery, rather than trying to adjust the subquery itself.
  • Aggregation of multiple values: In this Slack message, @colleenXu pointed out a case where evidence_count is provided as a multi-element array (example below). In this case, I think it is reasonable to apply the constraint to the sum of the evidence_counts.
{
  "edges": {
    "1feea171db6394cfd9bcb20deae0ad9a": {
      "predicate": "biolink:applied_to_treat",
      "subject": "PUBCHEM.COMPOUND:3386",
      "object": "MONDO:0002050",
      "attributes": [
        {
          "attribute_type_id": "biolink:evidence_count",
          "value": [
            733,
            42
          ]
        }
      ],
      "sources": [
        {
          "resource_id": "infores:aeolus",
          "resource_role": "primary_knowledge_source"
        },
        {
          "resource_id": "infores:mychem-info",
          "resource_role": "aggregator_knowledge_source",
          "upstream_resource_ids": [
            "infores:aeolus"
          ]
        },
        {
          "resource_id": "infores:service-provider-trapi",
          "resource_role": "aggregator_knowledge_source",
          "upstream_resource_ids": [
            "infores:mychem-info"
          ]
        }
      ]
    }
  }
}
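The "apply the constraint to the sum" idea can be sketched as follows. This is only an illustration under stated assumptions: the function name `edge_meets_constraint` and the small `OPERATORS` table are hypothetical, only three comparison operators are handled, and treating an edge with no matching attribute as failing the constraint is an assumption that should be checked against the TRAPI spec.

```python
import operator

# Hypothetical operator table; only the comparisons needed for this sketch.
OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def edge_meets_constraint(edge, constraint):
    """Return True if the (summed) attribute value satisfies the constraint."""
    compare = OPERATORS[constraint["operator"]]
    for attribute in edge.get("attributes", []):
        if attribute["attribute_type_id"] != constraint["id"]:
            continue
        value = attribute["value"]
        # Multi-element arrays are collapsed to their sum before comparing.
        total = sum(value) if isinstance(value, list) else value
        return compare(total, constraint["value"])
    # Assumption: an edge without the attribute does not satisfy the constraint.
    return False


edge = {
    "predicate": "biolink:applied_to_treat",
    "attributes": [
        {"attribute_type_id": "biolink:evidence_count", "value": [733, 42]}
    ],
}
constraint = {"id": "biolink:evidence_count", "operator": ">", "value": 20}
print(edge_meets_constraint(edge, constraint))  # compares 733 + 42 = 775 against 20
```

For the example edge above, the sum 775 is compared against the threshold, so the edge is kept.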

From my perspective, there are 3 issues at play here:

1: edge-attribute value type

BTE is currently returning this edge-attribute in Dev/CI instances (ref: commit).

However, the value type is currently an array of ints (examples below).

These come from the example BTE response show-edge-attribute-issue.json, which runs the POST version of this query against MyChem.

Example of a 1-element array:

                "dd9daae5b03bcad0698ff6669090f36b": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MEDDRA:10070592",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": [
                                875
                            ]
                        }
                    ],

Example of a multi-element array: the 733 count from "Depression" and the 42 count from "Depressed mood" were put in the same edge/edge-attribute, since both MedDRA IDs mapped to the "MONDO:0002050 (depressive disorder)" entity.

                "1feea171db6394cfd9bcb20deae0ad9a": {
                    "predicate": "biolink:applied_to_treat",
                    "subject": "PUBCHEM.COMPOUND:3386",
                    "object": "MONDO:0002050",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:evidence_count",
                            "value": [
                                733,
                                42
                            ]
                        }
                    ],

I suggest flattening these into ints, because the array will probably cause validation issues (biolink-model says attribute values should be int) and it'll make the edge-attribute constraint easier to implement.

But we'll need to decide what to do with the multi-element arrays. These happen because MyChem has separate MedDRA indication IDs, but BTE/NodeNorm maps them to the same entity. BTE then merges those records into the same edge and concatenates the counts in the edge-attribute value. I think we could either:

  • add the counts from separate records together
  • create separate edges for different counts (add to hash?)

(Note: I'm not sure about flattening all 1-element arrays in edge-attributes. biolink:publications may be one example where we always want it to be an array, but we'd need to check with TRAPI folks first...)

2: what to do with the previous effort - a default, hard-coded count limit

EDIT: SEE UPDATE BELOW - we've implemented this.

We've been trying to add a hard-coded count limit of 20 to our MyChem queries (see #727 (comment)), similar to what we do with SEMMEDDB.

I was able to add it to the aeolusTreats operation (chem -> disease, commit), which all instances are using.

Old notes on reverse operation

But this hasn't been done for the reverse operation aeolusTreats-rev (disease -> chem), which is what creative-mode uses. In discussions last week (three Slack links), we finally reached consensus on next steps:

  • by adjusting the x-bte annotation, I can get partway there. See this commit (special-reverses branch)
  • next is writing/implementing the BTE JQ-post-processing to remove the hits when the aeolus.indications field is empty. While this should be quick, I'm unsure of the logic to use and would need to discuss with Jackson...
    • something super-specific, that only works on responses from this operation?
    • generic-ish: "remove hits if this is a BioThings API AND supportBatch is false AND the scopes field specified in the request body (aeolus.indications.meddra in this case) isn't in the hit".
      • the "BioThings API only" and "supportBatch is false" should match "special reverses" - which are the only cases where we'd need this logic
      • I don't have other current x-bte annotation examples where this would be useful. I suspect that it may be useful for writing reverse x-bte operations for MyChem chembl drug-mechanism and drugcentral bioactivity
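The "generic-ish" rule above could look roughly like the sketch below. Everything here is hypothetical: the function names, the boolean flags, and the shape of the hit (a nested dict where `aeolus.indications` is a list of objects) are assumptions for illustration, not BTE's actual post-processing code.

```python
def field_present(hit, dotted_path):
    """Return True if dotted_path resolves to a non-empty value in hit."""
    current = [hit]
    for key in dotted_path.split("."):
        next_level = []
        for item in current:
            # A field may hold one object or a list of objects.
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        current = next_level
        if not current:
            return False
    return any(value not in (None, [], {}, "") for value in current)


def keep_hit(hit, is_biothings_api, support_batch, scopes_field):
    """Apply the rule only to 'special reverses'; keep every other hit."""
    if is_biothings_api and not support_batch:
        return field_present(hit, scopes_field)
    return True


hits = [
    {"_id": "A", "aeolus": {"indications": [{"meddra": "10012378"}]}},
    {"_id": "B", "aeolus": {"indications": []}},
    {"_id": "C"},
]
kept = [h["_id"] for h in hits if keep_hit(h, True, False, "aeolus.indications.meddra")]
print(kept)
```

Only the hit that actually contains the scopes field survives; when the operation is not a "special reverse", all hits pass through unchanged.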

But if we want to implement TRAPI-query edge-attribute constraints, it's not clear whether we want to go forward with this: an edge-attribute constraint asking for counts < 20 would conflict with this hard-coded limit.

3: how to implement this issue's ask: TRAPI-query edge-attribute constraints

This is still up for discussion:

  • how generic/general we want our approach to be
  • how quickly we can do this
  • do we still want a default, hard-coded count limit for these MyChem aeolus indication operations (ex: when an edge-attribute constraint isn't specified)?

Idea: if an edge-attribute constraint is specified...

  • after running all sub-queries/building records, filter the records to only those that have the edge-attributes + the values meet the criteria (need to double-check that this is the TRAPI spec).
    • pros: this seems to be the easiest to do (conceptually easy, less chance of bugs)
    • cons: wasted effort getting records we'll later throw out
  • for BioThings APIs, transform the constraint into part of the query

I had another idea of transforming the constraint into part of the BioThings API query using the x-bte annotation templating and info in the response-mapping, but this would be complex and more effort to think through and implement.

During today's group meeting, we made decisions on issues (1) and (3) above:

3: how to implement TRAPI-query edge-attribute constraints

  • agreed to do this after retrieving the sub-query response (vs transforming the constraint into part of the sub-query API call)
  • two ways to do this: we picked "at/after edge-merging" because we want to keep all the counts that are getting "merged" and add them together
    • VS applying the constraint at record creation time, which would throw out individual records where the count < limit. The benefit is having fewer records to work with in future steps
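A small worked comparison of the two options may help show why the order matters. The function names and the counts 15 and 10 are made up for illustration; the threshold of 20 matches the template's constraint.

```python
THRESHOLD = 20


def filter_then_merge(counts, threshold=THRESHOLD):
    """Apply the constraint per record, then sum whatever survives."""
    kept = [count for count in counts if count > threshold]
    return sum(kept) if kept else None  # None means the edge is dropped


def merge_then_filter(counts, threshold=THRESHOLD):
    """Sum all merged records first, then apply the constraint once."""
    total = sum(counts)
    return total if total > threshold else None


# Hypothetical records that merge into one edge: each is below the limit,
# but together they exceed it.
counts = [15, 10]
print(filter_then_merge(counts))  # edge dropped: neither record passes alone
print(merge_then_filter(counts))  # edge kept: 15 + 10 = 25 exceeds 20
```

With the chosen "at/after edge-merging" approach, the merged edge is kept with a count of 25, whereas filtering at record creation would have discarded both records before they could be added together.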

1: how to flatten multi-element arrays

  • agreed to add the counts from multi-element arrays together to create an int, without doing any steps to remove/throw out counts beforehand. This makes the most conceptual sense.
    • use a config list of attribute-type-ids to control when we do this. For now, that list will just include this attribute-type-id (biolink:evidence_count)
  • then apply constraint to those sums
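The config-driven flattening agreed on above might be sketched like this. The constant `SUMMED_ATTRIBUTE_TYPE_IDS` and the function `flatten_attributes` are hypothetical names, and the publication IDs are placeholders; the point is only that listed attribute-type-ids get summed while everything else (for example biolink:publications) stays an array.

```python
# Hypothetical config list: attribute-type-ids whose array values are summed.
SUMMED_ATTRIBUTE_TYPE_IDS = {"biolink:evidence_count"}


def flatten_attributes(attributes, summed_ids=SUMMED_ATTRIBUTE_TYPE_IDS):
    """Return a copy of attributes with configured array values summed to ints."""
    flattened = []
    for attribute in attributes:
        value = attribute["value"]
        if attribute["attribute_type_id"] in summed_ids and isinstance(value, list):
            # Copy the attribute so the merged record itself is not mutated.
            attribute = {**attribute, "value": sum(value)}
        flattened.append(attribute)
    return flattened


attributes = [
    {"attribute_type_id": "biolink:evidence_count", "value": [733, 42]},
    # Placeholder IDs, only to show that unlisted attributes stay arrays.
    {"attribute_type_id": "biolink:publications", "value": ["PMID:1", "PMID:2"]},
]
for attribute in flatten_attributes(attributes):
    print(attribute["attribute_type_id"], attribute["value"])
```

The constraint would then be evaluated against the resulting int rather than against the original array.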

Jackson @tokebe estimated that this would take ~2 days of work. But as part of this effort, they'll review the TRAPI yaml/spec docs for QEdge.attribute_constraints expectations and requirements - and they'll decide how full/robust of an implementation to do for this issue.

Note that I've made a PR (TranslatorSRI/CQS#9) to ask about the following in this template:

  • tool queried (BTE vs Service-Provider-TRAPI)
  • attribute-constraint's value type - right now it's a string which is confusing
  • the attribute-type-id

Update! The template PR TranslatorSRI/CQS#9 has been merged. So the current template is https://github.com/TranslatorSRI/CQS/blob/main/templates/mvp1-templates/mvp1-template4-bte-aeolus/mvp1-template5-service-provider-aeolus.json

Changes:

  • attribute-constraint value type is now an int
  • using Service-Provider-TRAPI, rather than BTE, then using a separate scoring service. Going to try this out.

Update on issue 2 above

The hard-coded/default/MyChem-query-level limit is now live in Dev/CI! See the details in #727 (comment)

And noting that Issue 1 was also addressed last month as part of #727 (comment)

Leaving just Issue 3 - the edge attribute constraint implementation itself (first post, later decision)