adjust SmartAPI yaml, x-bte annotation for Biolink/Monarch API migration

colleenXu opened this issue 5 months ago · comments

EDIT: see below for update, actually migrating to v3 https://api-v3.monarchinitiative.org/v3/docs#/

We are using Biolink/Monarch API v1, which will soon be shut down and replaced by v2 http://api-v2.monarchinitiative.org/api.

So we'll want to adjust the SmartAPI yaml using the v2's swagger spec + adjust the x-bte annotation if needed.

What's unclear at the moment:

Are the endpoints / response-format the same in v2?
Does v2 have better performance? related to #773
Are there more endpoints that we'd want annotated/as x-bte operations? (maybe this can be a separate issue, not needed for the migration right now)

Colleen Xu · Answer 1 · Sat Jan 20 2024 06:31:04 GMT+0800 (China Standard Time)

Jackson @tokebe noticed some increased request failures, so I updated the SmartAPI yaml / registration to use the v2 server url (see lab Slack convo). We'll monitor to see if there's any improvement.

I checked every x-bte operation and didn't notice any issues with migrating to v2 - so it seems like the endpoints / response-format were the same.
However, it wasn't clear to me if there was a speed boost when using v2:
- when directly querying the APIs, v2 did appear to respond faster than v1, but I found it hard to test (v1 would be faster if v2 was run first? later queries would be similar and fast, maybe due to caching?)
- my BTE local using v2 seemed to run similarly/slower than BTE-dev (which was using v1 at the time). And both seemed to run significantly slower than the direct queries. So I suspect that the post-processing is taking most of the time for the sub-query.

Potential queries for directly comparing v1 and v2:

pancreas (anatomy) -> gene
- v1 took me 1 min 10 s at first, then 17 s on quick retest: https://api.monarchinitiative.org/api/bioentity/anatomy/UBERON:0001264/genes?rows=-1&direct=true&unselect_evidence=true&taxon=NCBITaxon:9606
- v2 took me 17 s each time (when queried twice in short succession): http://api-v2.monarchinitiative.org/api/bioentity/anatomy/UBERON:0001264/genes?rows=-1&direct=true&unselect_evidence=true&taxon=NCBITaxon:9606
spinal cord (anatomy) -> gene
- v1 took me 30 s: https://api.monarchinitiative.org/api/bioentity/anatomy/UBERON:0002240/genes?rows=-1&direct=true&unselect_evidence=true&taxon=NCBITaxon:9606
- v2 took me 10 s: http://api-v2.monarchinitiative.org/api/bioentity/anatomy/UBERON:0002240/genes?rows=-1&direct=true&unselect_evidence=true&taxon=NCBITaxon:9606

Kevin Schaper · Answer 2 · Tue Jan 23 2024 02:45:11 GMT+0800 (China Standard Time)

Hi @colleenXu,

We're shutting down api.monarchinitiative.org, and our new production api is served from api-v3.monarchinitiative.org. As a transition to let people know that api.monarchinitiative.org is going away, we're planning to put a message up on that host but continue to make it available on another hostname - we picked api-v2 for that, but unfortunately it does make total sense that it would appear to be the replacement.

The v3 api format is different, but the good news is that we should be better able to address performance problems (within limits). The v3 api is served from the new core graph, which is built on the biolink data model with new ingests.

Side note, I'm actually not seeing any direct gene expression for spinal cord or pancreas in the new graph:

http://api-v3.monarchinitiative.org/v3/api/association?predicate=biolink:expressed_in&object=UBERON:0001264&direct=true

http://api-v3.monarchinitiative.org/v3/api/association?predicate=biolink:expressed_in&object=UBERON:0002240&direct=true

I created an issue for specifying the subject/object taxon, and a second issue to look at our gene expression ingests.

Colleen Xu · Answer 3 · Wed Jan 31 2024 13:43:57 GMT+0800 (China Standard Time)

[EDITED w/ updated info]

Latest info on the Biolink/monarch migration to v3 https://api-v3.monarchinitiative.org/v3/docs#/:

begins Feb 7 with shutdown of old v1 service https://api.monarchinitiative.org/api/
the API version we currently use (v2) should still be available until March 20
- it's the same as the http://api-biolink.monarchinitiative.org in that blog post
- info from Kevin Schaper (Translator Slack link): "Either one is ok, they're just DNS entries for the same VM - I added api-biolink.mi.org because I realized that it made total sense to assume that api-v2 comes after api, and I wanted to avoid that confusion."
by March 20, we need to be fully using v3 for all instances

So the next steps are:

I get a simple SmartAPI yaml + x-bte annotation done for the new version v3, with like 1-2 operations written -> DONE Jan 31
hand it off to Jackson @tokebe with some example raw queries / what data we'd like to pull out of it, so they can start working on the api-response-transform changes -> DONE

Colleen Xu · Answer 4 · Thu Feb 01 2024 04:28:21 GMT+0800 (China Standard Time)

Info from Kevin Schaper

On using api-biolink vs api-v2 for now:

Either one is ok, they're just DNS entries for the same VM - I added api-biolink.mi.org because I realized that it made total sense to assume that api-v2 comes after api, and I wanted to avoid that confusion.

On what endpoint to use:

The new API is very much biolink/kgx-centric, I'm guessing you'll just be using the /associations endpoint, which largely uses biolink slots for filtering, with biolink predicates & categories as values, etc.

Colleen Xu · Answer 5 · Fri Feb 02 2024 16:37:33 GMT+0800 (China Standard Time)

Notes

On writing SmartAPI yaml

using their OpenAPI spec (downloaded 2/20, converted json ➡️ yaml) as a starting point
made several changes so SmartAPI editor could validate the yaml
- downgraded to OpenAPI 3.0.3 (SmartAPI editor doesn't support 3.1)
- added servers section
- commented out lines that the editor said had errors:
  - type: 'null'
  - examples
commented out the / endpoints (not necessary?)
wrote parts of the info section (contact, description, termsOfService url, title) using information in the paper and on the website

Querying the v3 API

(this is an old note on the download parameter, which the association endpoint doesn't have anymore) ~~setting the download parameter as false often didn't work - I'd be prompted to download the response as a file. Instead, not specifying the download parameter at all seemed to work best~~
using the association endpoint after feedback from Kevin Schaper (Translator Slack post)
- has more support than the entity endpoints (ref: Kevin's comment)
- can set the input ID specifically as the subject or object (can't do that with the entity endpoints)
- more complex querying possible: subject/object category, taxon, namespace
these are GET queries, so only 1 input ID at a time (not batch)
each query returns only 500 items
- I encountered error 500 (Internal Server Error) when trying to set the limit parameter to > 500 or to -1 (-1 worked w/ the old API to return all hits)
- we'd need code changes to support "scrolling" GET queries to get all the items (involving the offset parameter and total field in the response)
useful stuff for finding examples, possible associations:
- entity/{id}: can see what kinds of things are connected to that input ID (association_counts field)
- AssociationCategory enum in openapi spec
- AssociationPredicate enum in openapi spec
- EntityCategory enum in openapi spec
old data/operations that are no longer available, but were in the v1/v2 API (keeping these v1 links as examples, but they're broken now that the v1 API has been shut down)
- Disease ⭤ Pathway (example)
- any relationship with Variants (dbSNP/rsIDs): note that DiseaseOrPhenotypicFeatureToGeneticInheritanceAssociation connects to HP terms for inheritance (like autosomal dominant)
- gene ortholog ⭤ Disease
- gene ortholog ⭤ Phenotype
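
The "scrolling" idea noted above (stepping the offset parameter until the total field is reached) could look roughly like this. It is only a sketch, not BTE code: `fetch_page` is a hypothetical callable that does one GET request and returns the parsed JSON, and the `items` key is an assumption about the response layout (the notes only mention the offset parameter and the total field).

```python
from typing import Callable, Dict, List

PAGE_SIZE = 500  # largest limit that worked in the notes above


def fetch_all_items(
    fetch_page: Callable[[int, int], Dict],
    page_size: int = PAGE_SIZE,
) -> List[dict]:
    """Collect every item by stepping `offset` until `total` is reached.

    `fetch_page(offset, limit)` is hypothetical: it should run one GET query
    and return the parsed JSON response. The "items" key is an assumption.
    """
    items: List[dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        batch = page.get("items", [])
        items.extend(batch)
        offset += page_size
        # Stop once we have stepped past the reported total,
        # or if a page comes back empty (guards against looping forever).
        if not batch or offset >= page.get("total", 0):
            break
    return items
```

Passing the fetcher in as a callable keeps the paging logic separate from the HTTP client, so it can be tried against a fake response before touching the real API.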

(@kevinschaper and any others working on the Monarch API may find this post interesting)

Colleen Xu · Answer 6 · Fri Feb 02 2024 16:38:23 GMT+0800 (China Standard Time)

[defunct: using association endpoint instead]

On BTE post-processing

A. Directionality

We query with an input ID, which will match to the subject or object fields in the hit/item depending on the association type (which is fixed in the biolink-model canonical predicate direction).

If the input ID matches the subject, then each item's direction field == outgoing and the output entity ID will be in the object field...

check, only keep items where the input ID exactly matches the value of the field subject
- This API traverses ontologies and can return edges for the descendants of the input ID -> we don't want BTE to erroneously create edges to the input/starting ID that don't actually exist
check, only keep items where the operation's output namespace matches the value of the field object_namespace
- While I haven't seen examples of this yet, the previous API could return data with multiple namespaces. We don't want BTE misparsing those, and BTE/x-bte operations are only set up to handle 1 input/output namespace at a time. use case for #748

VS If the input ID matches the object, then each item's direction field == incoming and the output entity ID will be in the subject field...

check, only keep items where the input ID exactly matches the value of the field object
check, only keep items where the operation's output namespace matches the value of the field subject_namespace
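
For reference, the two checks described in this section (exact match on the input ID, and a namespace match on the output side) could be sketched like this. The function name and its arguments are made up for illustration; only the field names subject, object, subject_namespace and object_namespace come from the notes above.

```python
from typing import Dict, List


def keep_direct_items(
    items: List[Dict],
    input_id: str,
    input_field: str,
    output_namespace: str,
) -> List[Dict]:
    """Keep items whose `input_field` ("subject" or "object") is exactly the
    input ID and whose opposite side has the expected namespace."""
    if input_field not in ("subject", "object"):
        raise ValueError("input_field must be 'subject' or 'object'")
    output_field = "object" if input_field == "subject" else "subject"

    kept = []
    for item in items:
        # Drop edges that actually belong to a descendant of the input ID.
        if item.get(input_field) != input_id:
            continue
        # Drop edges whose output ID uses a namespace the operation can't handle.
        if item.get(f"{output_field}_namespace") != output_namespace:
            continue
        kept.append(item)
    return kept
```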

B. Publications

For now: within an item/hit, only keep elements in the publications field array that have the prefix PMID. These will be in the format PMID:24468074.

I've noticed other kinds of elements like:

OMIM curies
orphanet curies

Also, there's a publications_links field but we may need special logic to decide when to use the publications_links.id (for PMID) vs publications_links.url (for other kinds of references?).
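
The "only keep PMIDs for now" rule above is simple enough to sketch. The function name is hypothetical, and this ignores the publications_links question entirely.

```python
from typing import List, Optional


def keep_pmids(publications: Optional[List[str]]) -> List[str]:
    """Return only the elements that look like PMID curies (e.g. PMID:24468074).

    Other kinds of elements (OMIM curies, orphanet curies, ...) are dropped.
    A missing or empty publications field gives an empty list.
    """
    return [
        pub
        for pub in (publications or [])
        if isinstance(pub, str) and pub.startswith("PMID:")
    ]
```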

Colleen Xu · Answer 7 · Sat Feb 03 2024 06:01:48 GMT+0800 (China Standard Time)

[defunct: using association endpoint instead]

Queries to test the post-processing checks

input ID matches subject field

When querying Monarch API for Disease autosomal dominant cerebellar ataxia (MONDO:0020380) -> PhenotypicFeature, a lot of edges are returned that connect to a subclass of that disease instead: https://api-v3.monarchinitiative.org/v3/api/entity/MONDO:0020380/biolink:DiseaseToPhenotypicFeatureAssociation?format=json&limit=30&offset=0

When we query this API through BTE (saved response: example.json), we find multiple examples that BTE is handling this correctly...aka it isn't creating incorrect edges between the input ID and the phenotypes that are actually connected to the subclasses:

results where there's only 1 edge, which has aux-graphs. This means there were no hits/items/records where the input ID was the subject. BTE correctly didn't make any direct edges.
- (7) Difficulty walking
- (10) Gait imbalance
- (11) Horizontal nystagmus
results where the publications in the direct edge don't match the publications in the aux-graph edges. So BTE correctly didn't merge direct edges and edges to the subclasses.
- (2) direct edge has PMID:36516086 but indirect edge for spinocerebellar ataxia 45 (MONDO:0033480) has PMID:29053796
- (24) direct edge has no publications, but indirect edge for spinocerebellar ataxia type 38 (MONDO:0014417) has PMID:25065913

input ID matches object field

When querying Monarch API for PhenotypicFeature Clinodactyly (HP:0030084) -> Disease, a lot of edges are returned that connect to a subclass of that pheno instead: https://api-v3.monarchinitiative.org/v3/api/entity/HP:0030084/biolink:DiseaseToPhenotypicFeatureAssociation?format=json&limit=30&offset=0

When we query this API through BTE (saved response: example-2.json), we find multiple examples that BTE is handling this correctly...aka it isn't creating incorrect edges between the input ID and the diseases that are actually connected to the subclasses:

results where there's only 1 edge, which has aux-graphs. This means there were no hits/items/records where the input ID was the object. BTE correctly didn't make any direct edges.
- (1) trisomy 8p
- (6) paternal uniparental disomy of chromosome 14
- (7) rhizomelic limb shortening with dysmorphic features

Colleen Xu · Answer 8 · Sat Feb 03 2024 15:37:19 GMT+0800 (China Standard Time)

Stuff to follow up on later?

double-check on the use of infores:monarchinitiative (switching to this for info.x-translator.infores) vs infores:biolink-api (what we were using before).
- may involve adjusting the xref wiki page, to make more obvious that we are using their non-TRAPI API?
- BTE is still using this field to set the upstream resource ID for the bte/service-provider source element...
- We made the switch because when using biolink-api, BTE doesn't generate a source-object for it and the provenance chain would be wonky (probably because of the post-processing to instead use the API response provenance info).

Example of wonky behavior

The edge source info would look like this:

service-provider trapi says biolink-api is upstream of it
but there's no entry for biolink-api (and...then monarchinitiative should be upstream?)
then there's entries for monarchinitiative and its upstream sources (which include the primary). These come from post-processing the raw API response.

                    "sources": [
                        {
                            "resource_id": "infores:hpo-annotations",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:hpo-annotations"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:biolink-api"
                            ]
                        }
                    ]
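
One way to spot this kind of broken chain automatically: check that every ID listed in upstream_resource_ids also has its own entry in the sources array. This is just a sketch of that check (the function name is made up, and it isn't something BTE does as far as I know).

```python
from typing import Dict, List


def missing_upstream_ids(sources: List[Dict]) -> List[str]:
    """Return upstream resource IDs that have no source object of their own."""
    present = {source["resource_id"] for source in sources}
    missing: List[str] = []
    for source in sources:
        for upstream in source.get("upstream_resource_ids", []):
            if upstream not in present and upstream not in missing:
                missing.append(upstream)
    return missing
```

Run on the sources array above, this would report infores:biolink-api as the only dangling upstream.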

ability to do "scrolling" GET queries to get all the items (involving the offset parameter and total field in the response). Currently, we only get 500 items per input ID/query (noted in previous post "Querying the v3 API")
investigate the new query options? subject/object category, taxon, namespace: example
- look at what adds coverage, is good to have as separate operations (namespace, species context?)
annotating more MetaEdges (not covered by past operations)

MetaEdges:

Chem to Pathway: unclear how helpful this is, since chemicals seem generic (water, ADP, ATP...). Example. 1 Predicate: participates_in
- their prefix Reactome differs from what we use (REACT)...so this may require extra post-processing support
  (depends on how helpful setting the subject/object namespace is)
- unclear if other Pathway namespaces exist
Gene to Pathway: previously chose not to annotate because MyGene also covers this info. Also has prefix issue (see Chem to Pathway above). 1 Predicate: participates_in
Gene to GO BiologicalProcess (989349 items): previously chose not to annotate because MyGene also covers this info. Each kind has multiple possible predicates, lots of diff primary knowledge sources
- actively_involved_in (797927)
- acts_upstream_of_or_within (180729)
- acts_upstream_of (9327)
- acts_upstream_of_or_within_positive_effect (507)
- acts_upstream_of_positive_effect (506)
- acts_upstream_of_or_within_negative_effect (178)
- acts_upstream_of_negative_effect (175)
Gene to GO MolecularActivity (848151 items): see notes for BiologicalProcess above
- enables (841330)
- contributes_to (6821)
Gene to GO CellularComponent (745837 items): see notes for BiologicalProcess above
- located_in (502225)
- active_in (145515)
- part_of (94049)
- colocalizes_with (4048)
Gene to Gene ortholog: previously chose not to annotate because MyGene also covers this info. 1 predicate (orthologous_to, 551383 hits). Seems to be 1 primary knowledge source (panther)

Colleen Xu · Answer 9 · Tue Feb 06 2024 11:33:55 GMT+0800 (China Standard Time)

Jackson @tokebe:

I changed the x-bte annotation to use the associations endpoint:

has more support than the entity endpoints (ref: Kevin's comment)
upcoming improvements to querying: subject/object category, taxon, namespace

So now the post-processing is different, but hopefully simpler...

STILL NEED:

publications: same post-processing as before.

Publication info from old comment

B. Publications

For now: within an item/hit, only keep elements in the publications field array that have the prefix PMID. These will be in the format PMID:24468074.

I've noticed other kinds of elements like:

OMIM curies
orphanet curies

Also, there's a publications_links field but we may need special logic to decide when to use the publications_links.id (for PMID) vs publications_links.url (for other kinds of references?).

DON'T NEED:

Now input ID should exactly match the subject or object field, so we don't need to check/filter.
- the input ID is explicitly set as the subject or object in the query parameters (couldn't do that with the entity endpoint)
- the query parameters are set to direct edges only (direct: true) - so API shouldn't do any ontology-traversal/expansion to the input ID. Example: lots of hits for autosomal dominant cerebellar ataxia if direct: false, but none if direct: true
checking that output namespace matches subject_namespace or object_namespace field (depending on direction)
- now using the query parameters to explicitly set the namespaces of the subject and object field's IDs.
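
To make the query setup above concrete, here's a rough sketch of building an association-endpoint URL with the input ID pinned to the subject or object, both namespaces set, and direct edges only. The function and its argument names are hypothetical; the parameter names and values are the ones used in the association queries in this thread.

```python
from urllib.parse import urlencode

BASE_URL = "https://api-v3.monarchinitiative.org/v3/api/association"


def build_association_url(
    input_id: str,
    input_field: str,  # "subject" or "object"
    category: str,
    predicate: str,
    subject_namespace: str,
    object_namespace: str,
    limit: int = 500,
) -> str:
    """Build a GET url for one input ID (these queries are not batch)."""
    if input_field not in ("subject", "object"):
        raise ValueError("input_field must be 'subject' or 'object'")
    params = {
        input_field: input_id,  # explicitly pin the input ID to one side
        "category": category,
        "subject_namespace": subject_namespace,
        "predicate": predicate,
        "object_namespace": object_namespace,
        "direct": "true",  # direct edges only, no ontology expansion
        "format": "json",
        "limit": limit,
    }
    # safe=":" keeps curies readable (HGNC:7551 rather than HGNC%3A7551)
    return f"{BASE_URL}?{urlencode(params, safe=':')}"
```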

Colleen Xu · Answer 10 · Thu Feb 08 2024 03:42:49 GMT+0800 (China Standard Time)

[EDITED to add info on what we learned / addressed while working on the API post-processing]

Update

The basic set of updates is done:

SmartAPI yaml w/ x-bte annotation covers all the association-types we covered in the old API that are still available in the new v3 API
tested all operations w/ Jackson's updated post-processing (PR, based on my comment above), and all are working as-expected

Working on

Jackson @tokebe discussed the following, and they're going to try it out: doing post-processing on the primary_knowledge_source and aggregator_knowledge_source response fields, creating a new, custom field formatted as a TRAPI edge sources array (array of objects). BTE can then ingest it with the same response-mapping key trapi_sources as Multiomics/Text-Mining APIs.

Example

first hit in https://api-v3.monarchinitiative.org/v3/api/association?category=biolink:CausalGeneToDiseaseAssociation&subject=HGNC:11138&predicate=biolink:causes&direct=true&format=json&limit=10&offset=0

A. "primary_knowledge_source": "infores:omim" (value of this field is always a string: infores curie)
➡️ element for TRAPI sources array

{ 
    "resource_id": "infores:omim", 
    "resource_role": "primary_knowledge_source"
}

B. "aggregator_knowledge_source": ["infores:monarchinitiative", "infores:medgen"]. Value of this field is always an array of string infores-curies, in order from furthest to closest to the primary source. So medgen has omim (the primary source) as its upstream.
➡️ >=1 elements for TRAPI sources array

{ 
    "resource_id": "infores:medgen", 
    "resource_role": "aggregator_knowledge_source",
    "upstream_resource_ids": ["infores:omim"]
},
{ 
    "resource_id": "infores:monarchinitiative", 
    "resource_role": "aggregator_knowledge_source",
    "upstream_resource_ids": ["infores:medgen"]
},

Putting this together: create a new, custom field with the TRAPI sources array

{
    "sources": [
        { 
            "resource_id": "infores:omim", 
            "resource_role": "primary_knowledge_source"
        },
        { 
            "resource_id": "infores:medgen", 
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": ["infores:omim"]
        },
        { 
            "resource_id": "infores:monarchinitiative", 
            "resource_role": "aggregator_knowledge_source",
            "upstream_resource_ids": ["infores:medgen"]
        }
    ]
}
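
The steps A and B above can be sketched in a few lines. This is not the actual post-processing code, just an illustration under the assumption stated above that the aggregator array is ordered from furthest to closest to the primary source; the function name is made up.

```python
from typing import Dict, List


def build_trapi_sources(primary: str, aggregators: List[str]) -> List[Dict]:
    """Turn primary_knowledge_source + aggregator_knowledge_source values
    into a TRAPI-style sources array.

    Assumes `aggregators` is ordered furthest -> closest to the primary
    source, so we walk it in reverse and chain each one to the previous.
    """
    sources: List[Dict] = [
        {"resource_id": primary, "resource_role": "primary_knowledge_source"}
    ]
    upstream = primary
    for aggregator in reversed(aggregators):
        sources.append(
            {
                "resource_id": aggregator,
                "resource_role": "aggregator_knowledge_source",
                "upstream_resource_ids": [upstream],
            }
        )
        upstream = aggregator
    return sources
```

Calling it with "infores:omim" and ["infores:monarchinitiative", "infores:medgen"] gives the same three source objects as the example above.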

implementation notes

commented out x-bte operation source field: BTE was ignoring this info because it is using the post-processed sources info instead (from response-mapping trapi_sources)
the aggregator knowledge source array is in a meaningful order (ref: Kevin Schaper, Translator Slack link). We're therefore assuming that the array is in order from furthest -> closest to primary source, so we can include upstream-resource-id info in the source objects
- ex: bte/service provider ➡️ monarchinitiative (1st aggregator entry) ➡️ medgen (2nd aggregator entry) ➡️ omim (primary).
found examples of same subject/predicate/object but different provenance (using biogrid vs string) -> our decision is that records/hits should only be merged if they have the exact same provenance. Handled with biothings/api-respone-transform.js@3534b23

Example showing this

Send the following TRAPI query to Monarch API only, through BTE:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["HGNC:7551"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

BTE should make the following requests:

https://api-v3.monarchinitiative.org/v3/api/association?subject=HGNC:7551&category=biolink:PairwiseGeneToGeneInteraction&subject_namespace=HGNC&predicate=biolink:interacts_with&object_namespace=HGNC&direct=true&format=json&limit=500
- retrieves 2 records showing relationships with TRIM63 (HGNC:16007 / NCBIGene:84676): 1 from biogrid (w/ PMID:19850579) and 1 from string
https://api-v3.monarchinitiative.org/v3/api/association?object=HGNC:7551&category=biolink:PairwiseGeneToGeneInteraction&subject_namespace=HGNC&predicate=biolink:interacts_with&object_namespace=HGNC&direct=true&format=json&limit=500
- retrieves 2 other records showing relationships with TRIM63: 1 from biogrid (w/ diff PMID:18157088) and 1 from string

Then bundle these into two Edges: 1 for biogrid and 1 for string

                "313161c093025842c0f60162954b3340": {
                    "predicate": "biolink:interacts_with",
                    "subject": "NCBIGene:4607",
                    "object": "NCBIGene:84676",
                    "attributes": [
                        {
                            "attribute_type_id": "biolink:publications",
                            "value": [
                                "PMID:19850579",
                                "PMID:18157088"
                            ],
                            "value_type_id": "linkml:Uriorcurie"
                        }
                    ],
                    "sources": [
                        {
                            "resource_id": "infores:biogrid",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:biogrid"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:monarchinitiative"
                            ]
                        }
                    ]
                },
                "7e8fb0a590bff1f4fc71564d36bd2bc5": {
                    "predicate": "biolink:interacts_with",
                    "subject": "NCBIGene:4607",
                    "object": "NCBIGene:84676",
                    "attributes": [],
                    "sources": [
                        {
                            "resource_id": "infores:string",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:string"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:monarchinitiative"
                            ]
                        }
                    ]
                },

A similar example is TTN (HGNC:12403 / NCBIGene:7273)
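
The merge rule (only bundle records with the exact same provenance) boils down to including the provenance in the grouping key. Below is a rough sketch, not the code from the commit mentioned above. The record shape is hypothetical: it assumes each record was already normalized to input_id / output_id (so both query directions line up), alongside the primary_knowledge_source, aggregator_knowledge_source and publications fields.

```python
from typing import Dict, List, Tuple


def merge_by_provenance(records: List[Dict]) -> List[Dict]:
    """Merge records that share input, predicate, output AND provenance.

    Publications from merged records are combined (order kept, no repeats).
    Records with different provenance stay as separate edges.
    """
    merged: Dict[Tuple, Dict] = {}
    for record in records:
        aggregators = tuple(record.get("aggregator_knowledge_source", []))
        key = (
            record["input_id"],
            record["predicate"],
            record["output_id"],
            record["primary_knowledge_source"],
            aggregators,
        )
        if key not in merged:
            merged[key] = {
                "input_id": record["input_id"],
                "predicate": record["predicate"],
                "output_id": record["output_id"],
                "primary_knowledge_source": record["primary_knowledge_source"],
                "aggregator_knowledge_source": list(aggregators),
                "publications": [],
            }
        for pub in record.get("publications", []):
            if pub not in merged[key]["publications"]:
                merged[key]["publications"].append(pub)
    return list(merged.values())
```

With the four records from the example above (2 biogrid, 2 string), this would produce two edges: the biogrid one carrying both PMIDs, and the string one with no publications.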

Colleen Xu · Answer 11 · Fri Feb 09 2024 04:35:31 GMT+0800 (China Standard Time)

Knowledge source infores IDs used by this resource

From Kevin Schaper (Translator Slack link)

NOT all of these infores IDs actually exist in the infores registry (v3-monarch-nonexist-infores.txt), or they may exist and not have complete entries/xref wiki pages. For the infores IDs I've seen in the responses, the following have issues:
- medgen: no xref to wiki page. Okay because it's an aggregator?
- orphanet: no xref to wiki page. Is a primary source
- biogrid: no xref to wiki page. Is a primary source
currently only have 2 aggregators at most
- future changes?
  - monarchinitiative/medgen/omim line to monarchinitiative/hpo-annotations/medgen/omim
  - phenio/etc lines to monarchinitiative/phenio/etc (maybe a bug right now?).

current possible knowledge source combos on edges

aggregator knowledge source	primary knowledge source
infores:monarchinitiative	infores:agbase
infores:monarchinitiative	infores:alzheimers-university-of-toronto
infores:monarchinitiative	infores:aruk-ucl
infores:monarchinitiative	infores:bgee
infores:monarchinitiative	infores:bhf-ucl
infores:monarchinitiative	infores:biogrid
infores:monarchinitiative	infores:cacao
infores:monarchinitiative	infores:cafa
infores:monarchinitiative	infores:complexportal
infores:monarchinitiative	infores:dflat
infores:monarchinitiative	infores:dibu
infores:monarchinitiative	infores:dictybase
infores:monarchinitiative	infores:disprot
infores:monarchinitiative	infores:ensembl
infores:monarchinitiative	infores:flybase
infores:monarchinitiative	infores:gdb
infores:monarchinitiative	infores:go-central
infores:monarchinitiative	infores:go-noctua
infores:monarchinitiative	infores:goc
infores:monarchinitiative	infores:goc-owl
infores:monarchinitiative	infores:hgnc
infores:monarchinitiative	infores:hgnc-ucl
infores:monarchinitiative	infores:hpa
infores:monarchinitiative	infores:hpo-annotations
infores:monarchinitiative	infores:intact
infores:monarchinitiative	infores:interpro
infores:monarchinitiative	infores:lifedb
infores:monarchinitiative	infores:mgi
infores:monarchinitiative	infores:mtbbase
infores:monarchinitiative	infores:ntnu-sb
infores:monarchinitiative	infores:orphanet
infores:monarchinitiative	infores:panther
infores:monarchinitiative	infores:parkinsonsuk-ucl
infores:monarchinitiative	infores:phi-base
infores:monarchinitiative	infores:pinc
infores:monarchinitiative	infores:pombase
infores:monarchinitiative	infores:reactome
infores:monarchinitiative	infores:rgd
infores:monarchinitiative	infores:rhea
infores:monarchinitiative	infores:rnacentral
infores:monarchinitiative	infores:roslin-institute
infores:monarchinitiative	infores:sgd
infores:monarchinitiative	infores:string
infores:monarchinitiative	infores:syngo
infores:monarchinitiative	infores:syngo-ucl
infores:monarchinitiative	infores:syscilia-ccnet
infores:monarchinitiative	infores:uniprot
infores:monarchinitiative	infores:wb
infores:monarchinitiative	infores:xenbase
infores:monarchinitiative	infores:yubiolab
infores:monarchinitiative	infores:zfin
infores:monarchinitiative, infores:alliancegenome	infores:flybase
infores:monarchinitiative, infores:alliancegenome	infores:mgi
infores:monarchinitiative, infores:alliancegenome	infores:rgd
infores:monarchinitiative, infores:alliancegenome	infores:sgd
infores:monarchinitiative, infores:alliancegenome	infores:wormbase
infores:monarchinitiative, infores:alliancegenome	infores:zfin
infores:monarchinitiative, infores:medgen	infores:omim
infores:phenio	infores:HsapDv
infores:phenio	infores:bfo
infores:phenio	infores:chebi
infores:phenio	infores:cl
infores:phenio	infores:eco
infores:phenio	infores:emapa
infores:phenio	infores:envo
infores:phenio	infores:fao
infores:phenio	infores:fbbt
infores:phenio	infores:fma
infores:phenio	infores:fypo
infores:phenio	infores:go
infores:phenio	infores:hp
infores:phenio	infores:iao
infores:phenio	infores:ma
infores:phenio	infores:mondo
infores:phenio	infores:mp
infores:phenio	infores:mpath
infores:phenio	infores:nbo
infores:phenio	infores:ncbitaxon
infores:phenio	infores:obi
infores:phenio	infores:ogms
infores:phenio	infores:pato
infores:phenio	infores:po
infores:phenio	infores:pr
infores:phenio	infores:ro
infores:phenio	infores:so
infores:phenio	infores:uberon
infores:phenio	infores:upheno
infores:phenio	infores:wbbt
infores:phenio	infores:wbphenotype
infores:phenio	infores:xpo
infores:phenio	infores:zfa
infores:phenio	infores:zp

Colleen Xu · Answer 12 · Wed Feb 21 2024 15:54:08 GMT+0800 (China Standard Time)

@tokebe

This is now ready for deployment!

I've tested that our ingest/post-processing of provenance from the API is working for all operations
made some recent adjustments today (2/20) based on the API updates I saw (using subject/object namespace parameters, adding gene <-> anatomy operations). Retested and all is working locally.

PRs for push to Prod:

override to SmartAPI yaml w/ x-bte annot: biothings/bte-server#16
post-processing for v3 Monarch API biothings/api-respone-transform.js#63

Once these are fully deployed to Prod, we can update the registered yaml (PR) and start the process of removing the override...

Colleen Xu · Answer 13 · Wed Feb 21 2024 16:04:56 GMT+0800 (China Standard Time)

Notes

Because we are ingesting the provenance info from the external API's responses, we aren't certain of the infores values that will be in the response. This may make it tricky to ensure the infores entries/xref wiki pages are always set up. It also complicates any effort to get allowlist/denylist working
we don't have a list of the possible MetaEdges (combos of subject category/subject namespace/predicate/object category/object namespace)

Stuff to follow up on

Short-term

EDIT, DONE: Sierra and Kevin confirmed 2/28 that it's fine to change infores, and we could deprecate biolink-api infores...

double-check on the use of infores:monarchinitiative (switching to this for info.x-translator.infores) vs infores:biolink-api (what we were using before).
- may involve adjusting the xref wiki page, to make more obvious that we are using their non-TRAPI API?
- BTE is still using this field to set the upstream resource ID for the bte/service-provider source element...
- We made the switch because when using biolink-api, BTE doesn't generate a source-object for it and the provenance chain would be wonky (probably because of the post-processing to instead use the API response provenance info).

Example of wonky behavior

The edge source info would look like this:

service-provider-trapi says biolink-api is upstream of it
but there's no entry for biolink-api (and...then monarchinitiative should be upstream?)
then there are entries for monarchinitiative and its upstream sources (which include the primary). These come from post-processing the raw API response.

                    "sources": [
                        {
                            "resource_id": "infores:hpo-annotations",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:hpo-annotations"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:biolink-api"
                            ]
                        }
                    ]
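
The gap in the sources above can be detected mechanically: any `upstream_resource_ids` value that has no source entry of its own marks a broken link in the provenance chain. A minimal sketch, where `missing_upstream_entries` is a hypothetical helper and not part of BTE:

```python
from typing import Dict, List, Set


def missing_upstream_entries(sources: List[Dict]) -> Set[str]:
    """Return upstream_resource_ids that have no source entry of their own."""
    present = {s["resource_id"] for s in sources}
    referenced = {
        upstream
        for s in sources
        for upstream in s.get("upstream_resource_ids", [])
    }
    return referenced - present


# The sources list from the example above.
wonky_sources = [
    {"resource_id": "infores:hpo-annotations",
     "resource_role": "primary_knowledge_source"},
    {"resource_id": "infores:monarchinitiative",
     "resource_role": "aggregator_knowledge_source",
     "upstream_resource_ids": ["infores:hpo-annotations"]},
    {"resource_id": "infores:service-provider-trapi",
     "resource_role": "aggregator_knowledge_source",
     "upstream_resource_ids": ["infores:biolink-api"]},
]

print(missing_upstream_entries(wonky_sources))  # {'infores:biolink-api'}
```

An empty set would mean every upstream reference resolves to an entry in the same list.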


Longer-term?

EDIT: moving to separate issues

ability to do "scrolling" GET queries to get all the items (involving the offset parameter and total field in the response). Currently, we only get 500 items per input ID/query (noted in previous post "Querying the v3 API")
investigate the new query options? subject/object category, taxon, namespace: example
- look at what adds coverage, is good to have as separate operations (namespace, species context?). For example, is there cell-level/organelle-level gene-expression info?
annotating more MetaEdges (not covered by past operations)

MetaEdges not covered by past operations:

Chem to Pathway: unclear how helpful this is, since chemicals seem generic (water, ADP, ATP...). Example. 1 Predicate: participates_in
- their prefix Reactome differs from what we use (REACT)...so this may require extra post-processing support
  (depends on how helpful setting the subject/object namespace is)
- unclear if other Pathway namespaces exist
Gene to Pathway: previously chose not to annotate because MyGene also covers this info. Also has prefix issue (see Chem to Pathway above). 1 Predicate: participates_in
Gene to GO BiologicalProcess (989349 items): previously chose not to annotate because MyGene also covers this info. Each kind has multiple possible predicates, lots of diff primary knowledge sources
- actively_involved_in (797927)
- acts_upstream_of_or_within (180729)
- acts_upstream_of (9327)
- acts_upstream_of_or_within_positive_effect (507)
- acts_upstream_of_positive_effect (506)
- acts_upstream_of_or_within_negative_effect (178)
- acts_upstream_of_negative_effect (175)
Gene to GO MolecularActivity (848151 items): see notes for BiologicalProcess above
- enables (841330)
- contributes_to (6821)
Gene to GO CellularComponent (745837 items): see notes for BiologicalProcess above
- located_in (502225)
- active_in (145515)
- part_of (94049)
- colocalizes_with (4048)
Gene to Gene ortholog: previously chose not to annotate because MyGene also covers this info. 1 predicate (orthologous_to, 551383 hits). Seems to be 1 primary knowledge source (panther)
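
For the "scrolling" GET queries listed under Longer-term, the loop could look roughly like the sketch below. It assumes each response carries a `total` field (mentioned above) plus an `items` list, and that the request accepts `offset` and `limit` parameters; `items` and `limit` are assumed names. The `fetch_page` callable is a hypothetical stand-in for whatever actually issues the GET request.

```python
from typing import Callable, Dict, Iterator


def iter_all_items(
    fetch_page: Callable[[int, int], Dict],
    limit: int = 500,
) -> Iterator[Dict]:
    """Yield every item by advancing `offset` until `total` is reached.

    fetch_page(offset, limit) must return one parsed response page.
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        items = page.get("items", [])
        yield from items
        offset += len(items)
        # Stop on an empty page too, so a wrong `total` can't cause an endless loop.
        if not items or offset >= page.get("total", 0):
            break
```

Passing the fetcher in as a callable keeps the paging logic separate from the HTTP client, so it can be exercised with a fake page source before being pointed at the real API.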

Colleen Xu · Answer 14 · Wed Feb 28 2024 09:10:54 GMT+0800 (China Standard Time)

I've confirmed that the changes have been deployed to BTE Prod. So I've:

merged NCATS-Tangerine/translator-api-registry#140 and updated the registration, so now the SmartAPI registry is using the v3 Monarch yaml.
updated name of API in API_LIST biothings/bte-server@067b820: from "Biolink API" -> "Monarch API" to reflect the SmartAPI yaml update

How I tested

We can tell that BTE is using the new v3 Monarch API by doing a test query for the gene-disease-contributesTo operation - which didn't exist in the old API. If we have edges with the contributes_to predicate and the enhanced sources info (omim <- medgen <- monarchinitiative <- service provider), then we know that BTE is using the new SmartAPI yaml and api-response-transform code.

POST to Monarch-API-only, thru BTE: https://bte.transltr.io/v1/smartapi/d22b657426375a5295e7da8a303b9893/query


{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Gene"],
                    "ids": ["HGNC:6294", "HGNC:9652"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:contributes_to"]
                }
            }
        }
    }
}
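
For repeat testing, the same query can be sent from a script. This is a minimal sketch using only the Python standard library; the function names (`build_query`, `build_request`, `run_query`) and the 300-second timeout are assumptions, while the URL and query body are the ones shown above.

```python
import json
import urllib.request
from typing import Dict, List

BTE_MONARCH_URL = (
    "https://bte.transltr.io/v1/smartapi/"
    "d22b657426375a5295e7da8a303b9893/query"
)


def build_query(gene_ids: List[str]) -> Dict:
    """Build the Gene -> Disease contributes_to query graph shown above."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"categories": ["biolink:Gene"], "ids": gene_ids},
                    "n1": {"categories": ["biolink:Disease"]},
                },
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": ["biolink:contributes_to"],
                    }
                },
            }
        }
    }


def build_request(query: Dict, url: str = BTE_MONARCH_URL) -> urllib.request.Request:
    """Wrap the query in a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: Dict, url: str = BTE_MONARCH_URL, timeout: float = 300) -> Dict:
    """Send the query and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(query, url), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `run_query(build_query(["HGNC:6294", "HGNC:9652"]))` would issue the same request as the POST above.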

We should get this edge in the response, showing the contributes_to predicate and the enhanced sources info (omim <- medgen <- monarchinitiative <- service provider):

                "1ff8a4f5ade3639ebd6b951ac8984627": {
                    "predicate": "biolink:contributes_to",
                    "subject": "NCBIGene:3784",
                    "object": "MONDO:0100316",
                    "attributes": [],
                    "sources": [
                        {
                            "resource_id": "infores:omim",
                            "resource_role": "primary_knowledge_source"
                        },
                        {
                            "resource_id": "infores:medgen",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:omim"
                            ]
                        },
                        {
                            "resource_id": "infores:monarchinitiative",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:medgen"
                            ]
                        },
                        {
                            "resource_id": "infores:service-provider-trapi",
                            "resource_role": "aggregator_knowledge_source",
                            "upstream_resource_ids": [
                                "infores:monarchinitiative"
                            ]
                        }
                    ]
                }
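
The check described above (contributes_to predicate plus the omim <- medgen <- monarchinitiative <- service provider chain) can be automated for a single edge object. A rough sketch: `source_chain` and `uses_v3_monarch` are hypothetical helpers, and the chain walk assumes each resource has at most one downstream aggregator, as in this example.

```python
from typing import Dict, List

EXPECTED_CHAIN = [
    "infores:omim",
    "infores:medgen",
    "infores:monarchinitiative",
    "infores:service-provider-trapi",
]


def source_chain(sources: List[Dict]) -> List[str]:
    """Order resource_ids from the primary source to the last aggregator."""
    # Map each upstream resource to the resource that cites it.
    downstream_of = {
        upstream: s["resource_id"]
        for s in sources
        for upstream in s.get("upstream_resource_ids", [])
    }
    primary = next(
        (s["resource_id"] for s in sources
         if s.get("resource_role") == "primary_knowledge_source"),
        None,
    )
    if primary is None:
        return []
    chain = [primary]
    # Follow the links; the membership check guards against cycles.
    while chain[-1] in downstream_of and downstream_of[chain[-1]] not in chain:
        chain.append(downstream_of[chain[-1]])
    return chain


def uses_v3_monarch(edge: Dict) -> bool:
    """True if the edge has the predicate and source chain expected from v3."""
    return (
        edge.get("predicate") == "biolink:contributes_to"
        and source_chain(edge.get("sources", [])) == EXPECTED_CHAIN
    )
```

In a TRAPI response the edges to check live under `message.knowledge_graph.edges`; the function takes one edge object from that mapping.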

BUT before closing this, I'd like to discuss "stuff to follow up on" with Jackson @tokebe first...(open new issues?)

Colleen Xu · Answer 15 · Thu Feb 29 2024 09:27:48 GMT+0800 (China Standard Time)

Discussed the "stuff to follow up on" with Jackson and Sierra/Kevin (see edited post). I'll open new issues, but we're ready to close this one