biothings / biothings_explorer

TRAPI service for BioThings Explorer

Home Page: https://api.bte.ncats.io

adding knowledge_level / agent_type (KL/AT) edge-attributes to all edges (Spring 2024 Translator feature)

colleenXu opened this issue

The Translator consortium wants knowledge_level / agent_type (KL/AT) edge-attributes added to all edges.

The format for the edge-attributes is something like this:

[
   {
      "attribute_type_id": "biolink:knowledge_level",
      "value": "knowledge_assertion"
   },
   {
      "attribute_type_id": "biolink:agent_type",
      "value": "manual_agent"
   }
]
My interpretation of what KL/AT is and what the terms mean:

knowledge_level: in general terms, "where / how" was this knowledge generated?

  • knowledge_assertion: asserted to be true. Google doc says this is the default, since most statements curated from literature / from authoritative knowledgebases count as this
  • logical_entailment: from logic (related to ontologies)
  • prediction: more speculative "hypotheses or possible facts". The Google doc says creative-mode overarching edges count, as well as "predictions" from any KP
  • statistical_association: using association/correlation predicates, from KPs working with EHR/omics data
  • observation: "we report this is happening" (adverse event / clinical trials)
  • not_provided: can't tell what to pick. Use for text-mined edges, since they aren't picking up those nuances

agent_type: in general terms, "who / what" generated or asserted the knowledge represented on the edge?

  • manual_agent: human decided, made the assertion
  • manual_validation_of_automated_agent: human reviewed/validated what an automated agent generated (very subtle distinction, not clear if we'll use it)
  • automated_agent: software-generated, human didn't decide/review the specific assertion. Can use this term directly, or one of its more-specific children
    • data_analysis_pipeline: statistical association/correlation, using association/correlation predicates (not using rules/inference to say anything bigger/stronger about the relationship)
    • computational_model: using rules/inference to say anything bigger/stronger about the relationship, or some kind of machine learning
    • text_mining_agent: used NLP to get the entities/relationship-type (ID, node category, edge predicate)
    • image_processing_agent: from images (like PFOCR)
  • not_provided: can't tell what to pick

Documentation:

What needs implementing

Our end

  1. Add knowledge_level and agent_type fields to the x-bte annotation for Service-Provider-only APIs ➡️ transform those into TRAPI edge-attributes. We can coordinate this between me and another dev (probably Jackson @tokebe)
  2. Add edge-attributes for the edges our tool generates (3 kinds?):
    • for subclass_of: we get these from ontologies/vocabs - both service-provider and BTE return these kinds of edges.
    • for the "inferred" edges built from the subclass_of + KP edge: knowledge_level = logical_entailment, agent_type = automated_agent (according to Matt Brush, Translator Slack link). Both service-provider and BTE return these kinds of edges.
    • for the creative-mode "inferred" edge made from a template: knowledge_level = prediction, agent_type = computational_model. Only BTE returns these kinds of edges. (A sketch of the resulting edge-attributes follows this list.)
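
For reference, a sketch of the edge-attributes BTE could attach for the "inferred from subclass_of + KP edge" case, reusing the attribute format from the example at the top of this issue (the exact placement on the TRAPI edge isn't shown here):

[
   {
      "attribute_type_id": "biolink:knowledge_level",
      "value": "logical_entailment"
   },
   {
      "attribute_type_id": "biolink:agent_type",
      "value": "automated_agent"
   }
]

The creative-mode template edges would look the same, with prediction / computational_model as the values.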

For Text-Mining / Multiomics, two possible options:

  • they update their parsers/we help deploy their API contents so their edges have these edge-attributes. x-bte annotation/BTE is already set up to ingest these automatically.
    • However, there will probably be a staggered deployment through the ITRB instances. Then we can try adding the instance/maturity-specific server urls to their SmartAPI yamls (ex: Text-Mining Targeted), and double-check that this works (that the expected maturities are using the updated APIs). (See the sketch after this list.)
  • they use the x-bte annotation additions like ours (point 1 above). However, we'd want to check that this works without issue alongside BTE's existing ingest of already-TRAPI-formatted edge-attributes.
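
For context, a hypothetical sketch of an instance-specific server entry in a SmartAPI yaml, shown here as JSON; the url is made up, and I'm assuming the Translator x-maturity server extension with "staging" corresponding to CI:

{
   "url": "https://example-kp.ci.transltr.io",
   "description": "ITRB CI instance",
   "x-maturity": "staging"
}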

For TRAPI KPs, we ingest their edge-attributes (so we leave it to them to implement KL/AT on their edges).


Notes:

(1) There seems to be a hierarchy to the values (see automated_agent). We want to keep this in mind if we ever want to query these as QEdge.attribute_constraints (would we traverse this hierarchy?). We last discussed these kinds of constraints in #482 (comment), but the hierarchy of terms only applied to qualifier stuff.
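
For reference, a sketch of what such a constraint could look like on a QEdge, assuming the standard TRAPI AttributeConstraint shape (the values here are just for illustration, and this doesn't address the hierarchy-traversal question):

"attribute_constraints": [
   {
      "id": "biolink:knowledge_level",
      "name": "knowledge level",
      "operator": "==",
      "value": "knowledge_assertion"
   }
]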

(2) To team: let's not include attribute_source fields in these edge-attributes (they existed in the examples). As confirmed by Matt Brush (Translator Slack link), these are optional fields with the infores ID of "who assigned the KL/AT terms". (A sketch of the field we'd be omitting is below.)

I think it's a little complicated to implement (notes below):

  • what about the subclass-related edges, which show up in service-provider-team endpoint responses and BTE responses?
  • service-provider-trapi (for Service-Provider-only KP edges)
  • biothings-explorer (for edges built from templates for creative-mode)
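
For clarity, this is the kind of field we'd be leaving off (the infores value is just a placeholder):

{
   "attribute_type_id": "biolink:knowledge_level",
   "value": "knowledge_assertion",
   "attribute_source": "infores:biothings-explorer"
}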

(3) Matt Brush said subclass edges from CL, UBERON would also have agent_type=manual_agent (Translator Slack link). We don't support these yet.

Text-Mining / Multiomics KP situation

With Everaldo/Chunlei (Service Provider side), the CI instances of the following BioThings APIs will be updated with KL/AT edge-attributes

(no news yet: Drug response and text-mining targeted)

We'll watch to see if this works as-expected (aka BTE ingests and displays these edge-attributes).

Notes on UMLS "subclass" relationships

The node-expansion module appears to be using a parsed version of the Metathesaurus MRREL.RRF file. However, it's not clear to me how the file was parsed. There are 2 kinds of relationships that I think would have been used:

  • parent/child (REL = PAR/CHD)
  • broader/narrower (REL = RB/RN)

My notes:

  • MRREL contains immediate parent/child relationships (ref: reference manual 2.4.1.1)
  • MRREL has a REL field that can be parent/child, broader/narrower (ref: reference manual 2.4.2, REL abbreviations table on this page)
  • parent/child comes from the source vocab, vs. broader/narrower is added by UMLS editors (humans?) (ref: REL abbreviations table on this page, 2nd page of this paper)
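
For reference, my understanding of the MRREL.RRF layout (pipe-delimited, one relationship per row); the REL field is the 4th one, so PAR/CHD vs RB/RN rows could be told apart there:

CUI1|AUI1|STYPE1|REL|CUI2|AUI2|STYPE2|RELA|RUI|SRUI|SAB|SL|RG|DIR|SUPPRESS|CVF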

Problems?

Note: Monarch API plans to add KL/AT fields (monarch-initiative/monarch-app#675). If we want to use these, we'd need to adjust our custom post-processing of their responses (as a separate but related issue).

@tokebe

I've added knowledge_level and agent_type fields to all the x-bte annotations that need it. And just in case, I think we should add these two edge-attributes to our edge-hash, since we don't want merges to create edges that have multiple KL/AT values (see the illustration below).

  • Service-Provider-only stuff (not Text-Mining/Multiomics)
  • only edited the yamls that dev/ci are going to use
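
A minimal illustration of the merge concern, using a hypothetical edge: if the edge-hash ignores these attributes, two records with the same subject/predicate/object but different KL/AT would collapse into one edge with conflicting values.

{
   "subject": "NCBIGene:1017",
   "predicate": "biolink:related_to",
   "object": "MONDO:0005148",
   "attributes": [
      {"attribute_type_id": "biolink:knowledge_level", "value": "knowledge_assertion"},
      {"attribute_type_id": "biolink:agent_type", "value": "manual_agent"}
   ]
}

A second record identical except with prediction / computational_model in the attributes should stay a separate edge, which hashing over the KL/AT values would ensure.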

And my notes on the curation process:

I've annotated yamls that are still in-progress:

(1) There may be typos in the field names or values, because I added these manually without any automated validation to help >.<. I already fixed one typo (knowledge_type -> knowledge_level).

(2) There were many cases where I wasn't sure what terms to pick:

  • Usually the problem is figuring out how the knowledge was generated, the level of human involvement, or what term to pick (especially when there's an automated pipeline/aggregation of sources/multiple methods involved).
  • Andrew's advice: if you can't figure it out in a few min, just pick not_provided

Trouble assigning both values

AGR disease-gene associations: I'm picking these based on my guesses of what's going on…

  • if it wasn't "via orthology": using knowledge_assertion / manual_agent.
  • if it's "via orthology": using logical_entailment / manual_validation_of_automated_agent

DISEASES:

  • knowledge_level: I picked not_provided. Right now, it's a mix because we don't separate by evidence value
    • text-mined -> not_provided
    • experiments -> statistical_association
    • knowledge -> knowledge_assertion
  • agent_type: I picked automated_agent since I assumed there's an automated pipeline for processing all the sources, regardless of evidence type. But the papers aren't super clear on this (2022, 2015).

MGIgene2pheno: I'm picking knowledge_assertion / manual_agent based on my guesses of what's going on. I've skimmed this FAQ

MyChem:

  • aeolus: picked observation/manual_agent. seems like humans originally made the reports, but an automated pipeline was used to assign IDs.
  • chembl: seems to be manual curation, so I picked knowledge_assertion / manual_agent. But it's a lot of reading to understand exactly where the data is coming from (paper linked by recent update article)
  • drugcentral: using this paper as reference
    • bioactivity: seems to be a mix of manual curation and automatic ingest from other resources ("Current data" -> "Bioactivities and drug targets" section)
    • contraindications, drug use, off-label: manually curated according to first paragraph of intro.
    • adverse events: from faers. Same issue as aeolus, so I picked the same values.
  • fda-orphan-drug-db: picked observation / not_provided. Since it's a database of applications for designations/approvals…

MyDisease:

  • what to do for disgenet (paper, website):
    • knowledge_level: I picked not_provided. Right now, it's a mix because we don't separate by underlying source
    • agent_type: I picked automated_agent since I assumed there's an automated pipeline for processing/integrating all the sources
  • what to do for disease-pheno from hpo-annotations: I'm picking knowledge_assertion / manual_agent based on assumptions. But in the evidence part of "phenotype.hpoa format", it's implied that some info comes from parsing the omim data and I'm not sure how that affects this.

MyGene:

  • what to do for ConsensusPathDB/cpdb (paper, website) - aggregator:
    • knowledge_level: I picked knowledge_assertion. But I don't know - does it depend on what cpdb is doing or what the underlying sources are doing (KEGG, wikipathways, biocarta)?
    • agent_type: I picked automated_agent since I assume cpdb is using an automated pipeline to process/integrate all the sources
  • ncbi-gene: same issues as cpdb, it's an aggregator. Picked same knowledge_level / agent_type as above
  • panther (orthologs): picked knowledge_assertion / computational_model. Paper Figure 4 seems to show that an automated pipeline creates the orthologs, without much manual curation.

repoDB:

  • approved drug indications basically downloaded from drugcentral/drugbank. So I picked knowledge_assertion / automated_agent but maybe another term based on drugcentral/drugbank methods would be better?
  • non-approved drug info from data parsing/cleaning clinicaltrials.gov data. So I picked observation / automated_agent

Issues assigning agent_type

Picked not_provided:

  • foodb: can't find any info on their process. No publication, website says "(obtained from literature)". I can find cases where the food component content is from a different database (phenol explorer)
  • fooddata central: can't tell if their process involves human/manual effort vs automated effort. Seems to report experimental data. Ref: data sources, FAQ
  • hpo gene-to-pheno: can't find any info on their process. Info on webpage's "genes_to_phenotype.txt format" section is vague.
  • monarch: not sure what to pick - depends on underlying primary source? And they may add their own KL/AT assignments
  • pharmgkb: when the relationship wasn't listed here as manually-curated; in those cases I couldn't tell how the assertion was made

Unsure:

  • bindingdb: could count as manual_agent (website shows ~half the data is "curated")? But I picked manual_validation_of_automated_agent based on this line in the "Data Collection" section:

Data imported from other databases, such as PubChem and ChEMBL, are automatically checked for completeness and certain easily detected errors, and any data flagged by these procedures are reviewed manually and corrected if needed.

  • dgidb: seems to use automated pipeline to ingest many resources (ref: 2021 paper, VS 2024 paper is more vague). So I picked automated_agent...
  • ebi-proteins uniprot-to-rhea: I'm assuming we are primarily using Swiss-Prot entries, which are human-curated (ref). But TrEMBL would be automated_agent...
  • iPTMnet: some info seems to be text-mined, vs imported from curated databases (ref: paper materials and methods). So I put automated_agent
  • pharmgkb: assuming manual_agent. But it's unclear what info in pharmgkb isn't manually curated. There is a list of what is
  • rampDB: I put automated_agent. Currently only looking at pathway info, which seems to come from an automated pipeline importing from multiple resources: HMDB, KEGG, WikiPathways, Reactome. Plus some manual curation for chemical/metabolite ID mappings.

@tokebe

I've updated the posts above since all the x-bte annotation work is done.

The rest of step 1 (ingesting/formatting the x-bte fields) + step 2 are yours?

@tokebe

For the KL/AT edge-attributes from x-bte annotation...

  • the edge-attribute types are missing the biolink prefixes ("biolink:knowledge_level", "biolink:agent_type")
  • the values are 1-element arrays, when we want them to be strings (illustrated below).
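
To illustrate the two issues (hypothetical values), the attributes currently come out like:

{
   "attribute_type_id": "knowledge_level",
   "value": ["knowledge_assertion"]
}

when we want:

{
   "attribute_type_id": "biolink:knowledge_level",
   "value": "knowledge_assertion"
}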

The format for the constructed edges looks correct/good. I saw examples of all 3 cases.

(Based on a quick review only)

Latest commits should fix these.

Related #715 could be done after this issue is reasonably done.

Update on Monarch (earlier comment in this issue):

I've updated the KL/AT assignments for Monarch API operations, using the info provided in monarch-initiative/monarch-app#675 (comment). So we're good for now!

The code was deployed today to Prod as part of the Octopus release. I tested and it's live.

I'm closing this issue because our side of the work is done. However, note that Text-Mining/Multiomics haven't updated their BioThings APIs for all instances to provide KL/AT edge-attributes yet (I was keeping notes in a comment here).