biothings / biothings_explorer

TRAPI service for BioThings Explorer

Home Page: https://api.bte.ncats.io

adding knowledge_level / agent_type (KL/AT) edge-attributes to all edges (Spring 2024 Translator feature)

colleenXu opened this issue

The Translator consortium wants knowledge_level / agent_type (KL/AT) edge-attributes added to all edges.

The format for the edge-attributes is something like this:

[
   {
      "attribute_type_id": "biolink:knowledge_level",
      "value": "knowledge_assertion"
   },
   {
      "attribute_type_id": "biolink:agent_type",
      "value": "manual_agent"
   }
]
My interpretation of what KL/AT is and what the terms mean:

knowledge_level: in general terms, "where / how" was this knowledge generated?

  • knowledge_assertion: asserted to be true. Google doc says this is the default, since most statements curated from literature / from authoritative knowledgebases count as this
  • logical_entailment: from logic (related to ontologies)
  • prediction: more speculative "hypotheses or possible facts". The Google doc says creative-mode overarching edges count, as well as "predictions" from any KP
  • statistical_association: using association/correlation predicates, from KPs working with EHR/omics data
  • observation: "we report this is happening" (adverse event / clinical trials)
  • not_provided: can't tell what to pick. Use for text-mined edges, since they aren't picking up those nuances

agent_type: in general terms, "who / what" generated or asserted the knowledge represented on the edge?

  • manual_agent: human decided, made the assertion
  • manual_validation_of_automated_agent: human reviewed/validated what an automated agent generated (very subtle distinction, not clear if we'll use it)
  • automated_agent: software-generated, human didn't decide/review the specific assertion. Can use this term directly, or one of its more-specific children
    • data_analysis_pipeline: statistical association/correlation, using association/correlation predicates (not using rules/inference to say anything bigger/stronger about the relationship)
    • computational_model: using rules/inference to say anything bigger/stronger about the relationship, or some kind of machine learning
    • text_mining_agent: used NLP to get the entities/relationship-type (ID, node category, edge predicate)
    • image_processing_agent: from images (like PFOCR)
  • not_provided: can't tell what to pick

Documentation:

What needs implementing

Our end

  1. Add knowledge_level and agent_type fields to the x-bte annotation for Service-Provider-only APIs ➡️ transform those into TRAPI edge-attributes. We can coordinate this between me and another dev (probably Jackson @tokebe)
  2. Add edge-attributes for the edges our tool generates (3 kinds?):
    • for subclass_of: we get these from ontologies/vocabs - both service-provider and BTE return these kinds of edges.
    • for the "inferred" edges built from the subclass_of + KP edge: knowledge_level = logical_entailment, agent_type = automated_agent (according to Matt Brush, Translator Slack link). Both service-provider and BTE return these kinds of edges.
    • for the creative-mode "inferred" edge made from a template: knowledge_level = prediction, agent_type = computational_model. Only BTE returns these kinds of edges. (A sketch of the resulting edge-attributes follows this list.)
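
For reference, a sketch of the edge-attributes BTE could attach for the "inferred from subclass_of + KP edge" case, reusing the attribute format from the example at the top of this issue (the exact placement on the TRAPI edge isn't shown here):

[
   {
      "attribute_type_id": "biolink:knowledge_level",
      "value": "logical_entailment"
   },
   {
      "attribute_type_id": "biolink:agent_type",
      "value": "automated_agent"
   }
]

The creative-mode template edges would look the same, with prediction / computational_model as the values.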

For Text-Mining / Multiomics, two possible options:

  • they update their parsers/we help deploy their API contents so their edges have these edge-attributes. x-bte annotation/BTE is already set up to ingest these automatically.
    • However, there will probably be a staggered deployment through the ITRB instances. Then we can try adding the instance/maturity-specific server urls to their SmartAPI yamls (ex: Text-Mining Targeted), and double-check that this works (that the expected maturities are using the updated APIs). (See the sketch after this list.)
  • they use the x-bte annotation additions like ours (point 1 above). However, we'd want to check that this works without issue alongside BTE's existing ingest of already-TRAPI-formatted edge-attributes.
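
For context, a hypothetical sketch of an instance-specific server entry in a SmartAPI yaml, shown here as JSON; the url is made up, and I'm assuming the Translator x-maturity server extension with "staging" corresponding to CI:

{
   "url": "https://example-kp.ci.transltr.io",
   "description": "ITRB CI instance",
   "x-maturity": "staging"
}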

For TRAPI KPs, we ingest their edge-attributes (so we leave it to them to implement KL/AT on their edges).


Notes:

(1) There seems to be a hierarchy to the values (see automated_agent). We want to keep this in mind if we ever want to query these as QEdge.attribute_constraints (would we traverse this hierarchy?). We last discussed these kinds of constraints in #482 (comment), but the hierarchy of terms only applied to qualifier stuff.
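
For reference, a sketch of what such a constraint could look like on a QEdge, assuming the standard TRAPI AttributeConstraint shape (the values here are just for illustration, and this doesn't address the hierarchy-traversal question):

"attribute_constraints": [
   {
      "id": "biolink:knowledge_level",
      "name": "knowledge level",
      "operator": "==",
      "value": "knowledge_assertion"
   }
]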

(2) To team: let's not include attribute_source fields in these edge-attributes (they existed in the examples). As confirmed by Matt Brush (Translator Slack link), these are optional fields with the infores ID of "who assigned the KL/AT terms". (A sketch of the field we'd be omitting is below.)

I think it's a little complicated to implement (notes below):

  • what about the subclass-related edges, which show up in service-provider-team endpoint responses and BTE responses?
  • service-provider-trapi (for Service-Provider-only KP edges)
  • biothings-explorer (for edges built from templates for creative-mode)
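
For clarity, this is the kind of field we'd be leaving off (the infores value is just a placeholder):

{
   "attribute_type_id": "biolink:knowledge_level",
   "value": "knowledge_assertion",
   "attribute_source": "infores:biothings-explorer"
}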

(3) Matt Brush said subclass edges from CL, UBERON would also have agent_type=manual_agent (Translator Slack link). We don't support these yet.

Text-Mining / Multiomics KP situation

With Everaldo/Chunlei (Service Provider side), the CI instances of the following BioThings APIs will be updated with KL/AT edge-attributes

(no news yet: Drug response and text-mining targeted)

We'll watch to see if this works as-expected (aka BTE ingests and displays these edge-attributes).

Notes on UMLS "subclass" relationships

The node-expansion module appears to be using a parsed version of the Metathesaurus MRREL.RRF file. However, it's not clear to me how the file was parsed. There are 2 kinds of relationships that I think would have been used:

  • parent/child (REL = PAR/CHD)
  • broader/narrower (REL = RB/RN)

My notes:

  • MRREL contains immediate parent/child relationships (ref: reference manual 2.4.1.1)
  • MRREL has a REL field that can be parent/child, broader/narrower (ref: reference manual 2.4.2, REL abbreviations table on this page)
  • parent/child comes from the source vocab, vs. broader/narrower is added by UMLS editors (humans?) (ref: REL abbreviations table on this page, 2nd page of this paper)
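
For reference, my understanding of the MRREL.RRF layout (pipe-delimited, one relationship per row); the REL field is the 4th one, so PAR/CHD vs RB/RN rows could be told apart there:

CUI1|AUI1|STYPE1|REL|CUI2|AUI2|STYPE2|RELA|RUI|SRUI|SAB|SL|RG|DIR|SUPPRESS|CVF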

Problems?

Note: Monarch API plans to add KL/AT fields (monarch-initiative/monarch-app#675). If we want to use these, we'd need to adjust our custom post-processing of their responses (as a separate but related issue).

@tokebe

I've added knowledge_level and agent_type fields to all the x-bte annotations that need it. And just in case, I think we should add these two edge-attributes to our edge-hash, since we don't want merges to create edges that have multiple KL/AT values (see the illustration below).

  • Service-Provider-only stuff (not Text-Mining/Multiomics)
  • only edited the yamls that dev/ci are going to use
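
A minimal illustration of the merge concern, using a hypothetical edge: if the edge-hash ignores these attributes, two records with the same subject/predicate/object but different KL/AT would collapse into one edge with conflicting values.

{
   "subject": "NCBIGene:1017",
   "predicate": "biolink:related_to",
   "object": "MONDO:0005148",
   "attributes": [
      {"attribute_type_id": "biolink:knowledge_level", "value": "knowledge_assertion"},
      {"attribute_type_id": "biolink:agent_type", "value": "manual_agent"}
   ]
}

A second record identical except with prediction / computational_model in the attributes should stay a separate edge, which hashing over the KL/AT values would ensure.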

And my notes on the curation process:

I've annotated yamls that are still in-progress:

(1) There may be typos in the field names or values, because I added these manually without any automated validation to help >.<. I already fixed one typo (knowledge_type -> knowledge_level).

(2) There were many cases where I wasn't sure what terms to pick:

  • Usually the problem is figuring out how the knowledge was generated, the level of human involvement, or what term to pick (especially when there's an automated pipeline/aggregation of sources/multiple methods involved).
  • Andrew's advice: if you can't figure it out in a few min, just pick not_provided

Trouble assigning both values

AGR disease-gene associations: I'm picking these based on my guesses of what's going on…

  • if it wasn't "via orthology": using knowledge_assertion / manual_agent.
  • if it's "via orthology": using logical_entailment / manual_validation_of_automated_agent

DISEASES:

  • knowledge_level: I picked not_provided. Right now, it's a mix because we don't separate by evidence value
    • text-mined -> not_provided
    • experiments -> statistical_association
    • knowledge -> knowledge_assertion
  • agent_type: I picked automated_agent since I assumed there's an automated pipeline for processing all the sources, regardless of evidence type. But the papers aren't super clear on this (2022, 2015).

MGIgene2pheno: I'm picking knowledge_assertion / manual_agent based on my guesses of what's going on. I've skimmed this FAQ

MyChem:

  • aeolus: picked observation/manual_agent. seems like humans originally made the reports, but an automated pipeline was used to assign IDs.
  • chembl: seems to be manual curation, so I picked knowledge_assertion / manual_agent. But it's a lot of reading to understand exactly where the data is coming from (paper linked by recent update article)
  • drugcentral: using this paper as reference
    • bioactivity: seems to be a mix of manual curation and automatic ingest from other resources ("Current data" -> "Bioactivities and drug targets" section)
    • contraindications, drug use, off-label: manually curated according to first paragraph of intro.
    • adverse events: from faers. Same issue as aeolus, so I picked the same values.
  • fda-orphan-drug-db: picked observation / not_provided. Since it's a database of applications for designations/approvals…

MyDisease:

  • what to do for disgenet (paper, website):
    • knowledge_level: I picked not_provided. Right now, it's a mix because we don't separate by underlying source
    • agent_type: I picked automated_agent since I assumed there's an automated pipeline for processing/integrating all the sources
  • what to do for disease-pheno from hpo-annotations: I'm picking knowledge_assertion / manual_agent based on assumptions. But in the evidence part of "phenotype.hpoa format", it's implied that some info comes from parsing the omim data and I'm not sure how that affects this.

MyGene:

  • what to do for ConsensusPathDB/cpdb (paper, website) - aggregator:
    • knowledge_level: I picked knowledge_assertion. But I don't know - does it depend on what cpdb is doing or what the underlying sources are doing (KEGG, wikipathways, biocarta)?
    • agent_type: I picked automated_agent since I assume cpdb is using an automated pipeline to process/integrate all the sources
  • ncbi-gene: same issues as cpdb, it's an aggregator. Picked same knowledge_level / agent_type as above
  • panther (orthologs): picked knowledge_assertion / computational_model. Paper Figure 4 seems to show that an automated pipeline creates the orthologs, without much manual curation.

repoDB:

  • approved drug indications basically downloaded from drugcentral/drugbank. So I picked knowledge_assertion / automated_agent but maybe another term based on drugcentral/drugbank methods would be better?
  • non-approved drug info from data parsing/cleaning clinicaltrials.gov data. So I picked observation / automated_agent

Issues assigning agent_type

Picked not_provided:

  • foodb: can't find any info on their process. No publication, website says "(obtained from literature)". I can find cases where the food component content is from a different database (phenol explorer)
  • fooddata central: can't tell if their process involves human/manual effort vs automated effort. Seems to report experimental data. Ref: data sources, FAQ
  • hpo gene-to-pheno: can't find any info on their process. Info on webpage's "genes_to_phenotype.txt format" section is vague.
  • monarch: not sure what to pick - depends on underlying primary source? And they may add their own KL/AT assignments
  • pharmgkb: when the relationship wasn't listed here as manually-curated; in those cases I couldn't tell how the assertion was made

Unsure:

  • bindingdb: could count as manual_agent (website shows ~half the data is "curated")? But I picked manual_validation_of_automated_agent based on this line in the "Data Collection" section:

Data imported from other databases, such as PubChem and ChEMBL, are automatically checked for completeness and certain easily detected errors, and any data flagged by these procedures are reviewed manually and corrected if needed.

  • dgidb: seems to use automated pipeline to ingest many resources (ref: 2021 paper, VS 2024 paper is more vague). So I picked automated_agent...
  • ebi-proteins uniprot-to-rhea: I'm assuming we are primarily using Swiss-Prot entries, which are human-curated (ref). But TrEMBL would be automated_agent...
  • iPTMnet: some info seems to be text-mined, vs imported from curated databases (ref: paper materials and methods). So I put automated_agent
  • pharmgkb: assuming manual_agent. But it's unclear what info in pharmgkb isn't manually curated. There is a list of what is
  • rampDB: I put automated_agent. Currently only looking at pathway info, which seems to come from an automated pipeline importing from multiple resources: HMDB, KEGG, WikiPathways, Reactome. Plus some manual curation for chemical/metabolite ID mappings.

@tokebe

I've updated the posts above since all the x-bte annotation work is done.

The rest of step 1 (ingesting/formatting the x-bte fields) + step 2 are yours?

@tokebe

For the KL/AT edge-attributes from x-bte annotation...

  • the edge-attribute types are missing the biolink prefixes ("biolink:knowledge_level", "biolink:agent_type")
  • the values are 1-element arrays, when we want them to be strings (illustrated below).
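
To illustrate the two issues (hypothetical values), the attributes currently come out like:

{
   "attribute_type_id": "knowledge_level",
   "value": ["knowledge_assertion"]
}

when we want:

{
   "attribute_type_id": "biolink:knowledge_level",
   "value": "knowledge_assertion"
}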

The format for the constructed edges looks correct/good. I saw examples of all 3 cases.

(Based on a quick review only)

Latest commits should fix these.

Related #715 could be done after this issue is reasonably done.

Update on Monarch (earlier comment in this issue):

I've updated the KL/AT assignments for Monarch API operations, using the info provided in monarch-initiative/monarch-app#675 (comment). So we're good for now!

The code was deployed today to Prod as part of the Octopus release. I tested and it's live.

I'm closing this issue because our side of the work is done. However, note that Text-Mining/Multiomics haven't updated their BioThings APIs for all instances to provide KL/AT edge-attributes yet (I was keeping notes in a comment here).