Benja1972 / LifeScience_KG

POC Life Science knowledge graph in Neo4j from public sources

KG construction

Intro

We use CURIE identifiers (https://en.wikipedia.org/wiki/CURIE). They can be resolved at https://identifiers.org/: the Identifiers.org Resolution Service provides consistent access to life science data using Compact Identifiers. (TODO: evaluate https://bioregistry.io/ as an alternative.)
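As a minimal sketch of how a CURIE can be handled in code: the snippet below splits a compact identifier into prefix and local ID and builds an Identifiers.org URL from it. The names `Curie`, `parse_curie` and `resolver_url` are hypothetical helpers for illustration and are not part of this repository. No network request is made.

```python
from typing import NamedTuple


class Curie(NamedTuple):
    prefix: str
    local_id: str


def parse_curie(curie: str) -> Curie:
    """Split 'prefix:local_id' on the first colon only."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return Curie(prefix, local_id)


def resolver_url(curie: str, base: str = "https://identifiers.org/") -> str:
    """Build a resolver URL by appending the CURIE to the resolver base."""
    parsed = parse_curie(curie)
    return f"{base}{parsed.prefix}:{parsed.local_id}"


print(resolver_url("taxonomy:9606"))
# prints https://identifiers.org/taxonomy:9606
```

Splitting on the first colon only keeps local IDs that themselves contain a colon intact.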

Glossary

  • id - Primary reference

  • labels - type of entity

  • xref - secondary references (cross-references) used for linking

  • tui - UMLS semantic type codes (Type Unique Identifiers)

  • cui - UMLS concept codes (Concept Unique Identifiers)

  • umls_sty - UMLS semantic type literal
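The glossary fields could be carried on a single entity record, roughly as sketched below. The class name `Entity`, the choice of list-valued fields, and the `EX:` and `C0000000` values are assumptions made for illustration. They are placeholders, not real identifiers and not the repository's actual schema.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Entity:
    id: str                                            # primary reference (a CURIE)
    labels: list[str]                                  # type(s) of the entity
    xref: list[str] = field(default_factory=list)      # secondary references
    tui: list[str] = field(default_factory=list)       # UMLS semantic type codes
    cui: list[str] = field(default_factory=list)       # UMLS codes
    umls_sty: list[str] = field(default_factory=list)  # UMLS semantic type literals


# Placeholder values only: EX:... and C0000000 are not real identifiers.
example = Entity(
    id="EX:0001",
    labels=["Disease"],
    xref=["EX2:0042"],
    tui=["T047"],
    cui=["C0000000"],
    umls_sty=["Disease or Syndrome"],
)

print(asdict(example)["labels"])
```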

Ontologies

Ontologies are used to classify entities. The biomedical domain has a long list of well-known, curated ontologies. We select one ontology per entity type as the primary ontology for that entity type.

  • ATC - ontology for chemical compounds. The Anatomical Therapeutic Chemical (ATC) Classification System is used for the classification of active ingredients of drugs according to the organ or system on which they act and their therapeutic, pharmacological and chemical properties. It is controlled by the World Health Organization Collaborating Centre for Drug Statistics Methodology (WHOCC)

  • MeSH - mixed ontology of bio-medical terms, chemical compounds, diseases etc.

  • DOID - ontology for diseases. The Disease Ontology (DO) has been developed as a standardized ontology for human disease. Its purpose is to provide the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts. It is maintained through the collaborative efforts of biomedical researchers, coordinated by the University of Maryland School of Medicine, Institute for Genome Sciences. The Disease Ontology semantically integrates disease and medical vocabularies through extensive cross-mapping of DO terms to MeSH, ICD, NCI's thesaurus, SNOMED and OMIM.

Entities

List of datasets containing entity instances (metadata) and information on relationships, interactions etc., between entities.

Relationships

Output format

The output is a set of tables whose file names and column names follow fixed conventions, so that the data can be loaded as a connected graph. All processed tables can be passed to the graph construction script, which builds the KG automatically.

Fig. 1: Example of table for entities

  • Drug_product.csv - the table name is the exact entity type

  • ID, Product_type, Application_number - plain column names hold metadata for the node

  • REL:has_ingredient - a column starting with REL: holds outgoing relations, with the relation name after the ":". It contains a list of IDs of the entities to be linked with the entity in the current row: (n)-[has_ingredient]->(m)

  • LER:manufactures - a column starting with LER: holds incoming relations, with the relation name after the ":". It contains a list of IDs of the entities to be linked with the entity in the current row: (n)<-[manufactures]-(m)
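The column conventions above can be read into nodes and directed edges roughly as follows. This is a sketch under stated assumptions, not the repository's construction script: the separator inside relation cells (";" here), the sample rows, the `EX:` identifiers and the function name `read_entity_table` are all hypothetical.

```python
import csv
import io
from pathlib import Path

LIST_SEP = ";"  # assumption: separator between IDs inside a relation cell

# Hypothetical content of Drug_product.csv (placeholder IDs).
SAMPLE = """ID,Product_type,Application_number,REL:has_ingredient,LER:manufactures
EX:P1,tablet,A001,EX:C1;EX:C2,EX:M1
EX:P2,capsule,A002,EX:C1,
"""


def read_entity_table(file_name, text):
    """Return (nodes, edges); edges are (source_id, relation, target_id)."""
    entity_type = Path(file_name).stem  # the table name is the entity type
    nodes, edges = [], []
    for row in csv.DictReader(io.StringIO(text)):
        props = {}
        for column, cell in row.items():
            direction, sep, rel_name = column.partition(":")
            if sep and direction in ("REL", "LER"):
                ids = [t.strip() for t in (cell or "").split(LIST_SEP) if t.strip()]
                for other in ids:
                    if direction == "REL":   # outgoing: (n)-[rel]->(m)
                        edges.append((row["ID"], rel_name, other))
                    else:                    # incoming: (n)<-[rel]-(m)
                        edges.append((other, rel_name, row["ID"]))
            else:
                props[column] = cell         # plain column: node metadata
        nodes.append({"labels": [entity_type], **props})
    return nodes, edges


nodes, edges = read_entity_table("Drug_product.csv", SAMPLE)
for edge in edges:
    print(edge)
```

Storing every edge as (source, relation, target) means that REL: and LER: columns differ only in which side the current row ends up on.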

An example table is shown in Fig. 1.

The final knowledge graph is shown in Fig. 2.

Fig. 2: Graph

Data Review

Ontology/Semantic type (probable)

Genes

Proteins

Compounds

Drugs

Biologics

Biologics include vaccines, blood and tissue products, and biotechnology products. New biologics are required to go through a premarket approval process called a Biologics License Application (BLA), similar to that for drugs.

Product

Medical Devices

Food

Diseases

Pathways

Clinical Trials

Patents

Interactions

Combo

APIs

Knowledge Graphs

Notes

LyX LaTeX

Elsevier Biology KG

Entity type              Quantity
Cell                        4,181
Cell object                   609
Cell process               14,906
Clinical parameter          5,284
Disease                    22,433
Genetic Variant           157,344
Small molecule (drug)   1,057,236
Protein/gene              141,779
Complex                       992
Functional class            5,485
Organ                       3,857
Tissue                        579
Virus                      25,287
Treatment                      82
Total                   1,440,054
