Loading BioKG in Neo4j
DimitrisAlivas opened this issue
Hey folks,
First of all, I'd like to thank you for this contribution. Having a unified biomedical KG is an essential resource for research in this domain.
I would like to use BioKG in my work. Specifically, we would like to train a link predictor to perform the task of drug-target interaction prediction and utilise the benchmarks you so thoughtfully include, in order to compare the performance of our DTI approach vs others.
For this, I thought it would be useful to have the final BioKG data (in /data/biokg/) uploaded to a Neo4j property graph store, to enable querying for specific benchmarks (using hyper-relations, for example: a relation DTI with qualifier (benchmark: 'FDA') or DDI with qualifier (benchmark: MINERAL)). Furthermore, having BioKG as a Neo4j-ready graph could increase usability and visibility, so I plan on making it public once I manage to get it done.
The 2 issues I'm facing:
- The number of unique entities/relations that I see after loading the .tsv data in Pandas is different from the numbers reported in the paper, so I've been looking into what could have gone wrong.
- The way I create the Neo4j graph is as follows:
  - Load all entity types from metadata + properties.
  - Get the unique IDs and use them to create nodes with Cypher.
  - Load the links.
  - Match on the (already) created nodes by node_id and, if both subject and object match, create the link.
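The steps above can be sketched roughly as follows. This is a minimal illustration, not the loader actually used: it assumes the links are (head, relation, tail) triples, stores every node under a hypothetical `Entity` label with an `id` property, and only builds the Cypher text and parameter batches. Sending them to Neo4j (for example through the official Python driver) is left out, and the function names are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical label and property names; adjust to the real schema.
NODE_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (n:Entity {id: row.id}) "
    "SET n.type = row.type"
)


def link_query(relation):
    """Cypher for one relation type; rows whose endpoints do not MATCH create nothing."""
    # The relationship type is inlined into the query text, so validate it first.
    if not relation.replace("_", "").isalnum():
        raise ValueError(f"unsafe relation name: {relation!r}")
    return (
        "UNWIND $rows AS row "
        "MATCH (s:Entity {id: row.head}) "
        "MATCH (o:Entity {id: row.tail}) "
        f"MERGE (s)-[:{relation}]->(o)"
    )


def node_rows(entities):
    """entities: iterable of (entity_id, entity_type); duplicate IDs are dropped."""
    seen = {}
    for entity_id, entity_type in entities:
        seen.setdefault(entity_id, entity_type)
    return [{"id": i, "type": t} for i, t in seen.items()]


def link_batches(triples):
    """Group (head, relation, tail) triples by relation type."""
    batches = defaultdict(list)
    for head, relation, tail in triples:
        batches[relation].append({"head": head, "tail": tail})
    return {rel: (link_query(rel), rows) for rel, rows in batches.items()}
```

Because the link query uses MATCH on both endpoints, a link whose subject or object was never created as a node is skipped silently, which is why missing node identifiers show up as missing relationships rather than as errors.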
Following the logic above, everything runs smoothly up to the point where I try to load the links that include COMPLEXES + PATHWAYS, for which I cannot find any matches.
If I understand the data model correctly, complex_ids exist only as part of the LINKS file and do not appear in the properties + metadata files (?).
Which identifiers are the ones that I should use to create the unique Complex nodes?
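A quick way to see which identifiers are affected is to list the link endpoints that have no node, grouped by relation type. A minimal sketch, assuming tab-separated (head, relation, tail) triples; the function name, relation names, and IDs in the sample are made up for illustration and are not real BioKG data:

```python
import csv
import io
from collections import defaultdict


def dangling_endpoints(links_file, known_ids):
    """Return {relation: set of endpoint IDs that are not in known_ids}."""
    missing = defaultdict(set)
    for row in csv.reader(links_file, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        head, relation, tail = row
        for endpoint in (head, tail):
            if endpoint not in known_ids:
                missing[relation].add(endpoint)
    return dict(missing)


# Made-up sample: C1 appears only in the links, never in the node files.
sample_links = io.StringIO(
    "P1\tPROTEIN_DISEASE\tX1\n"
    "C1\tCOMPLEX_PATHWAY\tW1\n"
)
known = {"P1", "X1", "W1"}
report = dangling_endpoints(sample_links, known)
print(report)
```

Running this on the real links file with the set of node IDs already created would show whether the unmatched endpoints are confined to the complex and pathway relations.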
Apologies for the lengthy post and for potential inaccuracies on my end.
Minor comment:
A typo I found while reading your documentation:
Line 93 in 92a71e7
The relation should be PROTEIN_DISEASE if I'm not mistaken.
Thank you again for your great contribution! I would greatly appreciate any help :-)
Cheers!
Hi Dimitris,
Thanks a lot for the great description and the details in this issue. It has been a good year since I last touched this project, and I have changed jobs a few times since then. However, I want to provide you with some support on the issues you have.
I am going to give you a very lazy answer now and probably in a few days I can look at this more carefully and give you a better answer.
In relation to issue 1, have you tried using the pre-built KG in the releases section? It should be the same as in the paper.
My quick guess is that this problem is caused by a change in a source dataset. I vaguely remember, for example, that DrugBank and Reactome made some changes after we published, which changed the output of our script compared to what was reported in the paper.
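One way to check that guess would be to compare the triples of a fresh build against the released one. A rough sketch, assuming each build is a tab-separated file of (head, relation, tail) triples; the function names are hypothetical and the inline samples are made up:

```python
import csv
import io


def load_triples(handle):
    """Read (head, relation, tail) triples from a TSV handle into a set."""
    return {tuple(row) for row in csv.reader(handle, delimiter="\t") if len(row) == 3}


def summarize(triples):
    """Count unique entities, relation types, and triples."""
    entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
    relations = {r for _, r, _ in triples}
    return {"entities": len(entities), "relations": len(relations), "triples": len(triples)}


def diff_builds(released, rebuilt):
    """Triples that only one of the two builds contains."""
    return {"only_released": released - rebuilt, "only_rebuilt": rebuilt - released}


# Made-up samples standing in for the released and the locally rebuilt links.
released = load_triples(io.StringIO("D1\tDDI\tD2\nD1\tDTI\tP1\n"))
rebuilt = load_triples(io.StringIO("D1\tDDI\tD2\nD1\tDTI\tP2\n"))
print(summarize(released), summarize(rebuilt))
print(diff_builds(released, rebuilt))
```

With real files, each handle would come from `open(path, newline="")`. If the differences cluster in relations fed by one source, that source is the likely cause of the count mismatch.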
I could not get issue 2 properly, so I will try to look at it again later and try to give you an answer.
In relation to the typo, thanks for noticing that. It is a small thing, I know, but would you be kind enough to fix it and open a pull request? I will accept it immediately.
Thanks a lot.
Sameh
Hey Sameh,
Thank you for taking the time and for your answer! I think it makes a lot of sense given how frequently many of the integrated data sources are updated (e.g. DrugBank, as you mentioned).
I used the instructions in the repository's readme to compile biokg, in order to make sure it includes the latest versions of the sources. I will also check the version in the releases, as you suggest. The goal here is to get the most accurate, and hence most up-to-date, graph for our experiments.
Regarding issue 2, it's related to the semantics of pathways and complexes in relation to proteins, because they affect the way I convert the data for Neo4j. Thanks for taking some more time to look into it.
Best,
Dimitrios