NLeSC / rcn_py

RCN is a python package to analyze Research Collaboration Network

Using Graph Database

ZNBai opened this issue · comments

commented

At the moment we are using SQLite databases. That is fine for the current amount of data, but the data set will grow much larger in the future, so I will try a graph database such as Neo4j, which should perform better on this kind of relationship-heavy data.

commented

Progress: I have completed the functions that store data from Scopus CSV files as Neo4j nodes and edges, and I am currently storing all 2022 papers that involve Dutch researchers.

Why Scopus? Scopus has very comprehensive paper data. In particular, its metadata includes authors' affiliations, countries, and paper keywords, which are not available on other paper search websites.

How? More than 50,000 papers involve Dutch researchers in a single year, and the Scopus API is not suited to handling that much data. Therefore, I use the Scopus Document Search website (which requires an academic IP address, for example via the UvA VPN). The query string is as follows:
PUBYEAR > 2015 AND PUBYEAR < 2024 AND ( LIMIT-TO ( OA , "all" ) ) AND ( LIMIT-TO ( AFFILCOUNTRY , "Netherlands" ) ) AND ( LIMIT-TO ( PUBSTAGE , "final" ) ) AND ( LIMIT-TO ( PUBYEAR , 2022 ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) )
This query returns the open-access, final-stage, English-language papers from 2022 whose authors include at least one researcher working at a Dutch institution. The authors in the resulting data are therefore mostly Dutch researchers, plus researchers from other countries who have collaborated with them.
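
Since the same search has to be repeated for every publication year, the query string above can be generated rather than edited by hand. The following is only a small sketch: the function name `build_scopus_query` and its parameters are hypothetical and not part of rcn_py, and the clauses are copied unchanged from the query above.

```python
def build_scopus_query(year: int, country: str = "Netherlands") -> str:
    """Build the Scopus Document Search query for one publication year.

    Hypothetical helper: the clauses mirror the query string quoted in
    this issue, with only the year and the country made configurable.
    """
    clauses = [
        "PUBYEAR > 2015",
        "PUBYEAR < 2024",
        '( LIMIT-TO ( OA , "all" ) )',
        f'( LIMIT-TO ( AFFILCOUNTRY , "{country}" ) )',
        '( LIMIT-TO ( PUBSTAGE , "final" ) )',
        f"( LIMIT-TO ( PUBYEAR , {year} ) )",
        '( LIMIT-TO ( LANGUAGE , "English" ) )',
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Print the query to paste into the Scopus search page for 2022.
    print(build_scopus_query(2022))
```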

Neo4j database structure:

  • Two types of nodes: Person and Publication;
  • One type of relationship: IS_AUTHOR_OF

Person Node properties:
scopus_id (string), name (string), affiliation (string), country (string), keywords (list), year of published papers (list), subject (list)

Publication Node properties:
doi (string), title (string), cited_by (num), year (string), keywords (list), subject (list)

IS_AUTHOR_OF relationship properties:
author_name (string), title (string), year (string)

commented

Storing the data takes a long time: the metadata of 20,000 papers takes more than 6 hours, and there are 50,000+ papers every year. I can therefore let the import run while I work on the next tasks (such as finding visualization tools that can connect to Neo4j).

Solved: After creating a CONSTRAINT for the Person and Publication nodes, storing a year's paper data takes only 20 minutes.
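
For reference, constraints of this kind could be created as sketched below. The issue does not say which properties were constrained, so `scopus_id` and `doi` are assumptions, as are the constraint names and the helper `create_constraints`. The `FOR ... REQUIRE` syntax is the one used by recent Neo4j versions; older 4.x releases use `ON ... ASSERT` instead. A uniqueness constraint is backed by an index, so a lookup on the constrained property no longer has to scan every node with that label, which is the likely reason for the speed-up.

```python
# Cypher statements for the two uniqueness constraints.
# Property choices (scopus_id, doi) are assumptions, not confirmed above.
CONSTRAINTS = [
    "CREATE CONSTRAINT person_scopus_id IF NOT EXISTS "
    "FOR (p:Person) REQUIRE p.scopus_id IS UNIQUE",
    "CREATE CONSTRAINT publication_doi IF NOT EXISTS "
    "FOR (pub:Publication) REQUIRE pub.doi IS UNIQUE",
]


def create_constraints(uri, user, password):
    """Create the constraints once, before the bulk import is started."""
    # Imported here so the statements above can be inspected
    # even when the neo4j driver package is not installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for statement in CONSTRAINTS:
                session.run(statement).consume()
```

A call could look like `create_constraints("bolt://localhost:7687", "neo4j", "<password>")`, where the URI and credentials are placeholders for a local instance.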