The-Academic-Observatory / academic-observatory-workflows

Telescopes, Workflows and Data Services for the Academic Observatory

Home Page: https://academic-observatory-workflows.readthedocs.io


Telescope workflow implementation: OpenAire

aroelo opened this issue · comments


There are three ways of bulk-accessing the OpenAIRE data:

  • OpenAIRE Research Graph Dumps
  • OAI-PMH
  • Bulk access to projects

OpenAIRE Research Graph Dumps
Can be downloaded from Zenodo (https://zenodo.org/search?page=1&size=20&q=OpenAIRE%20Research%20Graph%20Dump) or explored through their beta portal.
There is one dump available from 18-12-2019 and one from 03-11-2020; the latter also has an updated JSON schema.

Each publication on Zenodo contains several dumps/files; the 2019 set is slightly different from the 2020 one.
2019 files:

publication.gz: metadata records about research literature (includes types of publications listed here)
dataset.gz: metadata records about research data (includes the subtypes listed here)
software.gz: metadata records about research software (includes the subtypes listed here)
orp.gz: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.gz: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.gz: metadata records about providers whose content is available in the OpenAIRE Research Graph. These include institutional and thematic repositories, journals, aggregators, and funders' databases.
project.gz: metadata records about projects funded by a given funder.
<funder>_result.gz: metadata records about research results (publications, datasets, software, and other research products) funded by a given funder.

2020 files:

publication_[part].tar: metadata records about research literature (includes types of publications listed here)
dataset.tar: metadata records about research data (includes the subtypes listed here)
software.tar: metadata records about research software (includes the subtypes listed here)
otherresearchproduct.tar: metadata records about research products that cannot be classified as research literature, data or software (includes types of products listed here)
organization.tar: metadata records about organizations involved in the research life-cycle, such as universities, research organizations, funders.
datasource.tar: metadata records about providers whose content is available in the OpenAIRE Research Graph. These include institutional and thematic repositories, journals, aggregators, and funders' databases.
project.tar: metadata records about projects funded by a given funder.
relation_[part].tar: metadata records about relations between entities in the graph
communities_infrastructures.tar: metadata records about research communities and research infrastructures

This image from https://doi.org/10.5281/zenodo.4238939 helps to understand the relationship between these files.
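Downloading the dumps programmatically should be straightforward via the Zenodo REST API, which exposes each publication's file list as JSON. A minimal sketch (the record id 4238939 is taken from the DOI above; the `files`/`links` layout is the standard Zenodo records API shape):

```python
import json
from urllib.request import urlopen

ZENODO_API = "https://zenodo.org/api/records/{}"


def list_dump_files(record: dict) -> list:
    """Extract (filename, download_url) pairs from a Zenodo record's file list."""
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]


def fetch_record(record_id: int) -> dict:
    """Fetch a record's metadata from the Zenodo REST API."""
    with urlopen(ZENODO_API.format(record_id)) as resp:
        return json.load(resp)


# Example (requires network): record 4238939 is the 03-11-2020 graph dump.
# for name, url in list_dump_files(fetch_record(4238939)):
#     print(name, url)
```

The same pattern would cover other Zenodo-hosted datasets, so it could be factored into a shared helper if more Zenodo telescopes appear later.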

OAI-PMH
An OAI-PMH endpoint is available as well; one note from their documentation:

Currently the OAI-PMH publisher is not supporting incremental harvesting.
Although the usage of the OAI parameters 'from' and 'until' is handled by the OAI publisher, the datestamps of metadata records are updated about every week.

I'm not sure what they mean by 'the datestamps of metadata records are updated about every week'.
Considering the data size it might be best to initially download the dumps instead of using the OAI-PMH harvester. Perhaps the harvester can be used to update the data regularly with newly added/edited records, but I'm skeptical since they mention above that incremental harvesting is not supported.
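For reference, a paged OAI-PMH harvest is mostly a loop over `resumptionToken`s. A minimal sketch below; the endpoint URL is an assumption (check OpenAIRE's OAI-PMH docs), and the parsing is standard OAI-PMH 2.0:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Assumed endpoint -- verify against OpenAIRE's OAI-PMH documentation.
OAI_ENDPOINT = "https://api.openaire.eu/oai_pmh"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_records(xml_text: str):
    """Return (record datestamps, resumption token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    stamps = [el.text for el in
              root.iter("{http://www.openarchives.org/OAI/2.0/}datestamp")]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    return stamps, (token_el.text if token_el is not None else None)


def harvest_page(token=None):
    """Fetch one ListRecords page; pass the resumption token to continue."""
    params = (f"verb=ListRecords&resumptionToken={token}" if token
              else "verb=ListRecords&metadataPrefix=oai_dc")
    with urlopen(f"{OAI_ENDPOINT}?{params}") as resp:
        return parse_list_records(resp.read().decode("utf-8"))
```

The datestamps returned here are what the `from`/`until` parameters filter on, so logging them during a trial harvest would show how often records actually churn.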

Bulk access to projects
The APIs offer custom access to metadata about projects funded by a selection of international funders for the DSpace and EPrints platforms. The currently supported funding streams and their codes are:

FP7: The 7th Framework Programme funded by the European Commission
WT: Wellcome Trust funding programme
H2020: Horizon2020 Programme funded by the European Commission
FCT: The funding programme of Fundação para a Ciência e a Tecnologia, the national funding agency of Portugal
ARC: the funding programme of the Australian Research Council
NHMRC: the funding programme of the Australian National Health and Medical Research Council
SFI: Science Foundation Ireland
HRZZ: Croatian Science Foundation
MZOS: Ministry of Science, Education and Sports of the Republic of Croatia
MESTD: The Ministry of Education, Science and Technological Development of Serbia
NWO: The Netherlands Organisation for Scientific Research

I'm not sure if this is of interest to us. I think this project data is included in the Zenodo files as well, and this is just an easier alternative if you're only after a specific funder's projects.
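If we did want it, the request would just be a URL built from the funder code. A sketch of a URL builder using the codes listed above; the base URL and path layout are assumptions based on the API description, not verified against the OpenAIRE docs:

```python
# Assumed base URL and path layout -- verify against the OpenAIRE API docs.
API_BASE = "http://api.openaire.eu/projects"

FUNDING_STREAMS = {"FP7", "WT", "H2020", "FCT", "ARC", "NHMRC",
                   "SFI", "HRZZ", "MZOS", "MESTD", "NWO"}


def projects_url(platform: str, stream: str, page_size: int = 100) -> str:
    """Build a bulk-projects URL for a supported funding stream."""
    if stream not in FUNDING_STREAMS:
        raise ValueError(f"unsupported funding stream: {stream}")
    if platform not in ("dspace", "eprints"):
        raise ValueError(f"unsupported platform: {platform}")
    return f"{API_BASE}/{platform}/{stream}/ALL/{page_size}"
```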

Questions:

  • How often is a new dump expected to be released on Zenodo? I can only find the 2 publications so far.
  • Will we combine all the different files in a single table or separate tables?
  • For the OAI-PMH, how is the datestamp determined and how often are new records added? The newest record seems to be from 2020-05-12.

Given how recent the last Zenodo dump is, I think grabbing that would be a good start. It's always hard to know how often it will be updated, though, if at all. A general example of downloading content from Zenodo will likely be useful, since other datasets are also hosted there and might become future telescopes. I'm guessing the OAI-PMH feed is difficult to harvest incrementally because lots of existing records keep being updated, but it's hard to tell from their description.

Separate tables will be fine; that resembles how MAG looks, so I can write some SQL to bring the key bits together.
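The relation files are what make those joins possible. A minimal Python sketch of the idea, assuming each file decompresses to JSON Lines with an `id` field on entity records and `source`/`target` ids on relation records (field names are an assumption here, to be checked against the actual dump schema):

```python
import json


def index_by_id(lines):
    """Index JSON Lines entity records by their 'id' field."""
    return {rec["id"]: rec for rec in map(json.loads, lines)}


def join_relations(relations, left, right):
    """Yield (left_record, right_record) pairs for each relation whose
    source/target ids both resolve in the given entity indexes."""
    for rel in map(json.loads, relations):
        src, tgt = rel.get("source"), rel.get("target")
        if src in left and tgt in right:
            yield left[src], right[tgt]
```

In BigQuery the same shape would just be a relation table joined against two entity tables on those id columns.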

In terms of scheduling, that's a difficult one; it's almost an only_once run. There is value in just mapping out all the schemas, and potentially doing some parts manually so we can get an initial view into the data. However, if turning it into a full telescope isn't much more work, then having an example of getting content from Zenodo is helpful in its own right.