Open Data Detection in Publications (ODDPub)

ODDPub is a text mining algorithm that parses a set of publications and detects which publications disseminated Open Data or Open Code together with the publication. It is tailored towards biomedical literature.

Authors

Nico Riedel (nico.riedel@bihealth.de), Miriam Kip, Evgeny Bobrov - QUEST Center for Transforming Biomedical Research, Berlin Institute of Health

Publication

More information on the development and validation of the algorithm can be found in the bioRxiv preprint https://www.biorxiv.org/content/10.1101/2020.05.11.088021v1. The related training and validation datasets can be found under https://osf.io/yv5rx/.

Installation

The latest version of the algorithm is structured as an R package and can easily be installed with the following command:

# install.packages("devtools") # if devtools currently not installed
devtools::install_github("quest-bih/oddpub")

Description

The algorithm searches for several categories of similar keywords in each sentence. Multiple categories have to match for a single sentence to trigger a detection. Among keyword categories are categories for specific biomedical repositories as well as their corresponding accession numbers (as regular expressions), general-purpose repositories or different file formats typically used to distribute raw data in the supplement.

Additionally, Open Code dissemination is detected using keywords categories for source code or code repositories.

Usage

The package exposes four functions that allow the following workflow:

oddpub::pdf_convert(PDF_folder, output_folder)

Converts PDFs contained in one folder to txt-files and saves them into the output folder. Requires external poppler library for the conversion.

PDF_text_sentences <- oddpub::pdf_load(pdf_text_folder)

Loads all text files from given folder.

open_data_results <- oddpub::open_data_search(PDF_text_sentences)

Actual Open Data detection. Returns for each file if Open Data or Open Code is detected. Additionally returns the identified Open Data/Code categories as well as the detected sentences, which can be deactivated using the additional parameter detected_sentences = FALSE.

open_data_results <- oddpub::open_data_search_parallel(PDF_text_sentences)

Paralellized version of the algorithm that starts several parallel processes using the foreach and doParallel package to speed up the detection. Number of processes can be set with the parameter cluster_num (default value: 4).

To validate the algorithm, we manually screened a sample of 792 publications that were randomly selected from PubMed. On this validation dataset, our algorithm detects Open Data publications with a sensitivity of 0.73 and specificity of 0.97.

Detailed description of the keywords

In the following, we give an overview over the different parts of the keywords used in the algorithm

Those are the combined keyword categories that are searched in the full text. If a hit was detected in any of these combined categories, the paper is flagged as Open Data. For definitions of the individual keyword categories, see below.

Combined Keyword Category	Keywords
Field-specific repositories	FIELD_SPECIFIC_REPO NEAR ACCESSION_NR NEAR (AVAILABLE NOT (NOT_AVAILABLE OR WAS_AVAILABLE))
General-purpose repositories	REPOSITORIES NEAR (AVAILABLE NOT (NOT_AVAILABLE OR WAS_AVAILABLE))
Dataset	(DATASET_NAME OUTER_SYM DATASET_NUMBER) OR SUPPLEMENTAL_DATASET
Supplemental table or data	SUPPLEMENTAL_TABLE NEAR_WD(10) (FILE_FORMATS OR ALL_DATA)
Supplementary raw/full data with specific file format	(ALL_DATA NOT NOT_DATA) NEAR_WD(10) FILE_FORMATS
Data availability statement	DATA_AVAILABILITY NEAR_WD(30) (“doi” OR ACCESSION_NR OR REPOSITORIES)
Dataset on Github	DATA NEAR GITHUB NEAR (AVAILABLE NOT (NOT_AVAILABLE OR WAS_AVAILABLE))

Additionally, the detection of Open Code statements is done with the following keywords:

Combined Keyword Category	Keywords
Source-code availability	SOURCE_CODE NEAR AVAILABLE NOT (NOT_AVAILABLE OR WAS_AVAILABLE OR UPON_REQUEST)
Supplementary Source-code	SOURCE_CODE NEAR SUPPLEMENT
All Open Code keywords combined	SOURCE_CODE NEAR (SUPPLEMENT OR (AVAILABLE NOT (NOT_AVAILABLE OR WAS_AVAILABLE OR UPON_REQUEST))

Individual keyword categories:

Definitions	Description	Keywords
x NEAR y	Are the two keywords (or groups of keywords) x and y in the same sentence?
x NEAR_WD(n) y	Second definition of NEAR. This time counts how many words are between the two keywords. If the number lies below a cutoff value (e.g. 10 words), the two input words are considered "near". This additional definition is needed for cases like "S2 Table. Raw data. https://doi.org/10.1371/journal.pone.0158039.s002 (XLS)"
NOT y	y is not included in the same sentence
AVAILABLE	Abbreviation for a set of words that frequently occur to denote that the data have been made available in some way	("included" OR "deposited" OR "released" OR "is provided" OR "are provided" OR "contained in" OR "available" OR "reproduce" OR "accessible" OR "can be accessed" OR "submitted" OR "can be downloaded" OR "reported in" OR "uploaded" OR "are public on")
WAS_AVAILABLE	Set of words that indicate that data is not made available in the paper, but instead that data from a different source was used	("was provided" OR "were provided" OR "was contained in" OR "were contained in" OR "was available" OR "were available" OR "was accessible" OR "were accessible" OR "deposited by" OR "were reproduced")
NOT_AVAILABLE	Set of negated availability phrases	("not included" OR "not deposited" OR "not released" OR "not provided" OR "not contained in" OR "not available" OR "not accessible" OR "not submitted")
UPON_REQUEST	Phrase describing that data are only available upon request	("upon request" OR "on request" OR "upon reasonable request")
ALL_DATA	Set of words describing all data or raw data	("all data" OR "all array data" OR "raw data" OR "full data set" OR "full dataset" OR "crystallographic data" OR "subject-level data")
NOT_DATA	Set of negations of the data phrases	("not all data" OR "not all array data" OR "no raw data" OR "no full data set" OR "no full dataset")
FIELD_SPECIFIC_REPO	Set of names and abbreviations of field-specific repositories	("GEO" OR "Gene Expression Omnibus" OR "European Nucleotide Archive" OR "National Center for Biotechnology Information" OR "European Molecular Biology Laboratory" OR "EMBL-EBI" OR "BioProject" OR "Sequence Read Archive" OR "SRA" OR "ENA" OR "MassIVE" OR "ProteomeXchange" OR "Proteome Exchange" OR "ProteomeExchange" OR "MetaboLights" OR "Array-Express" OR "ArrayExpress" OR "Array Express" OR "PRIDE" OR "DNA Data Bank of Japan" OR "DDBJ" OR "Genbank" OR "Protein Databank" OR "Protein Data Bank" OR "PDB" OR "Metagenomics Rapid Annotation using Subsystem Technology" OR "MG-RAST" OR "metabolights" OR "OpenAgrar" OR "Open Agrar" OR "Electron microscopy data bank" OR "emdb" OR "Cambridge Crystallographic Data Centre" OR "CCDC" OR "Treebase" OR "dbSNP" OR "dbGaP" OR "IntAct" OR "ClinVar" OR "European Variation Archive" OR "dbVar" OR "Mgnify" OR "NCBI Trace Archive" OR "NCBI Assembly" OR "UniProtKB" OR "Protein Circular Dichroism Data Bank" OR "PCDDB" OR "Crystallography Open Database" OR "Coherent X-ray Imaging Data Bank" OR "CXIDB" OR "Biological Magnetic Resonance Data Bank" OR "BMRB" OR "Worldwide Protein Data Bank" OR "wwPDB" OR "Structural Biology Data Grid" OR "NeuroMorpho" OR "G-Node" OR "Neuroimaging Informatics Tools and Resources Collaboratory" OR "NITRC" OR "EBRAINS" OR "GenomeRNAi" OR "Database of Interacting Proteins" OR "IntAct" OR "Japanese Genotype-phenotype Archive" OR "Biological General Repository for Interaction Datasets" OR "PubChem" OR "Genomic Expression Archive" OR "PeptideAtlas" OR "Environmental Data Initiative" OR "LTER Network Information System Data Portal" OR "Global Biodiversity Information Facility" OR "GBIF" OR "Integrated Taxonomic Information System" OR "ITIS" OR "Knowledge Network for Biocomplexity" OR "Morphobank" OR "Kinetic Models of Biological Systems" OR "KiMoSys" OR "The Network Data Exchange" OR "NDEx" OR "FlowRepository" OR "ImmPort" OR "Image Data Resource" OR "Cancer Imaging Archive" OR "SICAS Medical Image Repository" OR "Coherent X-ray Imaging Data Bank" OR "CXIDB" OR "Cell Image Library" OR "Eukaryotic Pathogen Database Resources" OR "EuPathDB" OR "Influenza Research Database" OR "Mouse Genome Informatics" OR "Rat Genome Database" OR "VectorBase" OR "Xenbase" OR "Zebrafish Model Organism Database" OR "ZFIN" OR "HIV Data Archive Program" OR "NAHDAP" OR "National Database for Autism Research" OR "NDAR" OR "PhysioNet" OR "National Database for Clinical Trials related to Mental Illness" OR "NDCT" OR "Research Domain Criteria Database" OR "RdoCdb" OR "Synapse" OR "UK Data Service" OR "caNanoLab" OR "ChEMBL" OR "IoChem-BD" OR "Computational Chemistry Datasets" OR "STRENDA" OR "European Genome–phenome Archive" OR "European Genome phenome Archive" OR "accession number" OR "accession code" OR "accession numbers" OR "accession codes")
ACCESSION_NR	Set of regular expressions that represent the accession number formats of different (biomedicine-related) repositories	("G(SE\|SM\|DS\|PL)[[:digit:]]{2,}" OR "PRJ(E\|D\|N\|EB\|DB\|NB)[:digit:]+" OR "SAM(E\|D\|N)[A-Z]?[:digit:]+" OR "[A-Z]{1}[:digit:]{4}" OR "[A-Z]{2}[:digit:]{6}" OR "[A-Z]{3}[:digit:]{5}" OR "[A-Z]{4,6}[:digit:]{3,}" OR "GCA_[:digit:]{9}\.[:digit:]+" OR "SR(P\|R\|X\|S\|Z)[[:digit:]]{3,}" OR "(E\|P)-[A-Z]{4}-[:digit:]{1,}" OR "[:digit:]{1}[A-Z]{1}[[:alnum:]]{2}" OR "MTBLS[[:digit:]]{2,}" OR "10.17590" OR "10.5073" OR "10.25493" OR "10.6073" OR "10.15468" OR "10.5063" OR "[[:digit:]]{6}" OR "[A-Z]{2,3}_[:digit:]{5,}" OR "[A-Z]{2,3}-[:digit:]{4,}" OR "[A-Z]{2}[:digit:]{5}-[A-Z]{1}" OR "DIP:[:digit:]{3}" OR "FR-FCM-[[:alnum:]]{4}" OR "ICPSR [:digit:]{4}" OR "SN [:digit:]{4}")
REPOSITORIES	Set of names of general-purpose repositories	("figshare" OR "dryad" OR "zenodo" OR "dataverse" OR "DataverseNL" OR "osf" OR "open science framework" OR "mendeley data" OR "GIGADB" OR "GigaScience database" OR "OpenNeuro")
FILE_FORMATS	Set of file formats	("csv" OR "zip" OR "xls" OR "xlsx" OR "sav" OR "cif" OR "fasta")
GITHUB	Github for data has to be treated differently, as we need additional information that data and not only code was shared on Github	(“github”)
DATA	Data keywords only used for the Github category	("data" OR "dataset" OR "datasets")
ALL_DATA	Set of words describing all data or raw data	("all data" OR "all array data" OR "raw data" OR "full data set" OR "full dataset" OR "crystallographic data" OR "subject-level data")
NOT_DATA	Set of negations of the data phrases	("not all data" OR "not all array data" OR "no raw data" OR "no full data set" OR "no full dataset")
DATA_AVAILABILITY	Set of headings for data availability section	("Data sharing" OR "Data Availability Statement" OR "Data Availability" OR "Data deposition" OR "Deposited Data" OR "Data Archiving" OR "Availability of data and materials" OR "Availability of data" OR "Data Accessibility" OR "Accessibility of data")
x OUTER y	makes all possible pairwise combinations of strings x and y
x OUTER_SYM y	symmetrical outer product with both orderings of vectors
SUPPLEMENTAL_TABLE_NAME	Set of keywords denoting supplemental tables or files	("supplementary table" OR "supplementary tables" OR "supplemental table" OR "supplemental tables" OR "table", "tables" OR "additional file" OR "file", "files")
SUPPLEMENTAL_TABLE_NUMBER	Possible numbers of supplemental tables or files	("S[[:digit:]]", "[[:digit:]]", "[A-Z]{2}[[:digit:]]")
SUPPLEMENTAL_TABLE	Numbered supplemental table or file	SUPPLEMENTAL_TABLE_NAME OUTER SUPPLEMENTAL_TABLE_NUMBER
SUPPLEMENTAL_DATASET	Numbered supplemental dataset	("supplementary data [[:digit:]]{1,2}" OR "supplementary dataset [[:digit:]]{1,2}" OR "supplementary data set [[:digit:]]{1,2}" OR "supplemental data [[:digit:]]{1,2}" OR "supplemental dataset [[:digit:]]{1,2}" OR "supplemental data set [[:digit:]]{1,2}")
DATASET_NAME	Names for datasets	("data" OR "dataset" OR "datasets" OR "data set" OR "data sets")
DATASET_NUMBER	Number for dataset	("S[[:digit:]]{1,2}")
DATA_JOURNAL_DOIS	Set of Open Data Journal DOIs for which the publication DOI is checked (from filename, not part of actual keyword search)	("10.1038/s41597-019-", "10.3390/data", "10.1016/j.dib")
SUPPLEMENT	Set of expression describing the supplement of an article	("supporting information" OR "supplement" OR "supplementary data")
SOURCE_CODE	Set of expressions describing source code	("source code" OR "analysis script" OR "github" OR "SAS script" OR "SPSS script" OR "R script" OR "R code" OR "python script" OR "python code" OR "matlab script" OR "matlab code")

License

ODDPub is available under the MIT license. See the LICENSE file for more info.

tklebel / oddpub