pdessauw / cord19-cdcs-nist

Curated Archive for Covid-19 Research Challenge Dataset

Curated Archive for COVID-19 Research Challenge Dataset (cord19-cdcs-nist)

This GitHub repository contains a downloadable snapshot of the COVID-19 Data Repository maintained by the National Institute of Standards and Technology (NIST), curated from the COVID-19 Open Research Dataset (CORD-19) provided by the Allen Institute for AI.

The COVID-19 Data Repository provides searchable CORD-19 data and metadata, including full text extracted from the original CORD-19 JavaScript Object Notation (JSON) files and entities identified using the en_ner_bionlp13cg_md named-entity recognition (NER) model, which is trained on the BIONLP13CG corpus. It is built using the Configurable Data Curation System (CDCS) developed at NIST.

Downloading the Data

The purpose of this repository is to provide a platform-neutral means for bulk downloads of curated COVID-19 data. The downloadable archives are versioned using GitHub Releases, based on the Data Repository's schema and time-stamped archival dates. This makes it easier to access the latest data programmatically, or to pin a specific release as a consistent dependency for reproducibility.

To download, go to the releases page and select the desired release and zip-archived format, or download the latest JSON, XML, or CSV version directly.
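For programmatic access, the release metadata can be read from the GitHub REST API. The following is a minimal sketch rather than an official client: the owner and repository names are taken from this page's title, the helper names (release_api_url, pick_assets, fetch_release) are hypothetical, and the asset file names used in the usage note are assumptions, since the actual archive names are not listed here.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com/repos/{owner}/{repo}/releases"


def release_api_url(owner, repo, tag=None):
    """Build the GitHub REST API URL for the latest release, or for a pinned tag."""
    base = API_ROOT.format(owner=owner, repo=repo)
    return f"{base}/latest" if tag is None else f"{base}/tags/{tag}"


def pick_assets(release, keyword):
    """Return (name, download_url) pairs for assets whose file name contains keyword."""
    keyword = keyword.lower()
    return [
        (asset["name"], asset["browser_download_url"])
        for asset in release.get("assets", [])
        if keyword in asset["name"].lower()
    ]


def fetch_release(owner, repo, tag=None):
    """Download and decode the release metadata (requires network access)."""
    request = urllib.request.Request(
        release_api_url(owner, repo, tag),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "cord19-download-sketch",  # arbitrary illustrative value
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as pick_assets(fetch_release("pdessauw", "cord19-cdcs-nist"), "json") would list the JSON archives of the latest release, assuming their file names contain "json". Passing a tag instead pins the download to one release, which is what makes a workflow reproducible.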

Data Packages

To further support rapid, reproducible data science workflows, this repository builds data packages that interface directly with common statistics languages. They are used through separately installable libraries that bring together the data and the tools for analyzing CORD-19 in one convenient place:

Language    Repository
Python      cv-py

More languages are certainly possible, depending on community need. Data packages can be downloaded directly from this repository's releases page, or by following the instructions in the language-specific repositories above. More information can be found in the readme inside each language-specific <lang>-interface folder.

License:Other