IntelCompH2020 / nihmporter

Tool to download NIH's ExPORTER Data Catalog

nihmporter is a Python tool that downloads and packs the data published by the National Institutes of Health (NIH).

Installation

You can use make_conda_environment.sh to build a proper Anaconda environment (by default, named nih), or inspect it to see the exact requirements.

Usage

Activate the above environment and run:

# after activating the appropriate conda environment
./import.py

The script should produce a set of feather or pickle files, each storing a Pandas DataFrame (as of July 2021, very large feather files cause memory issues). Within any one of them, the same record may, and most likely will, appear more than once: until its final release, the information about a contract is updated at successive dates in different files, which import.py stitches together. For more details see the About section.

The script also produces a number of CSV files containing subsets of the above feather/pickle files; these are used by the (extra) utility connectivity_stats.py.
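
Because the same record can appear several times in a resulting DataFrame, a common first step when exploring the output is to keep only the most recent version of each record. The following is a minimal sketch of that idea, not part of nihmporter itself. The column names record_id, source_date and cost are hypothetical placeholders, and the toy DataFrame stands in for one loaded with pd.read_pickle or pd.read_feather; check the actual columns in your files before adapting it.

```python
import pandas as pd


def latest_records(df, key="record_id", order="source_date"):
    """Keep one row per `key`: the one with the highest value in `order`.

    Both column names are hypothetical; replace them with the ones
    actually present in the DataFrames produced by import.py.
    """
    return (
        df.sort_values(order, kind="stable")        # oldest first
          .drop_duplicates(subset=key, keep="last")  # keep the newest row per key
          .reset_index(drop=True)
    )


# Toy data standing in for a DataFrame loaded from a feather/pickle file
records = pd.DataFrame({
    "record_id": [1, 2, 1],
    "source_date": ["2019-06-01", "2019-06-01", "2020-06-01"],
    "cost": [100, 50, 120],
})

deduplicated = latest_records(records)
print(deduplicated)
```

Whether keeping only the latest row is appropriate depends on the analysis; the intermediate versions may be of interest in their own right.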

Re-runs

If the script is re-run in the same directory, many existing files are reused rather than downloaded again. In particular, the program downloads a zip file only if it is not already present locally, or if the file of the same name on the server is more recent (in which case the local file is overwritten).
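
The rule above can be illustrated with a short sketch. This is not the code used by import.py: the function name needs_download is hypothetical, and it assumes the server's modification time is available as an HTTP Last-Modified style date string, which the description above does not confirm.

```python
import os
import tempfile
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_download(local_path: str, remote_last_modified: Optional[str]) -> bool:
    """Return True if the file is missing locally or the server copy is newer.

    `remote_last_modified` is assumed to be an HTTP-date string such as
    "Thu, 01 Jul 2021 00:00:00 GMT" (hypothetical input format).
    """
    if not os.path.exists(local_path):
        return True
    if remote_last_modified is None:
        # No timestamp from the server: keep the local copy (an assumption)
        return False
    remote_ts = parsedate_to_datetime(remote_last_modified).timestamp()
    return remote_ts > os.path.getmtime(local_path)


# Small demonstration using a temporary file instead of a real download
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.zip")

    # No local file yet, so a download is needed
    missing = needs_download(path, "Thu, 01 Jul 2021 00:00:00 GMT")

    # Create an empty local file and pretend it dates from September 2020
    open(path, "wb").close()
    os.utime(path, (1_600_000_000, 1_600_000_000))

    stale = needs_download(path, "Thu, 01 Jul 2021 00:00:00 GMT")  # server is newer
    fresh = needs_download(path, "Wed, 01 Jan 2020 00:00:00 GMT")  # server is older
```

Comparing timestamps avoids re-downloading unchanged archives, at the cost of trusting the dates reported by the server and the local filesystem.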

About

Tool to download NIH's ExPORTER Data Catalog

License: MIT License

Languages

Python 63.7%, Jupyter Notebook 35.5%, Shell 0.8%