inventory_2022 (Work in Progress)

This is a public repository of the code used for the Biodata Resource Inventory performed in 2022. This project is an effort by the Global Biodata Coalition to conduct a comprehensive inventory of the global infrastructure of biological data resources. A large portion of this effort is dedicated to being able to periodically update the inventory using the methods developed here.

To meet these goals, natural language processing (NLP) methods are applied to journal articles obtained from EuropePMC. First, articles are classified based on their titles and abstracts, to predict if they describe biodata resources. Then, of those articles that are predicted to describe biodata resources, named entity recognition (NER) is employed to predict the resource's name. Further metadata is gathered from the various fields obtained by querying EuropePMC. To aid in reproducibility and reuse, Snakemake pipelines were developed for automation of training, prediction, and updating the inventory.

Workflow overview

Data curation

The manual curation has already been performed, using the full corpus obtained by querying EuropePMC. Titles and abstracts from ~1600 randomly selected papers were used for manual classification. This created the classifier training set. For those papers that were deemed to represent a biodata resource during manual curation, named entities were manually extracted from titles and abstracts, such as the reource name, URL, and description. This created the NER model training set.

graph TD
    query(EuropePMC Query) --> corpus[(Full corpus)]
    corpus -- random subset--> manclass(Manual Classification);
    manclass -- Not Biodata Resource --> neg[Negative]
    manclass -- Biodata Resource --> pos[Positive]
    neg --> classtrain[(Classifier training set)]
    pos --> classtrain
    pos --> ner(Manual named entity extraction)
    ner -- Common Name --> nertrain[(NER training set)]
    ner -- Full Name --> nertrain

Classifier Training

The manually classified subset of the corpus is split into training, validation, and test (holdout) sets. Several pretrained BERT models are provided with the same training and validation data. The final classifier model is chosen based on the highest F1 score on the validation set. This is the classifier used in the final inventory. Final model performance is evaluated on the held-out test set.

graph TD
    classset[(Classifier training set)]
    classset --> split(Data Splitting)
    split --> train[(train)]
    split --> val[(val)]
    split --> test[(test)]
    subgraph Training
        train --> trainer
        val --> trainer
        models[Pretrained models] --> trainer(training and selection)
        trainer -- best model --> classifier{{Classifier}}
    end
    test ----> eval(Evaluation)
    classifier --> eval

NER Model training

The set of manually extracted named entities is split into training, validation, and test (holdout) sets. Several pretrained BERT models are provided with the same training and validation data. The final NER model is chosen based on the highest F1 score on the validation set. This is the NER model used in the final inventory. Final model performance is evaluated on the held-out test set.

graph TD
    nerset[(NER training set)]
    nerset --> split(Data Splitting)
    split --> train[(train)]
    split --> val[(val)]
    split --> test[(test)]
    subgraph Training
        train --> trainer
        val --> trainer
        models[Pretrained models] --> trainer(training and selection)
        trainer -- best model --> ner{{NER Model}}
    end
    test ----> eval(Evaluation)
    ner --> eval

Inventory

Once the classifier and NER models have been trained and selected, they are applied to the full corpus. Those papers that are classified as biodata resource by the trained classifier are passed to the trained NER model for extracting attributes of the resource such as resource name and description. Other scripts will be used to glean other information, such as resource URLs, authors, country of origin, etc.

graph TD
    corpus[(Full corpus)]
    corpus --> classifier{{Classifier}}
    classifier --> neg[Negative]
    classifier --> pos[Positive]
    pos --> ner{{NER Model}}
    pos --> regex(regex)
    pos --> scrape(APIs)
    ner -- names --> attr[Resource Information]
    regex -- URL --> attr
    scrape -- authors, country --> attr

Repository Structure

.
├── config/          # Workflow configuration files
├── data/            # Manual curation files and data splits
├── snakemake/       # Snakemake pipelines and rules
├── src/             # Python scripts
├── tests/           # pytest scripts
├── .gitignore
├── LICENSE
├── Makefile         # Make targets for easy running of steps
├── README.md
├── requirements.txt
├── running_pipeline.ipynb
└── updating_inventory.ipynb

Installation

There are several ways to install the dependencies for this workflow.

Pip

If installing with pip, ensure you have Python version 3.8. Older or newer versions may not work.

$ python3 --version
Python 3.8.12

Then you can install Python dependencies using pip.

A make command is available for installing dependencies.

$ make setup

Alternatively, to install them manually:

$ pip install -r requirements.txt

Then download punkt:

$ python3
>>> import nltk
>>> nltk.download('punkt')

Anaconda

To create the environment in your $HOME directory, run:

$ conda env create -f config/environment.yml
$ conda activate inventory_env

Or you can create the environment in this repository by running:

$ conda env create -f config/environment.yml -p ./env
$ conda activate ./env

Then download punkt:

$ python3
>>> import nltk
>>> nltk.download('punkt')

Running Tests

A full test suite is included to help ensure that everything is running as expected. To run the full test suite, run:

$ make test

Running the workflow

Dry run

To see what steps would be run in the workflow, a dry run can be run:

$ make dryrun_reproduction

Reproducing original results

To run the pipeline from a notebook in Colab, follow the steps in running_pipeline.ipynb.

Alternatively, to run the pipeline from the command-line, run:

$ make train_and_predict

If Make is unavailable, run

$ snakemake -s snakemake/train_predict.smk --configfile config/train_predict.yml -c1

The above commands run the Snakemake pipeline. If you wish to run the steps manually, see src/README.md.

Updating the inventory

Before running the automated pipelines, if there is not a file out/last_query_date/last_query_date.txt, it must first be created. In that file place the date at which you want the query to begin (should align with date of last query).

Note: There should only be one file matching each pattern out/classif_train_out/best/best_checkpt.txt and out/ner_train_out/best/best_checkpt.txt

To run the pipeline from a notebook in Colab, follow the steps in updating_inventory.ipynb. To run from the command line, follow these steps.

First, make sure that the trained classifier and NER models are present at out/classif_train_out/best/best_checkpt.txt and out/ner_train_out/best/best_checkpt.txt.

If you do not have trained models, and do not want to perform training, they can be downloaded with:

# Add code here for getting models!

Next, make sure that output from previous updates have been saved elsewhere, as the old results must be deleted.

To remove the outputs of previous run:

$ rm -rf out/new_query

Then the pipeline for updating results can be run:

$ make update_inventory

If Make is unavailable, run

$ snakemake -s snakemake/update_inventory.smk --configfile config/update_inventory.yml -c1

The above commands run the Snakemake pipeline. If you wish to run the steps manually, see src/README.md.

Adjusting configurations

The Snakemake pipelines are built such that they capture the workflow logic, while all configurations are stored separately. This makes it possible to adjust the workflows without changing source code or the Snakemake pipelines.

Configurations for reproducing original results are in config/train_predict.yml such as train/validation/split ratios and output directories. Configurations for updating the inventory are in config/update_inventory.yml.

Configurations regarding model training parameters are stored in config/models_info.tsv, such as number of epochs, and convenient model names as well as official HuggingFace model names.

The EuropePMC query string is stored in config/query.txt.

Authorship

Dr. Heidi Imker, Global Biodata Coalition
Kenneth Schackart, Global Biodata Coalition
Ana-Maria Istrate, Chan Zuckerberg Initiative

k8hertweck / inventory_2022