PubMed Health Disparity
Health disparity research studies health outcomes for people of disadvantaged identities and backgrounds. One issue, however, is that many researchers who study health disparities eventually move on to other areas of health research due to a lack of NIH funding. Additionally, many of the researchers who study health disparities are members of those disadvantaged communities themselves, which raises the possibility of discrimination in the funding selection process.
In this project, we want to see whether the data reflects this trend of researchers moving away from studying health disparities by analyzing the health disparity articles on PubMed. Our role in this project is to procure the data needed for that analysis. One way to retrieve these articles is to take articles matching the search term "disparity" on PubMed; however, this does not always return results relevant to health disparity and may include articles about other kinds of disparity. To get around this, we manually annotate the "relevance" of each article and train a model to predict whether or not a given article is about health disparity.
Some File Locations
Main folder:
sirius.bc.edu:/data/yangael/pubmed
Articles by relevant authors:
sirius.bc.edu:/data/yangael/pubmed/data/relevant_authors_articles/saved_articles
Procedure
1. Get PubMed articles matching the keywords "disparity", "inequity", and "inequality"
2. Manually annotate 2,000 relevant articles and 2,000 irrelevant articles
3. Train a model on our annotated data to predict the relevance of a given article
4. Retrieve all PubMed articles matching our three search terms
5. Run the model on these articles to predict their relevance
6. Get the authors of the articles predicted to be relevant
7. Search PubMed for all other articles by these authors
8. Run the model on all of these articles to predict their relevance
Overview
- Setup
- Datasets
- ML Classification Models
- BERT Classification Models
- Searching for Relevant Articles
- Submitting Jobs to the Cluster
Setup
First, we want to clone this repository on the cluster. Run the following commands in a terminal:
ssh -p 22022 BC_USERNAME@sirius.bc.edu
git clone https://github.com/yingyangle/pubmed.git
cd pubmed
To set up the conda environment for our scripts, run the following commands on the cluster from the main pubmed folder after cloning it. I named the environment tf, but you can name it something else if you want.
module load anaconda/3-2018.12-P3.7
conda create -n tf python=3.9.6
conda activate tf
Runtime: 2 min.
To install the necessary packages, make sure your conda environment is activated and run the following script.
pip install -r tools/requirements.txt
Runtime: 5 min.
You can use the following script to double check that all the package downloads went smoothly.
python tools/import.py
Runtime: 5 min.
Finally, we want to copy over the files that were too large to be uploaded to GitHub. I changed my folder permissions, so you should be able to access my files as long as you're in the prudhome user group, but please let me know if you have trouble accessing anything! Make sure to run the following commands from your cloned pubmed folder.
cp /data/yangael/pubmed/PubMed-and-PMC-w2v.bin .
cp /data/yangael/pubmed/results/predictions* results/
cp -r -n /data/yangael/pubmed/data/* data/
cp -r -n /data/yangael/pubmed/bertdata/* bertdata/
cp -r -n /data/yangael/pubmed/saved_models/* saved_models/
Runtime: ~2.5 hours
Any files that already exist in your directory won't be overwritten by the ones in mine. If you want your files to be overwritten by mine, you can remove the -n flag.
If you've already set up the environment and files, you can skip most of the previous steps and just make sure to activate the environment before running anything. Also make sure to include this line in your .pbs files.
conda activate tf
Datasets
First, here's an overview of our annotated datasets. All of our annotated data is located in the /data folder. You'll find our annotated data saved as two types of files:
- annotations*.csv - These files contain only the relevance annotations for each article, along with its PubMed ID and article title.
- article_info*.csv - These files contain all the metadata for each article, including PubMed ID, article title, abstract, authors, and publication date.
You'll also see numbers such as 1, 2, and 1+2 in the filenames of the files mentioned above. These indicate our different batches of annotations. Dataset 1 was annotated by Prof. Prud'hommeaux's colleagues. Dataset 2 was annotated by me (Christine). Dataset 1+2 combines all the annotations from Dataset 1 and Dataset 2.
Now let's look at the process for annotating articles.
Article retrieval
First, we need to retrieve a list of articles to annotate. To do this, we query PubMed using the following search terms: disparity, inequity, and inequality. Since our goal is just to annotate 2,000 examples of relevant articles and 2,000 examples of irrelevant articles, we can limit our search to the top 10,000 results for each search term for now. To do this, we can set the MAX_RESULTS variable to 10000.
python get_articles.py 'disparity'
python get_articles.py 'inequity'
python get_articles.py 'inequality'
The data for the retrieved articles will be saved as pubmed_articles_*.csv and authors_*.json.
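For reference, here is a minimal sketch of the kind of query get_articles.py performs, written with Biopython's Entrez module (the use of Biopython and the contact email shown are assumptions; the actual script may use a different PubMed client):

from Bio import Entrez

Entrez.email = "your_email@bc.edu"  # NCBI asks for a contact email (hypothetical address)
MAX_RESULTS = 10000

# Search PubMed for article IDs matching the search term
handle = Entrez.esearch(db="pubmed", term="disparity", retmax=MAX_RESULTS)
pmids = Entrez.read(handle)["IdList"]
handle.close()

# Fetch the metadata (title, abstract, authors, publication date) for those IDs
handle = Entrez.efetch(db="pubmed", id=",".join(pmids), retmode="xml")
records = Entrez.read(handle)
handle.close()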
Then, we can combine all the results from our search terms into one .csv file for easy annotating.
python combine_articles.py
This script combines all the pubmed_articles_*.csv files into a single file, to_annotate.csv, and combines all the authors_*.json files into authors.json. It also excludes any articles that have already been annotated in a given existing annotations file (e.g. annotations1.csv). This existing annotations file can be set with the EXISTING_ANNOTATIONS_FILE variable.
Annotation
Here are the labels we use in our annotation:
- 0 = irrelevant
- 1 = relevant
- 3 = unsure
To annotate the data, go through the articles in to_annotate.csv and mark each one with the appropriate label.
Formatting data
After annotating the data, we want to format it into annotations*.csv and article_info*.csv files.
python split_annotations_info.py
Then we can also combine the annotations from all annotation batches.
python combine_annotations.py
This will combine annotations1.csv and annotations2.csv into annotations1+2.csv, and do the same for the article_info*.csv files.
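Conceptually, this is just a concatenation and de-duplication step; here is a rough pandas sketch of the idea (the column name used for de-duplication is an assumption):

import pandas as pd

# Combine the two annotation batches into one file
df1 = pd.read_csv("data/annotations1.csv")
df2 = pd.read_csv("data/annotations2.csv")
combined = pd.concat([df1, df2]).drop_duplicates(subset="pmid")  # "pmid" column name is an assumption
combined.to_csv("data/annotations1+2.csv", index=False)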
ML Classification Models
After annotating our data, we can try training a model on our annotations to predict the relevance of a given article based on its title, its abstract, or both concatenated together. We can start by trying some classical machine learning algorithms.
The classification models we test include:
- Logistic Regression
- K Neighbors (k=3)
- Linear SVM
- RBF SVM
- Gaussian Naive Bayes
- Gaussian Process
- Decision Tree
- Random Forest
- Ada Boost
- MLP Neural Net
- Quadratic Discriminant Analysis (QDA)
We can use existing word2vec embeddings trained specifically on PubMed and PMC to convert our text input to vectors. These embeddings are saved in the root folder as PubMed-and-PMC-w2v.bin. Our input and output for the models will look something like this:
Input: w2v embedding of the article's title, abstract, or both
Output: whether or not the article is relevant to health disparities
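As a rough illustration of how the text gets vectorized, one might load the embeddings with gensim and average the word vectors of an article's text; the averaging strategy here is an assumption about how ml.py builds its features:

import numpy as np
from gensim.models import KeyedVectors

# Load the pretrained PubMed/PMC word2vec embeddings (binary word2vec format)
w2v = KeyedVectors.load_word2vec_format("PubMed-and-PMC-w2v.bin", binary=True)

def embed(text):
    # Average the vectors of all in-vocabulary tokens in the text
    vectors = [w2v[token] for token in text.lower().split() if token in w2v]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

x = embed("racial disparities in access to cardiac care")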
To train and evaluate our models, we can run the following script:
python ml.py INPUT_TYPE DATASET
Runtime: 1-7 min.
The evaluation results will be saved in PubMed_ML_Models.csv and graphed as results/w2v_classification*.png.
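Internally, each classifier is trained and scored on these embeddings in the usual scikit-learn way. A minimal sketch, building on the embed() helper from the word2vec sketch above (the column names and the use of 5-fold cross validation here are assumptions):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Embed each article title and score a classifier on the relevance labels
info = pd.read_csv("data/article_info1+2.csv")
labels = pd.read_csv("data/annotations1+2.csv")
X = np.vstack([embed(str(t)) for t in info["title"]])  # "title" column name is an assumption
y = labels["relevance"].values                         # "relevance" column name is an assumption

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("mean accuracy:", scores.mean())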
BERT Classification Models
The files for training and evaluating BERT classification models trained using keras are located in the root pubmed folder.
Prepare data for training a model
In order to run the scripts that train and evaluate BERT models, we first need to correctly format the data directory for the train and test data by running format_bert_data.py. This script takes each of the data/annotations*.csv and data/article_info*.csv files and formats them into the directory structure shown below. It will automatically create formatted datasets for all of the annotation datasets.
python format_bert_data.py DATASET_TYPE
Runtime: 2-15 min. (depending on the script arguments)
Available DATASET_TYPE options include (you can also run the script with no arguments to see a list of options):
- split - prepares data for an 80/20 train/test split
- cv - prepares data for 5-fold cross validation
- full - prepares data for using all data as training data
- unannotated - prepares unannotated data to be predicted
If you run the script with DATASET_TYPE='unannotated', you'll need to add the following two arguments, or run it on the command line rather than submitting a job so that you can be prompted to fill in these variables.
python format_bert_data.py 'unannotated' ARTICLE_INFO NICKNAME
where ARTICLE_INFO is the path of a file containing the article info for the articles you want to include in the dataset. The file must contain the PubMed ID, title, and abstract for each article. The NICKNAME variable is what you want to name this unannotated dataset (the result will look like bertdata/bertdata_NICKNAME).
The resulting data directory will look something like this:
/bertdata/bertdata_*
/train
/relevant
291.txt
...
/irrelevant
72.txt
...
/test
/relevant
6.txt
...
/irrelevant
103.txt
...
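As a rough sketch of what producing this layout involves, one article goes into one text file, sorted into relevant/irrelevant subfolders. The column names and the train-only portion shown here are assumptions; the real format_bert_data.py also handles the test split and the other dataset types.

import os
import pandas as pd

# Write each annotated article into train/{relevant,irrelevant}/N.txt
annotations = pd.read_csv("data/annotations1+2.csv")
info = pd.read_csv("data/article_info1+2.csv")
merged = annotations[["pmid", "relevance"]].merge(info, on="pmid")  # column names are assumptions

for i, row in merged.iterrows():
    label = "relevant" if row["relevance"] == 1 else "irrelevant"
    outdir = os.path.join("bertdata", "bertdata_1+2", "train", label)
    os.makedirs(outdir, exist_ok=True)
    with open(os.path.join(outdir, f"{i}.txt"), "w") as f:
        f.write(str(row["title"]))  # could also be the abstract or title+abstract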
Train and evaluate a model
Before running the following scripts, make sure you've created the correctly formatted dataset directories as described in the previous step.
In the example commands below, I use a number of placeholder variables which you can adjust to be what you want. Here's a summary of what most of the placeholder variables can be set to:
- DATASET = [1, 2, 1+2]
- INPUT_TYPE = [title, abstract, title+abstract]
- BERT_MODEL = [bert, smallbert, albert, electra, talkingheads, experts_pubmed]
To train a model using an 80/20 train test split:
python bert.py BERT_MODEL INPUT_TYPE DATASET
To train a model using 5-fold cross validation:
python bert_CV.py BERT_MODEL INPUT_TYPE DATASET NUM_FOLDS
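Roughly speaking, both scripts load the formatted folders with keras and fine-tune a TF Hub BERT encoder with a small classification head on top. Here is a condensed sketch of that standard setup; the exact model handles, architecture, and hyperparameters used in bert.py are assumptions:

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops the preprocessing layer needs

# Load the train/test folders created by format_bert_data.py
# (tf.keras.preprocessing.text_dataset_from_directory in older TF versions)
train_ds = tf.keras.utils.text_dataset_from_directory("bertdata/bertdata_1+2/train", batch_size=32)
test_ds = tf.keras.utils.text_dataset_from_directory("bertdata/bertdata_1+2/test", batch_size=32)

# TF Hub preprocessing + encoder (small BERT shown as an example)
preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1", trainable=True)

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
pooled = encoder(preprocess(text_input))["pooled_output"]
output = tf.keras.layers.Dense(1)(tf.keras.layers.Dropout(0.1)(pooled))
model = tf.keras.Model(text_input, output)

model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.BinaryAccuracy()])
model.fit(train_ds, validation_data=test_ds, epochs=3)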
Evaluate a model on a different dataset
Another way to evaluate how well our model might perform on a different set of data is to train it on one dataset and evaluate it on another. This is useful since the datasets we have are annotated by different people, so we can see how our model handles a slightly different set of data.
python bert_eval.py BERT_MODEL INPUT_TYPE TRAIN_DATASET TEST_DATASET
You can also run this script with no arguments to try all the different BERT_MODEL and INPUT_TYPE combinations with all the different TRAIN_DATASET and TEST_DATASET combinations.
python bert_eval.py
The results for this script will be saved in PubMed_BERT_Models_Eval.csv.
Using model to generate predictions
Once we've trained some models, we can also use one of the fine-tuned models saved in the /saved_models folder to predict the relevance of more articles. Before running this, make sure to create the unannotated bertdata/bertdata* folder as described above.
python bert_predict.py UNANNOTATED_DATASET_NICKNAME
Searching for Relevant Articles
After training and evaluating our models to find the best one, we want to move on to steps (4) through (8) of our procedure and use that model to find relevant authors and articles.
Getting relevant articles
To get a list of all relevant articles, first we want to retrieve all the articles on PubMed matching our search terms (disparity, inequity, inequality), setting the MAX_RESULTS limit to be very high (e.g. 1,000,000) so we can get as many articles as PubMed allows.
python get_articles.py "disparity"
python get_articles.py "inequity"
python get_articles.py "inequality"
This script will save the retrieved articles as data/articles_*.csv.
After retrieving these articles, we want to predict the relevance of each of these articles using the best model we trained.
python bert_predict.py BERT_MODEL INPUT_TYPE TRAIN_DATASET UNANNOTATED_DATASET_NICKNAME
The results will be saved as results/predictions_unannotated_*.csv.
Getting authors of relevant articles
After getting a list of relevant articles, we want to get a list of the authors of these relevant articles so that we can analyze the trajectory of their research.
python get_relevant_authors.py PREDICTIONS_FILE
where PREDICTIONS_FILE is the file containing the prediction results (e.g. 'results/predictions_unannotated_*.csv').
This script uses the author info in data/article_authors.json to get the info for each author, and gets the authors of the relevant articles in data/annotations1+2.csv and PREDICTIONS_FILE. The resulting list of relevant authors and their metadata will be saved in a subfolder as data/relevant_authors_articles/authors_relevant.json.
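Conceptually, the script filters the predictions down to the articles labeled relevant and then looks up those articles' authors. A simplified sketch; the prediction column names, the structure of the JSON file, and the example filename are all assumptions:

import json
import pandas as pd

# Keep only the articles the model predicted to be relevant
preds = pd.read_csv("results/predictions_unannotated_example.csv")  # hypothetical filename
relevant_ids = preds.loc[preds["prediction"] == 1, "pmid"]          # column names are assumptions

# Look up the authors of those articles in the saved author info
with open("data/article_authors.json") as f:
    article_authors = json.load(f)  # assumed to be keyed by PubMed ID

relevant_authors = {str(pmid): article_authors[str(pmid)]
                    for pmid in relevant_ids if str(pmid) in article_authors}

with open("data/relevant_authors_articles/authors_relevant.json", "w") as f:
    json.dump(relevant_authors, f, indent=2)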
Getting articles by relevant authors
Once we've gotten our list of relevant authors, we want to search for all other articles written by these authors.
python search_authors.py
This script will take the list of relevant authors in data/relevant_authors_articles/authors_relevant.json and save the article information for all articles written by each author. The article data will be saved in the folders data/relevant_authors_articles/saved_articles/json and data/relevant_authors_articles/saved_articles/csv, with both folders containing the same data in different file formats. Each folder contains a .csv or .json file for each author's articles.
The script will also keep a list of authors it has searched in data/relevant_authors_articles/authors_already_searched.json so that you can pick up where you left off if needed. There's also a list of articles that have already been saved in data/relevant_authors_articles/articles_already_saved.json, and a list of authors for which the article search failed in data/relevant_authors_articles/failed_authors.json, just for reference.
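For reference, searching PubMed by author can be done with the same kind of Entrez query shown earlier, restricted to the author field (a sketch assuming Biopython; search_authors.py may do this differently):

from Bio import Entrez

Entrez.email = "your_email@bc.edu"  # hypothetical contact email
# The [Author] field tag restricts the search to the author name
handle = Entrez.esearch(db="pubmed", term="Smith J[Author]", retmax=10000)
pmids = Entrez.read(handle)["IdList"]
handle.close()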
Predicting relevance of articles by relevant authors
Finally, once we have a full list of all the articles written by our relevant authors, we can run our best predictive model again to predict the relevance of all the articles we found. Before running this, make sure to create the unannotated bertdata/bertdata* folder as described above.
python bert_predict.py BERT_MODEL INPUT_TYPE TRAIN_DATASET UNANNOTATED_DATASET
The results will be saved as results/predictions_unannotated_*.csv.
Submitting Jobs to the Cluster
Submit individual jobs to the cluster
To submit a job to the cluster, you can edit the .pbs files in the pubmed folder for convenience (so you don't have to make a bunch of new ones). It's fine to submit multiple jobs to the queue using the same .pbs filename, even if the contents of the files are different.
The go.pbs and misc.pbs files contain example commands for running the different scripts, so you can uncomment whichever script you want to run and change the arguments to what you want. Just make sure to update the walltime and mem settings to be appropriate for the script you're running.
Also make sure to change the email in the second line of the .pbs file to your own email, so that you (instead of me) receive the notifications when the job starts and finishes. Or, you can delete the line entirely if you don't want any notifications.
Submit a bunch of jobs at once
To make it easier to submit a bunch of jobs to the cluster, I've included a script, tools/write_pbs.py, so that you can mass-submit jobs for a certain script, running through all the dataset and model combinations you want to try. Make sure to run this script from the main pubmed folder like the previous scripts.
python tools/write_pbs.py ACTION_TO_RUN
Available options for ACTION_TO_RUN include (you can also run the script with no arguments to see a list of options):
- bert_split - runs bert.py for all models, datasets, and input types
- bert_cv - runs bert_CV.py for all models, datasets, and input types
- ml_split - runs ml_models.py for all datasets and input types
Before running this script, you'll also want to make sure to update the walltime and mem settings to be what you want. You can do this by editing the tools/template.pbs file (e.g. nano tools/template.pbs). It's usually better to be safe than sorry, so try to set a walltime that you're pretty sure won't time out (although jobs requesting higher walltimes usually also take longer to get to the front of the queue).
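For illustration, mass submission boils down to writing out a filled-in .pbs file per job and handing each one to qsub. A hypothetical sketch of a single submission (the actual placeholder mechanism used by tools/template.pbs is not shown here):

import subprocess

# Write a small job script and submit it with qsub
job_script = """#!/bin/bash
#PBS -l walltime=12:00:00,mem=16gb
cd $PBS_O_WORKDIR
conda activate tf
python bert.py bert title 1+2
"""
with open("generated_job.pbs", "w") as f:
    f.write(job_script)
subprocess.run(["qsub", "generated_job.pbs"])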
Deleting multiple jobs
Okay, sometimes you might realize you accidentally submitted a bunch of jobs using the previous script and there was a typo somewhere. Instead of deleting all these jobs one by one, you can use the following script to delete a bunch of consecutive jobs at once:
python tools/delete_jobs.py FIRST_JOB_ID LAST_JOB_ID
Just pass in the job IDs of the first and last jobs you want to delete, and the script will delete those two and all the jobs with IDs in between.
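The script essentially just loops over the job ID range and calls qdel on each ID; a minimal sketch of that idea:

import subprocess
import sys

# Delete every job whose ID falls between the first and last ID (inclusive)
first, last = int(sys.argv[1]), int(sys.argv[2])
for job_id in range(first, last + 1):
    subprocess.run(["qdel", str(job_id)])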
Checking successful jobs
After the jobs complete, you'll get a bunch of output logs like *.pbs.e* and *.pbs.o*. To make it easier to check the success of these jobs, you can run the following script, which will print out the names of jobs that were unsuccessful:
python tools/check_jobs.py
Most of our scripts print out RUNTIME: #### at the very end, so this script just checks each *.pbs.o* file to see if it contains this line. Not all scripts print this line, though, so double-check that the script you're checking does.
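In other words, the check is just a scan of each output log for that final RUNTIME line; roughly:

import glob

# Print the names of output logs that never reached the final RUNTIME line
for log in glob.glob("*.pbs.o*"):
    with open(log, errors="ignore") as f:
        if "RUNTIME:" not in f.read():
            print("unsuccessful:", log)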