Create a basic local Medline instance for high-throughput informatics
NCBI asks that E-utilities requests be limited to 3 per second, which makes some high-throughput analyses of PubMed data challenging. It is not difficult, however, to create your own local version of PubMed/Medline built on Elasticsearch (a convenient REST interface and management system layered on top of the Lucene indexing and search library).
You will need a fairly robust server to index these files with Elasticsearch. Once they are indexed, you can move to a smaller machine for queries (though I have not experimented enough to determine the minimum requirements for query-only use). The r4.8xlarge on AWS is a great choice: it has enough RAM to commit 32 GB to the JVM running Elasticsearch (the maximum recommended) while leaving plenty for pre-processing incoming data files, and its 32 vCPUs let you download, pre-process, and index a couple dozen data files simultaneously. On this platform the whole process takes only a few hours, which at spot-instance pricing will run you less than $3. To make it easy to switch servers after indexing, I keep the incoming data files and the indexes on separate EBS volumes that can be detached and either deleted (in the case of the incoming data files) or attached to a different server (in the case of the index files).
You will need a recent version of the Java Runtime Environment (OpenJDK's JRE or Oracle's JRE) and Python to follow the steps outlined below. In a Debian-type environment, these requirements can be met like so:
sudo apt-get update
sudo apt-get install python
sudo apt-get install python-pip
sudo apt-get install openjdk-8-jre-headless
sudo pip install elasticsearch
You will also need about 200 GB of free disk space to hold the Medline data files for indexing. However, these XML files hold a lot of information we will not retain, so the final Elasticsearch index is under 30 GB.
Finally, you will need Elasticsearch itself, downloaded from elastic.co and running locally before you index anything.
First, generate a file list for wget. You could do this by cut and paste in your favorite editor, but it is also easy to accomplish from the command line:
wget ftp.ncbi.nlm.nih.gov/pubmed/baseline
#updated from original
grep -o -P 'pubmed.*?\.gz' baseline | uniq | sed -e 's/^/ftp:\/\/ftp.ncbi.nlm.nih.gov\/pubmed\/baseline\//' > filelist
Then fetch the files:
wget -i filelist
gunzip *.gz
Or, since you have multiple cores (of course you do), you can speed things up with xargs
(here running 10 parallel jobs):
cat filelist | xargs -n 1 -P 10 wget
ls *.gz | xargs -n 1 -P 10 gunzip
Set explicit field mappings (to handle DOI and other identifier fields) before loading:
python create_index.py
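For reference, here is a minimal sketch of the kind of thing a script like create_index.py might do, assuming a recent Elasticsearch (7.x or later, where mapping types are gone) and the official Python client. The field names are illustrative assumptions, not a copy of the actual script; adjust them to match whatever your parse script emits:

```python
# Hypothetical mapping for a 'medline' index: identifiers (pmid, doi,
# article_type) become exact-match keyword fields, while title and
# abstract get full-text analysis with the standard analyzer.
MAPPING = {
    "mappings": {
        "properties": {
            "pmid": {"type": "keyword"},
            "doi": {"type": "keyword"},
            "article_type": {"type": "keyword"},
            "title": {"type": "text", "analyzer": "standard"},
            "abstract": {"type": "text", "analyzer": "standard"},
        }
    }
}

if __name__ == "__main__":
    # Requires a local Elasticsearch node on the default port.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])
    es.indices.create(index="medline", body=MAPPING)
```

Keyword fields matter here: with a text mapping, a PMID or DOI would be tokenized and exact lookups would behave unpredictably.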
Now all we need to do is load and index the data into Elasticsearch. Assuming Elasticsearch is running on your local machine on port 9200, it's as easy as:
ls data/base/*.xml | xargs -n 1 -P 10 python parse_bulk.py
This creates a 'medline' index on Elasticsearch, loads the records, and indexes them. It will take a couple of hours. Check the logfile created (parse.log) to make sure everything went OK. In particular, review the output of
grep root parse.log
to make sure 30,000 records were imported from each XML file (except the last one, which will have fewer). You can get an exact count of the records uploaded like so:
grep root parse.log | cut -f 3 -d' ' | paste -sd+ | bc
It should be around 26 million.
Note that the script parse_bulk.py is pretty basic. It extracts just the PMID, title, article type, and abstract from each record in the XML files, and then indexes them with the standard analyzer. Some cleanup of the abstract is performed to remove extraneous XML tags.
This seems to be a reasonable approach (the standard analyzer is well suited to English-language text and lower-cases everything for case-insensitive searching). However, you might want to import more fields and/or be more sophisticated with your indexing. Hopefully the script is a useful starting point.
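The core of such a script can be sketched as follows. This is a simplified assumption about parse_bulk.py's internals rather than the script itself: it pulls the PMID, title, and abstract out of each PubmedArticle (the element paths follow the PubMed baseline XML layout) and streams bulk actions to a local Elasticsearch:

```python
import sys
import xml.etree.ElementTree as ET


def extract_records(xml_path):
    """Yield one dict per PubmedArticle in a baseline/update XML file."""
    tree = ET.parse(xml_path)
    for article in tree.getroot().iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        abstract = citation.find(".//Abstract/AbstractText")
        yield {
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext(".//ArticleTitle"),
            # itertext() flattens any inline XML tags inside the abstract
            "abstract": "".join(abstract.itertext()) if abstract is not None else None,
        }


if __name__ == "__main__":
    # Requires the elasticsearch client and a local node.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://localhost:9200"])
    actions = (
        {"_index": "medline", "_id": rec["pmid"], "_source": rec}
        for rec in extract_records(sys.argv[1])
    )
    helpers.bulk(es, actions)
```

Using the PMID as the document _id means re-running the loader (for example, over the daily update files) overwrites records instead of duplicating them.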
Each day, additional articles are added to PubMed. Between annual baseline releases, these new records are published as daily update files. We will want to add these to have a complete database (and, ideally, repeat this process periodically, or just prior to any significant analysis of the Medline data). The workflow is almost exactly as above, just targeting files in a different directory on the NCBI FTP server. And since these files are very small, they should download and index quickly.
mkdir daily
cd daily
wget ftp.ncbi.nlm.nih.gov/pubmed/updatefiles
grep -o -P 'pubmed.*?\.gz' updatefiles | uniq | sed -e 's/^/ftp:\/\/ftp.ncbi.nlm.nih.gov\/pubmed\/updatefiles\//' > filelist
cat filelist | xargs -n 1 -P 8 wget
ls *.gz | xargs -n 1 -P 8 gunzip
ls data/daily/*.xml | xargs -n 1 -P 10 python parse_bulk.py
With the index built, you can query it directly. By PMID:
curl -XGET 'localhost:9200/medline/_search?pretty' -H 'Content-Type: application/json' -d '{"size":1,"query":{"bool":{"filter":[{"term":{"pmid":"123"}}]}}}';
By keyword:
curl -XGET 'localhost:9200/medline/_search?pretty' -H 'Content-Type: application/json' -d '{"size":10,"query":{"term":{"abstract":"pcsk9"}}}';
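If you prefer to query from Python, the same two requests look like this; the query bodies mirror the curl examples above, and the search call assumes the elasticsearch client and a local node:

```python
# Same lookups as the curl examples: an exact-match filter on pmid,
# and a term query against the analyzed abstract field.
PMID_QUERY = {"size": 1, "query": {"bool": {"filter": [{"term": {"pmid": "123"}}]}}}
KEYWORD_QUERY = {"size": 10, "query": {"term": {"abstract": "pcsk9"}}}

if __name__ == "__main__":
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])
    hits = es.search(index="medline", body=PMID_QUERY)["hits"]["hits"]
    print(hits[0]["_source"] if hits else "no match")
```

Note that the term query on abstract matches single analyzed tokens (hence lower-case "pcsk9"); for multi-word searches you would use a match query instead.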
To generate n-grams from the abstracts, first set up NLTK:
pip install -r requirements.txt
Then download the required NLTK data:
python -m nltk.downloader punkt stopwords averaged_perceptron_tagger wordnet
Timing for a single file:
time python3 parse_bulk_ngrams.py /data/medline/daily/pubmed19n1074.xml
real 5m27.050s
user 5m24.217s
sys 0m10.400s
Run the extraction across all files:
ls data/daily/*.xml | xargs -n 1 -P 40 python3 parse_bulk_ngrams.py
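For orientation, here is a rough sketch of the n-gram step. This is an assumption about what parse_bulk_ngrams.py does internally (tokenize each abstract with NLTK, drop stopwords, emit word n-grams), not a copy of it; the helper itself is plain Python:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    # Uses the punkt and stopwords data downloaded above.
    from nltk import word_tokenize
    from nltk.corpus import stopwords

    stop = set(stopwords.words("english"))
    text = "PCSK9 inhibitors lower LDL cholesterol"
    tokens = [t.lower() for t in word_tokenize(text) if t.lower() not in stop]
    print(ngrams(tokens, 2))
```

The per-file runtimes above suggest most of the cost is in tokenization and tagging, which is why running many files in parallel with xargs pays off.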
Index the n-grams (this took about 2 days):
time ls data/ngrams/*.txt.gz | xargs -n 1 -P 20 python index_bulk_ngrams.py > index-ngrams.log 2>index-ngrams.err
real 3076m23.424s
user 5716m56.571s
sys 417m25.828s