This repository contains implementations of various CRF-LSTM models. More details about these models can be found in the paper: Structured Prediction Models for RNN based Sequence Labeling in clinical text, EMNLP 2016.
git clone https://github.com/abhyudaynj/LSTM-CRF-models.git
The original code for the paper was written in Theano and NumPy. This is a slightly more optimized version of the original code, with all the main computations implemented entirely in Theano/Lasagne. It also has additional support for incorporating handcrafted features and UMLS Semantic Types. UMLS Semantic Types can be extracted using the MetaMap software, which requires a UMLS license. To run without UMLS, set the -umls option to 0.
This code is built on the excellent work done by the Theano and Lasagne folks. For other required packages, see the requirements file.
It is preferable to train the models on GPUs, in the interest of a manageable training time.
After installing the dependencies and cloning the repository, please use the following steps.
A toy sample dataset is provided in sample-data/, containing all possible types of input files. Files like 001 are raw text files; only raw text files are needed to run the trained models on your data. To train the tagger model, you also need annotation files like 001.json. Annotation files are JSON files containing a list of [start char offset, end char offset, annotated text, annotation type, annotation id] entries. Please take a look at the functions file_extractor and annotated_file_extractor in bionlp/preprocess/extract_data.py for more details. Files like 001.umls.json contain MetaMap annotations for each file; these files are needed for training and deployment when the umls option is set to 1. The model provided has this feature off, so you only need raw text files in your directory.
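For illustration, here is a hedged sketch of what an annotation file might contain. The span text, the "problem" label, and the "T1" id are made up for this example; the authoritative schema is whatever annotated_file_extractor in bionlp/preprocess/extract_data.py reads.

```python
import json

# Hypothetical raw text for a file named 001.
text = "Patient has a history of diabetes mellitus."

start = text.index("diabetes mellitus")
end = start + len("diabetes mellitus")

# Each annotation: [start char offset, end char offset, annotated text,
# annotation type, annotation id]. The type and id below are invented.
annotations = [[start, end, text[start:end], "problem", "T1"]]

# The companion annotation file for 001 is 001.json.
with open("001.json", "w") as f:
    json.dump(annotations, f)
```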
The tagger takes as input a file containing a list of all raw text files to be processed. To generate this file, use the following command:
python scripts/get_file_list.py -i [input directory] -o [file-list-output] -e -1
For options, use
python scripts/get_file_list.py -h
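The file list is assumed here to be a plain text file with one raw-text path per line; under that assumption, an equivalent list can also be produced directly in Python (a sketch for illustration, not the supported route — use scripts/get_file_list.py in practice):

```python
import os
import tempfile

# Build a throwaway input directory with two raw text files and one
# annotation file, mimicking the sample-data/ layout.
input_dir = tempfile.mkdtemp()
for name in ("001", "002", "002.json"):
    open(os.path.join(input_dir, name), "w").close()

# Collect raw text files only (annotation .json files are skipped).
names = [n for n in sorted(os.listdir(input_dir)) if not n.endswith(".json")]

# Write one path per line -- an assumed format for the tagger's -i option.
output_list = os.path.join(input_dir, "file-list-output")
with open(output_list, "w") as out:
    out.writelines(os.path.join(input_dir, n) + "\n" for n in names)
```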
The trained model file can be obtained at
wget http://www.bio-nlp.org/external_user_uploads/skip-crf-approx.pkl
UPDATE: The model file will be updated as various training runs for the deployment model finish. The latest model files will be posted at the same URL.
Use the following command to run the tagger on all the files in the file-list-output. The tagger will populate the output directory with JSON files containing the predicted annotations.
python deploy.py -i [file-list-output] -d [output-directory] -model [model-file]
For training the model, each "filename" in the input directory should have a corresponding "filename.json" in the same directory. Please take a look at the function annotated_file_extractor in bionlp/preprocess/extract_data.py; this is the function that extracts the raw text and annotations from your input files. You only need to provide annotations for the relevant labels; 'Outside' labels need not be provided, as the model automatically assigns the 'None' label to any token without an annotation. If you have set the 'umls' option on, you also need *.umls.json files.
To run the training, you also need a dependency.json file if you want to provide your own embeddings. A sample file is provided in dependency.json.sample. This file contains dependency paths. To initialize with your own embeddings, 'mdl' in dependency.json should point to the binary word-vector model generated by the Word2Vec software.
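A minimal dependency.json might look like the following. Only the 'mdl' entry is described above, and the path is a placeholder; consult dependency.json.sample for the authoritative layout.

```json
{
    "mdl": "/path/to/word2vec-vectors.bin"
}
```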
Once you have the input file list and the annotation .json files in place, and the dependencies are set, type the following to start training:
python train_crf_rnn.py -model [output-model-file] -i [file-list-output]
There are multiple parameters to tune. To check the options use
python train_crf_rnn.py -h
.
|-bionlp # Main directory containing the package
|-data # Contains the class definitions for the data format.
|-evaluate # Contains the Evaluation and the Postprocessing Script
|-modifiers # Contains functions to modify the data format, usually by adding feature vectors.
|-preprocess # Contains preprocess functions that extract raw text and annotation files.
|-utils # Misc utility functions.
|-taggers
|-rnn_feature # Contains the main tagger code.
|-networks # Contains the code for CRF-LSTM models. See provided README.md file for details on each model
|-scripts # Utility scripts.
|-sample-data # Directory illustrating the input data format. *.json files without the umls extension are annotation files, needed only for training or for evaluation during deployment. *.umls.json files are MetaMap annotation files.