MedNLI Baseline

A simple baseline for Natural Language Inference in the clinical domain using the MedNLI dataset. Includes simplified CBOW and InferSent models from the corresponding paper.

Installation

Clone this repo: git clone https://github.com/jgc128/mednli_baseline.git
Install NumPy: pip install numpy==1.15.2
Install PyTorch v0.4.1: pip install http://download.pytorch.org/whl/cu92/torch-0.4.1-cp36-cp36m-linux_x86_64.whl (see https://pytorch.org/ for details)
Install requirements: pip install -r requirements.txt

Downloading the dataset, word embeddings, and pre-trained models

Create the ./data directory inside the cloned repository
1. Create the ./data/cache directory
Download MedNLI: https://jgc128.github.io/mednli/
1. Extract the content of the mednli_data.zip archive into the ./data/mednli dir (unzip -d data/mednli mednli_data.zip)
Download word embeddings (see the table below) and put the *.pickled files into the ./data/word_embeddings/ dir (wget -P data/word_embeddings/ https://mednli.blob.core.windows.net/shared/word_embeddings/mimic.fastText.no_clean.300d.pickled)
Download pre-trained models (see below) and put the *.pkl and the *.pt files into the ./data/models/ dir
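
The directory layout and the embeddings download from the steps above can be sketched in Python. This is a minimal sketch and not part of the repository: the helper names (make_data_dirs, embedding_url, download_embeddings) are hypothetical, and it assumes that every embeddings file in the table is served from the same base URL as the mimic file in the wget example.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the wget example above. Assumption: the other
# embeddings files in the table live under the same prefix.
EMBEDDINGS_BASE_URL = "https://mednli.blob.core.windows.net/shared/word_embeddings/"

# Sub-directories of ./data named in the steps above.
DATA_SUBDIRS = ("cache", "mednli", "word_embeddings", "models")


def make_data_dirs(repo_root="."):
    """Create ./data and its sub-directories; return the ./data path."""
    data_dir = Path(repo_root) / "data"
    for name in DATA_SUBDIRS:
        (data_dir / name).mkdir(parents=True, exist_ok=True)
    return data_dir


def embedding_url(filename):
    """Build the download URL for one *.pickled embeddings file."""
    return EMBEDDINGS_BASE_URL + filename


def download_embeddings(filename, repo_root="."):
    """Download one embeddings file into ./data/word_embeddings/ (needs network)."""
    target = make_data_dirs(repo_root) / "word_embeddings" / filename
    if not target.exists():  # do not download the same file twice
        urlretrieve(embedding_url(filename), str(target))
    return target
```

For example, download_embeddings("mimic.fastText.no_clean.300d.pickled") run from the repository root should have the same effect as the wget command above. The MedNLI archive and the pre-trained models still have to be downloaded by hand.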

Word embeddings

Word Embedding	Link
glove	glove.840B.300d.pickled
mimic	mimic.fastText.no_clean.300d.pickled
bio_asq	bio_asq.no_clean.300d.pickled
wiki_en	wiki_en.fastText.300d.pickled
wiki_en_mimic	wiki_en_mimic.fastText.no_clean.300d.pickled
glove_bio_asq	glove_bio_asq.no_clean.300d.pickled
glove_bio_asq_mimic	glove_bio_asq_mimic.no_clean.300d.pickled

Models

Model	Embeddings	MedNLI Dev accuracy	Files
CBOW	mimic	0.670	model spec / model weights
InferSent	glove	0.743	model spec / model weights
InferSent	mimic	0.783	model spec / model weights
InferSent	wiki_en	0.763	model spec / model weights
InferSent	wiki_en_mimic	0.774	model spec / model weights
InferSent	glove_bio_asq_mimic	0.770	model spec / model weights

Using a pre-trained model

Run the predict.py file with three arguments:

Path to the model specification file (*.pkl)
Input file, either in the jsonl format (see mli_dev_v1.jsonl) or as plain text with a \t-separated premise and hypothesis on each line (see test_input.txt)
Output .csv file in which to save the predicted probabilities of each of the three classes (contradiction, entailment, and neutral)

Notes:

The model weights file (*.pt) should be located in the same dir as the model specification file (*.pkl)
For the jsonl format, the sentences are taken from the sentence1_binary_parse and sentence2_binary_parse fields, where sentence1 is the premise and sentence2 is the hypothesis. All other fields are optional
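
To make the two input formats concrete, here is a minimal sketch of a reader that accepts either of them. It is not the code used by predict.py: the function names are hypothetical, and it assumes that the binary parse fields use SNLI-style bracketing with space-separated "(" and ")" tokens.

```python
import json


def tokens_from_binary_parse(parse):
    """Drop the brackets of a binary parse and keep the words.

    Assumption: brackets are separate, space-delimited tokens.
    """
    return [tok for tok in parse.split() if tok not in ("(", ")")]


def read_pairs(lines):
    """Yield (premise, hypothesis) pairs from jsonl or tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.lstrip().startswith("{"):
            # jsonl: sentence1 is the premise, sentence2 is the hypothesis
            record = json.loads(line)
            premise = " ".join(tokens_from_binary_parse(record["sentence1_binary_parse"]))
            hypothesis = " ".join(tokens_from_binary_parse(record["sentence2_binary_parse"]))
        else:
            # plain text: premise <TAB> hypothesis
            premise, hypothesis = line.split("\t", 1)
        yield premise, hypothesis
```

A tab-separated input file can therefore be written with one pair per line, e.g. f.write(premise + "\t" + hypothesis + "\n").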

Example command to run the prediction:

python predict.py data/models/mednli.infersent.mimic.128.saek2t5q.pkl data/input_test.txt data/predictions_test.csv
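
Once the predictions file is written, the most probable class for each input pair can be read back with a few lines of Python. This is a sketch under stated assumptions, since the exact CSV layout is not documented here: it assumes that the last three columns of each row hold the probabilities in the order contradiction, entailment, neutral, and it skips any row (such as a header) whose values are not numeric. The function name predicted_labels is hypothetical.

```python
import csv

# Assumed column order, following the class listing in this README.
LABELS = ("contradiction", "entailment", "neutral")


def predicted_labels(csv_path):
    """Return the most probable label for each row of a predictions file."""
    labels = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            try:
                # Use the last three columns in case an index column is present.
                probs = [float(value) for value in row[-3:]]
            except ValueError:
                continue  # non-numeric row, e.g. a header
            if len(probs) != 3:
                continue  # empty or malformed row
            best = max(range(3), key=lambda i: probs[i])
            labels.append(LABELS[best])
    return labels
```

Check the column order against a pair with a known label before relying on the result, e.g. predicted_labels("data/predictions_test.csv").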

Training the model

Run the train.py file. The options are set in the config.py file. Command-line interface is coming soon! By default, the model specification and the model weights are saved in the ./data/models dir.

Training the feature-based system

To run a traditional feature-based system, run the train_feature_based.py file. This system achieves 0.523 accuracy on the dev set using a gradient boosting classifier with features based on word overlaps, tf-idf similarities, word embedding similarities, and BLEU scores.
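
As an illustration of what a word-overlap feature can look like, here is a minimal sketch in plain Python. It is not the feature set implemented in train_feature_based.py: the function names, the whitespace tokenization, and the choice of Jaccard overlap and hypothesis coverage are all assumptions made for the example.

```python
def word_overlap(premise, hypothesis):
    """Jaccard overlap between the lowercased word sets of two sentences."""
    a = set(premise.lower().split())
    b = set(hypothesis.lower().split())
    if not a and not b:
        return 0.0  # avoid dividing by zero on two empty sentences
    return len(a & b) / len(a | b)


def hypothesis_coverage(premise, hypothesis):
    """Fraction of distinct hypothesis words that also occur in the premise."""
    a = set(premise.lower().split())
    b = set(hypothesis.lower().split())
    if not b:
        return 0.0
    return len(a & b) / len(b)


def overlap_features(premise, hypothesis):
    """Small feature vector that a classifier could take as input."""
    return [word_overlap(premise, hypothesis), hypothesis_coverage(premise, hypothesis)]
```

Feature vectors like these, computed for every premise and hypothesis pair, would be stacked into a matrix and passed to the classifier together with the other similarity scores.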

Reference

Romanov, A., & Shivade, C. (2018). Lessons from Natural Language Inference in the Clinical Domain. arXiv preprint arXiv:1808.06752.
https://arxiv.org/abs/1808.06752

@article{romanov2018lessons,
	title = {Lessons from Natural Language Inference in the Clinical Domain},
	url = {http://arxiv.org/abs/1808.06752},
	abstract = {State of the art models using deep neural networks have become very good in learning an accurate mapping from inputs to outputs. However, they still lack generalization capabilities in conditions that differ from the ones encountered during training. This is even more challenging in specialized, and knowledge intensive domains, where training data is limited. To address this gap, we introduce {MedNLI} - a dataset annotated by doctors, performing a natural language inference task ({NLI}), grounded in the medical history of patients. We present strategies to: 1) leverage transfer learning using datasets from the open domain, (e.g. {SNLI}) and 2) incorporate domain knowledge from external data and lexical sources (e.g. medical terminologies). Our results demonstrate performance gains using both strategies.},
	journaltitle = {{arXiv}:1808.06752 [cs]},
	author = {Romanov, Alexey and Shivade, Chaitanya},
	urldate = {2018-08-27},
	date = {2018-08-21},
	eprinttype = {arxiv},
	eprint = {1808.06752},
}
