9th-digit-contracting-limited / AI_MedicalNotes

Classifying medical notes into standard disease codes

August 2017

This repository contains the code I implemented to automatically classify EHR patient discharge notes into standard disease labels (ICD-9 codes). I implemented deep learning models (CNN, LSTM, and hierarchical models) using embeddings and attention layers. The CNN model with attention outperformed previous algorithms used in this task.
The dataset used for modeling was the MIMIC-III dataset.

The code was implemented in August 2017, during my graduate studies in the Master of Information and Data Science (MIDS) program at UC Berkeley, for the class W266: Natural Language Processing with Deep Learning.

This is the final project report: w266FinalReport_ICD_9_Classification.pdf

(note: code refactoring pending)

Preprocessing

Getting information from the database, pulling data, filtering, and joining tables: Pre processing
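The queries and joins themselves live in the preprocessing notebook; the sketch below only illustrates the general shape of this step, assuming a local MIMIC-III Postgres instance (the connection string, column selection, and filters are illustrative, not the notebook's exact queries).

```python
# Illustrative sketch of pulling discharge notes and their ICD-9 codes from a
# local MIMIC-III database and joining them into one multi-label dataset.
# Connection details and filters are assumptions, not the notebook's exact query.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mimic")  # hypothetical

# Discharge summaries from NOTEEVENTS
notes = pd.read_sql(
    "SELECT subject_id, hadm_id, text FROM noteevents "
    "WHERE category = 'Discharge summary'",
    engine,
)

# ICD-9 diagnosis codes assigned to each hospital admission
diagnoses = pd.read_sql("SELECT hadm_id, icd9_code FROM diagnoses_icd", engine)

# One row per admission: the note text plus its list of ICD-9 codes (multi-label)
labels = diagnoses.groupby("hadm_id")["icd9_code"].apply(list).reset_index()
dataset = notes.merge(labels, on="hadm_id", how="inner")
```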

Main Notebooks

Classification into top level codes in the ICD-9 hierarchy with 5K records

| Model | ICD-9 code level | N. Records | Epochs | Notebook |
| --- | --- | --- | --- | --- |
| Baseline | First-Level | 5K | - | pipeline/icd9_lstm_cnn_workbook.ipynb, section "Super Basic Baseline with top 4": always predicts the top 4 ICD-9 codes. F1-score = 52.6 |
| CNN Replication | First-Level | 5K | 20 | pipeline/icd9_lstm_cnn_workbook.ipynb, section "CNN running with 20 epochs": CNN model to replicate results from the paper "Comparing Rule-Based and Deep Learning Models for Patient Phenotyping". To compare F1 performance, I took into consideration the dataset size and number of classes. F1-score = 76.2 |
| CNN | First-Level | 5K | 5 | pipeline/icd9_lstm_cnn_workbook.ipynb, section "CNN running with 5 epochs": runs with the 17 first-level ICD-9 codes, using 5 epochs and embeddings (a minimal model sketch follows this table). F1-score = 69.1 |
| LSTM | First-Level | 5K | 5 | pipeline/icd9_lstm_cnn_workbook.ipynb, section "Basic LSTM": runs with the 17 first-level ICD-9 codes, using 5 epochs and embeddings. F1-score = 64.6 |

Attention
The average length of a discharge clinical note is 1,639 words. The text to classify may be too long for an LSTM or CNN to retain all relevant information. Raffel et al. (2016) showed better performance on several NLP tasks involving long text by using attention. Here, we seek to emulate those results by implementing algorithms based on the formulas presented in Raffel et al. (2016) and Yang et al. (2016).
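As a rough illustration of that mechanism, the sketch below follows the feed-forward attention formulas from Raffel et al. (2016): a score per time step, a softmax over the sequence, and a weighted sum of the hidden states. The function name, parameter shapes, and the use of plain TensorFlow ops are illustrative; the project's actual layer is pipeline/attention_util.py.

```python
# Sketch of feed-forward attention (Raffel et al., 2016):
#   e_t = tanh(h_t . w + b),  alpha = softmax(e),  c = sum_t alpha_t * h_t
# Shapes and the helper name are illustrative; see pipeline/attention_util.py.
import tensorflow as tf

def feed_forward_attention(hidden_states, w, b):
    """hidden_states: (batch, time, dim); w: (dim, 1); b: (1,). Returns (batch, dim)."""
    scores = tf.tanh(tf.tensordot(hidden_states, w, axes=[[2], [0]]) + b)  # (batch, time, 1)
    weights = tf.nn.softmax(scores, axis=1)                 # attention weight per word position
    return tf.reduce_sum(weights * hidden_states, axis=1)   # weighted summary of the sequence
```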

| Model | ICD-9 code level | N. Records | Epochs | Notebook |
| --- | --- | --- | --- | --- |
| LSTM with Attention | First-Level | 5K | 5 | pipeline/icd9_lstm_cnn_workbook.ipynb, section "LSTM with Attention". F1-score = 67.0 |
| CNN with Attention | First-Level | 5K | 5 | pipeline/icd9_cnn_att_workbook.ipynb. F1-score = 72.8 |
| Hierarchical LSTM Attention | First-Level | 5K | 5 | pipeline/icd9_hatt_workbook.ipynb. Implemented based on Yang et al. (2016), which specifically targets document classification. It has two levels of attention: the first creates a vector for each sentence by attending over its words, and the second creates a document vector by attending over the sentence vectors (see the sketch after this table). F1-score = 67.6 |
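To make the two-level structure concrete, here is a compact sketch in the spirit of Yang et al. (2016): a word-level encoder with attention produces one vector per sentence, and a sentence-level encoder with attention produces the document vector. Layer sizes, sentence/word limits, and the Attention layer itself are illustrative; the project's implementation is pipeline/hatt_model.py.

```python
# Sketch of a hierarchical attention network (Yang et al., 2016): word-level
# attention builds sentence vectors, sentence-level attention builds the document
# vector. Dimensions are illustrative; the real model is pipeline/hatt_model.py.
import tensorflow as tf
from tensorflow.keras import layers, models

class Attention(layers.Layer):
    """Feed-forward attention: collapses (batch, time, dim) into (batch, dim)."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(dim, 1), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(1,), initializer="zeros")

    def call(self, h):
        scores = tf.tanh(tf.tensordot(h, self.w, axes=[[2], [0]]) + self.b)
        weights = tf.nn.softmax(scores, axis=1)
        return tf.reduce_sum(weights * h, axis=1)

MAX_SENTS, MAX_WORDS, VOCAB, CLASSES = 30, 50, 40_000, 17  # illustrative limits

# Word level: attention over the words of a single sentence -> sentence vector
sent_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
x = layers.Embedding(VOCAB, 100)(sent_in)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
sentence_encoder = models.Model(sent_in, Attention()(x))

# Sentence level: attention over sentence vectors -> document vector -> ICD-9 probabilities
doc_in = layers.Input(shape=(MAX_SENTS, MAX_WORDS), dtype="int32")
s = layers.TimeDistributed(sentence_encoder)(doc_in)
s = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(s)
out = layers.Dense(CLASSES, activation="sigmoid")(Attention()(s))
han = models.Model(doc_in, out)
```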

Classification into the most common ICD-9 codes at the bottom of the ICD-9 hierarchy (leaves)

| Model | ICD-9 code level | N. Records | Epochs | Notebook |
| --- | --- | --- | --- | --- |
| Baseline | First-Level | 46K and 5K | - | baseline/mimic_icd9_baseline.ipynb. Some initial exploration with Python and SQL. Basic baseline model: a fixed prediction corresponding to the top 4 ICD-9 codes, on 46K records. NN baseline model: a (non-recurrent) neural network with one hidden layer, ReLU activation on the hidden layer, and sigmoid activation on the output layer, trained with cross-entropy loss (the loss function for multi-label classification, using TensorFlow) on 5K records; a sketch follows this table. F1-score = 35 |
| CNN for top 20 leaf ICD-9 codes | Leaf | 46K | 7 | icd9_cnn/cnn_top20_leave.ipynb. Classifies clinical notes into the 20 most common ICD-9 codes at the bottom of the ICD-9 hierarchy (leaves); this run was for comparison with previous work. F1-score = 72.4 |
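The NN baseline in the table is simple enough to sketch directly; the input feature size and hidden width below are assumptions, and the actual setup is in baseline/mimic_icd9_baseline.ipynb.

```python
# Sketch of the non-recurrent NN baseline: one ReLU hidden layer, sigmoid outputs,
# cross-entropy loss for multi-label targets. Sizes are illustrative assumptions.
from tensorflow.keras import layers, models

INPUT_DIM = 20_000   # e.g. vectorized note features (assumed)
NUM_CLASSES = 17     # first-level ICD-9 categories

baseline = models.Sequential([
    layers.Input(shape=(INPUT_DIM,)),
    layers.Dense(256, activation="relu"),             # single hidden layer
    layers.Dense(NUM_CLASSES, activation="sigmoid"),  # independent probability per code
])
baseline.compile(optimizer="adam", loss="binary_crossentropy")
```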

Classification into top level codes in the ICD-9 hierarchy with 52.6K records

| Model | ICD-9 code level | N. Records | Epochs | Notebook |
| --- | --- | --- | --- | --- |
| CNN | First-Level | 52.6K | - | pipeline/icd9_cnn_50K_run.ipynb. F1-score = 79.7 |
| CNN with Attention | First-Level | 52.6K | - | pipeline/icd9_cnn_att_50K_records.ipynb. F1-score = 78.2. At this stage the CNN-with-attention model still overfits: even though it had the highest score during the experimental runs with 5K records and 5 epochs, it did not reach the best F1-score when run on the full dataset. Further work would explore hyper-parameter tuning and reducing the number of parameters to mitigate the overfitting. |

Model Python modules

| Model | Python module |
| --- | --- |
| LSTM | pipeline/lstm_model.py |
| CNN | pipeline/icd9_cnn_model.py |
| Attention Layer | pipeline/attention_util.py |
| LSTM_ATT | pipeline/icd9_lstm_att_model.py |
| CNN_ATT | pipeline/icd9_cnn_att.py |
| Hierarchical LSTM Attention | pipeline/hatt_model.py |

Helper classes for Preprocessing

| Helper | Python module |
| --- | --- |
| Filters clinical notes to keep those that have been assigned the top N most common ICD-9 codes (a multi-label target), removing from each label any code that is not in the top N | pipeline/database_selection.py |
| Three main methods: (1) splits the input file into training, validation, and test sets; (2) replaces a leaf ICD-9 code with its grandparent at the first level; (3) calculates and displays F1 scores for a set of possible thresholds (a sketch follows this table) | pipeline/helpers.py |
| Functions needed to vectorize the ICD labels and text inputs (I did not implement this module; it is listed here because it is used by the notebooks I implemented) | pipeline/vectorization.py |
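As an illustration of the threshold-sweeping evaluation mentioned for pipeline/helpers.py, the sketch below computes micro-averaged F1 at several candidate thresholds using scikit-learn; the function name, threshold grid, and averaging choice are assumptions.

```python
# Sketch of sweeping decision thresholds for multi-label F1 (as done in
# pipeline/helpers.py); the thresholds and micro averaging here are illustrative.
from sklearn.metrics import f1_score

def f1_by_threshold(y_true, y_prob, thresholds=(0.2, 0.3, 0.4, 0.5)):
    """y_true: (n_samples, n_labels) 0/1 matrix; y_prob: predicted probabilities."""
    results = {}
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)   # turn probabilities into label decisions
        results[t] = f1_score(y_true, y_pred, average="micro")
    return results
```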
