This project provides the code for our IJCAI 2020 paper "A Label Attention Model for ICD Coding from Clinical Text".
The general architecture and experimental results can be found in our paper:
```
@inproceedings{ijcai2020-461-vu,
  title     = {A Label Attention Model for ICD Coding from Clinical Text},
  author    = {Vu, Thanh and Nguyen, Dat Quoc and Nguyen, Anthony},
  booktitle = {Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, {IJCAI-20}},
  pages     = {3335--3341},
  year      = {2020},
  month     = {7},
  note      = {Main track},
  doi       = {10.24963/ijcai.2020/461},
  url       = {https://doi.org/10.24963/ijcai.2020/461},
}
```
Please CITE our paper when this code is used to produce published results or incorporated into other software.
The code requires the following libraries:
- python>=3.6
- torch==1.4.0
- scikit-learn==0.23.1
- numpy==1.16.3
- scipy==1.2.1
- pandas==0.24.2
- tqdm==4.31.1
- nltk>=3.4.5
- psycopg2==2.7.7
- gensim==3.6.0
- transformers==2.11.0
Run `pip install -r requirements.txt` to install the required libraries.
Then start a `python3` shell and run `import nltk` followed by `nltk.download('punkt')` to download the tokenizer data used for tokenization. For example:
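```python
# One-off setup: download the NLTK "punkt" tokenizer models.
import nltk
nltk.download('punkt')
```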
MIMIC-III data (`data/mimicdata/mimic3`):
- The id files are from caml-mimic.
- Install the MIMIC-III database with PostgreSQL following these instructions.
- Generate the train/valid/test sets using `src/util/mimiciii_data_processing.py`. (Configure the connection to PostgreSQL at Line 139; a sketch of a typical connection is shown below.)
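The connection settings at Line 139 depend on your local install; with `psycopg2` (pinned in the requirements), a connection typically looks like the following sketch, where every credential is a placeholder for your own setup:

```python
import psycopg2

# Placeholder credentials: replace with your local MIMIC-III PostgreSQL setup.
conn = psycopg2.connect(
    dbname="mimic",      # the database created when installing MIMIC-III
    user="postgres",
    password="postgres",
    host="localhost",
    port=5432,
)
```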
MIMIC-II data (`data/mimicdata/mimic2`):
- Place the MIMIC-II file (MIMIC_RAW_DSUMS) in `data/mimicdata/mimic2`.
- Generate the train/valid/test sets using `src/util/mimicii_data_processing.py`.
Note that the code will generate three files (`train.csv`, `valid.csv`, and `test.csv`) for each experiment. Running the two generation scripts is sketched below.
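Assuming the scripts are run from the repository root with a standard Python interpreter (both are assumptions), generating the splits looks like:

```bash
# Generate the train/valid/test splits (paths assume the repository root).
python3 src/util/mimiciii_data_processing.py   # MIMIC-III; needs the PostgreSQL connection configured
python3 src/util/mimicii_data_processing.py    # MIMIC-II; needs MIMIC_RAW_DSUMS in data/mimicdata/mimic2
```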
Word embeddings (`data/embeddings`):
We used gensim to train the embeddings (a word2vec model) on the entire MIMIC-III discharge summary data. Our code also supports subword embeddings (fastText), which produce better performance (see `src/args_parser.py`).
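A minimal sketch of training such embeddings with gensim 3.6 follows; the corpus file and most hyperparameters are assumptions, while `sg=0` and 100 dimensions are inferred from the `word2vec_sg0_100.model` file name used in the example run below:

```python
from gensim.models import Word2Vec

# Assumed corpus: one pre-tokenized MIMIC-III discharge summary per line.
with open("data/mimicdata/discharge_summaries.txt") as f:
    sentences = [line.split() for line in f]

# CBOW (sg=0) with 100-dimensional vectors, matching word2vec_sg0_100.model.
model = Word2Vec(sentences, size=100, sg=0, window=5, min_count=5, workers=4)
model.save("data/embeddings/word2vec_sg0_100.model")
```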
The problem and associated configurations are defined in `configuration/config.json`. Note that there are three files in each data folder (`train.csv`, `valid.csv`, and `test.csv`).
There are common hyperparameters shared by all models as well as model-specific hyperparameters; see `src/args_parser.py` for details.
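Assuming `src/args_parser.py` uses the standard argparse library, the full list of options can be printed with the usual help flag:

```bash
python -m src.run --help
```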
Here is an example of using the framework on the MIMIC-III dataset (full codes) with hierarchical joint learning:
```bash
python -m src.run \
    --problem_name mimic-iii_2_full \
    --max_seq_length 4000 \
    --n_epoch 50 \
    --patience 5 \
    --batch_size 8 \
    --optimiser adamw \
    --lr 0.001 \
    --dropout 0.3 \
    --level_projection_size 128 \
    --main_metric micro_f1 \
    --embedding_mode word2vec \
    --embedding_file data/embeddings/word2vec_sg0_100.model \
    --attention_mode label \
    --d_a 512 \
    RNN \
    --rnn_model LSTM \
    --n_layers 1 \
    --bidirectional 1 \
    --hidden_size 512
```
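In this command, the arguments before `RNN` are the common hyperparameters; `RNN` appears to select the model-specific sub-parser (see `src/args_parser.py`), whose options (`--rnn_model`, `--n_layers`, `--bidirectional`, `--hidden_size`) follow it.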