# Repository for Publicly Available Clinical BERT Embeddings (NAACL Clinical NLP Workshop 2019)
The Clinical BERT models can be downloaded via:

```bash
wget -O pretrained_bert_tf.tar.gz https://www.dropbox.com/s/8armk04fu16algz/pretrained_bert_tf.tar.gz?dl=1
```
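After downloading, extract the archive. It should unpack into the model directories described below; the exact contents are an assumption based on the standard TensorFlow BERT checkpoint layout:

```bash
tar -xzf pretrained_bert_tf.tar.gz
# Each model directory should hold a standard TF BERT checkpoint:
# bert_config.json, vocab.txt, and model.ckpt-* files (layout assumed).
ls
```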
`biobert_pretrain_output_all_notes_150000` corresponds to Bio+Clinical BERT, and `biobert_pretrain_output_disch_100000` corresponds to Bio+Discharge Summary BERT. Both models are fine-tuned from BioBERT.
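The tarball contains TensorFlow checkpoints. As a convenience, the same models are also available on the HuggingFace model hub; the hub ID below is an assumption on our part, not something distributed through the download above, so treat this as a minimal sketch:

```bash
# Minimal sketch: load Bio+Clinical BERT through HuggingFace transformers.
# The hub ID emilyalsentzer/Bio_ClinicalBERT is an assumption, not part of
# this repo's download instructions.
pip install transformers torch
python - <<'EOF'
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
inputs = tokenizer("The patient was discharged on metoprolol.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
EOF
```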
To reproduce the steps necessary to fine-tune BERT or BioBERT on MIMIC data, follow these steps:

- Run `format_mimic_for_BERT.py`. Note that you'll need to change the file paths at the top of the file.
- Run `create_pretrain_data.sh`.
- Run `finetune_lm_tf.sh` (a sketch of typical invocations for these last two steps appears after the note below).
Note: See issue #4 for ways to improve the section-splitting code.
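For orientation, `create_pretrain_data.sh` and `finetune_lm_tf.sh` presumably wrap Google BERT's `create_pretraining_data.py` and `run_pretraining.py`. The sketch below shows typical invocations of those scripts; all paths and hyperparameter values are placeholder assumptions, not this repo's actual settings:

```bash
# Sketch only: typical Google-BERT pretraining-data and LM-finetuning steps.
# All paths and hyperparameters below are placeholders, not this repo's settings.
python create_pretraining_data.py \
  --input_file=/path/to/formatted_mimic_notes.txt \
  --output_file=/path/to/mimic_pretrain.tfrecord \
  --vocab_file=/path/to/biobert/vocab.txt \
  --do_lower_case=False \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5

python run_pretraining.py \
  --input_file=/path/to/mimic_pretrain.tfrecord \
  --output_dir=/path/to/finetuned_lm_output \
  --do_train=True \
  --bert_config_file=/path/to/biobert/bert_config.json \
  --init_checkpoint=/path/to/biobert/model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=150000 \
  --learning_rate=5e-5
```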
To see an example of how to use Clinical BERT for the MedNLI task, see the `run_classifier.sh` script in the `downstream_tasks` folder.
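For context, `run_classifier.sh` presumably points Google BERT's `run_classifier.py` at one of the downloaded checkpoints. The invocation below is a sketch: the `mednli` task name, checkpoint file names, and hyperparameters are all assumptions:

```bash
# Sketch only: fine-tuning a downloaded checkpoint on MedNLI with Google BERT's
# run_classifier.py. The mednli task name, checkpoint file names, and
# hyperparameters are assumptions.
MODEL_DIR=biobert_pretrain_output_all_notes_150000
python run_classifier.py \
  --task_name=mednli \
  --do_train=true \
  --do_eval=true \
  --data_dir=/path/to/mednli \
  --vocab_file=$MODEL_DIR/vocab.txt \
  --bert_config_file=$MODEL_DIR/bert_config.json \
  --init_checkpoint=$MODEL_DIR/model.ckpt-150000 \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/path/to/mednli_output
```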
Please post a GitHub issue or contact emilya@mit.edu if you have any questions.
Please cite our arXiv paper:

```bibtex
@article{alsentzer2019publicly,
  title={Publicly available clinical BERT embeddings},
  author={Alsentzer, Emily and Murphy, John R and Boag, Willie and Weng, Wei-Hung and Jin, Di and Naumann, Tristan and McDermott, Matthew},
  journal={arXiv preprint arXiv:1904.03323},
  year={2019}
}
```