BioALBERT

Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT

This repository provides the pre-trained BioALBERT models. BioALBERT is a biomedical language representation model trained on large domain-specific (biomedical) corpora and designed for biomedical text mining tasks. Please refer to our paper [https://arxiv.org/abs/2107.04374] for more details.

Download

We provide eight versions of pre-trained weights. Pre-training was based on the original ALBERT code, and training details are described in our paper (To be Published). Currently available versions of pre-trained weights are as follows:

BioALBERT-Base v1.0 (PubMed) - based on ALBERT-base Model
BioALBERT-Base v1.0 (PubMed + PMC) - based on ALBERT-base Model
BioALBERT-Base v1.0 (PubMed + MIMIC-III) - based on ALBERT-base Model
BioALBERT-Base v1.0 (PubMed + PMC + MIMIC-III) - based on ALBERT-base Model
BioALBERT-Large v1.1 (PubMed) - based on ALBERT-Large Model
BioALBERT-Large v1.1 (PubMed + PMC) - based on ALBERT-Large Model
BioALBERT-Large v1.1 (PubMed + MIMIC-III) - based on ALBERT-Large Model
BioALBERT-Large v1.1 (PubMed + PMC + MIMIC-III) - based on ALBERT-Large Model

Make sure to specify the version of the pre-trained weights used in your work.

Installation

The following sections describe how to install and fine-tune BioALBERT based on PyTorch (Python version <= 3.7).

To fine-tune BioALBERT, you first need to download the BioALBERT pre-trained weights. After downloading them, clone this repository and install its dependencies from requirements.txt as follows:

git clone https://github.com/usmaann/BioALBERT.git
cd BioALBERT; pip install -r requirements.txt

Note that this repository is based on the ALBERT repository by Google. See requirements.txt for other details.

Quick Links

Link	Detail
Paper	https://arxiv.org/abs/2107.04374 (BibTeX below)

@misc{naseem2021benchmarking,
  title={Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT},
  author={Usman Naseem and Adam G. Dunn and Matloob Khushi and Jinman Kim},
  year={2021},
  eprint={2107.04374},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Datasets

We provide a pre-processed version of benchmark datasets for each task as follows:

Named Entity Recognition (NER)

Relation Extraction (RE)

EU-ADR
GAD

Question Answering (BioASQ)

BioASQ 4b
BioASQ 5b
BioASQ 6b

Open each link and download the datasets you need. For the BioASQ datasets, please refer to the BioBERT repository.

Fine-tuning BioALBERT

After downloading one of the pre-trained weights, unzip it to any directory you want. We denote this directory as $BIOALBERT_DIR. For example, when using BioALBERT-Base v1.0 (PubMed), set the BIOALBERT_DIR environment variable as follows:

$ export BIOALBERT_DIR=./BioALBERT_PUBMED_BASE
$ echo $BIOALBERT_DIR
>>> ./BioALBERT_PUBMED_BASE
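Before starting a long fine-tuning run, it can help to confirm that the unzipped directory holds the files the commands in this README point to. The following is a minimal sketch, not part of the repository: the helper name missing_weight_files is hypothetical, the file names are taken from the fine-tuning commands below, and it assumes the checkpoint is stored as one or more files sharing the model.ckpt-1000000 prefix, which may differ between releases.

```python
import os
from pathlib import Path

# File names as they appear in the fine-tuning commands of this README.
REQUIRED_FILES = ("vocab.txt", "bert_config.json")
# Assumption: checkpoint files share this prefix (e.g. .index, .meta, .data-*).
CHECKPOINT_PREFIX = "model.ckpt-1000000"


def missing_weight_files(weights_dir):
    """Return the expected file names that are absent from weights_dir."""
    root = Path(weights_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Accept any file that starts with the checkpoint prefix.
    if not any(root.glob(CHECKPOINT_PREFIX + "*")):
        missing.append(CHECKPOINT_PREFIX + "*")
    return missing


if __name__ == "__main__":
    # Fall back to the example directory name if the variable is not set.
    weights_dir = os.environ.get("BIOALBERT_DIR", "./BioALBERT_PUBMED_BASE")
    problems = missing_weight_files(weights_dir)
    print("missing:", ", ".join(problems) if problems else "nothing")
```

An empty result means every expected file was found. Anything listed should be checked against the contents of the archive you downloaded.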

NER

Each dataset contains four files: dev.tsv, test.tsv, train_dev.tsv, and train.tsv. Download a dataset from the NER section under Datasets and put these files into a directory, denoted $NER_DIR. Also set $OUTPUT_DIR to a directory for the NER outputs. For example, when fine-tuning on the BC2GM dataset:

$ export NER_DIR=./datasets/NER/BC2GM
$ export OUTPUT_DIR=./NER_outputs
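As a quick sanity check on the downloaded data, the four split files named above can be located and their non-empty lines counted. This is only an illustrative sketch: the helper summarize_splits is hypothetical, and it makes no assumption about the column layout inside the .tsv files.

```python
from pathlib import Path

# The four files each NER dataset is described as containing.
NER_SPLITS = ("dev.tsv", "test.tsv", "train_dev.tsv", "train.tsv")


def summarize_splits(data_dir, splits=NER_SPLITS):
    """Map each split file name to its non-empty line count, or None if missing."""
    summary = {}
    for name in splits:
        path = Path(data_dir) / name
        if not path.is_file():
            summary[name] = None
            continue
        with path.open(encoding="utf-8") as handle:
            # Count only lines that contain something other than whitespace.
            summary[name] = sum(1 for line in handle if line.strip())
    return summary


if __name__ == "__main__":
    for name, count in summarize_splits("./datasets/NER/BC2GM").items():
        print(name, "MISSING" if count is None else count)
```

Passing a different tuple as splits (for example the three files of an RE dataset) reuses the same check for other tasks.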

The following command runs the fine-tuning code on NER with default arguments.

$ mkdir -p $OUTPUT_DIR
$ python run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOALBERT_DIR/vocab.txt --bert_config_file=$BIOALBERT_DIR/bert_config.json --init_checkpoint=$BIOALBERT_DIR/model.ckpt-1000000 --num_train_epochs=10.0 --data_dir=$NER_DIR --output_dir=$OUTPUT_DIR

RE

Each dataset contains three files: dev.tsv, test.tsv, and train.tsv. Let $RE_DIR denote the folder of a single RE dataset, $TASK_NAME the task name (two options: gad, euadr), and $OUTPUT_DIR the RE output directory. Taking GAD as an example:

$ export RE_DIR=./datasets/RE/GAD/1
$ export TASK_NAME=gad
$ export OUTPUT_DIR=./re_outputs_1

The following command runs the fine-tuning code on RE with default arguments.

$ python run_re.py --task_name=$TASK_NAME --do_train=true --do_eval=true --do_predict=true --vocab_file=$BIOALBERT_DIR/vocab.txt --bert_config_file=$BIOALBERT_DIR/bert_config.json --init_checkpoint=$BIOALBERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --do_lower_case=false --data_dir=$RE_DIR --output_dir=$OUTPUT_DIR
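The example paths end in a numbered subfolder (GAD/1, re_outputs_1), so the same command may need to be repeated for several subfolders. The sketch below assembles the run_re.py invocation above as an argument list so it can be generated per subfolder. The helper build_re_command is hypothetical, the flag values are copied from the command above, and the fold names "1" and "2" are placeholders: use the subdirectories actually present in your download.

```python
import shlex


def build_re_command(bioalbert_dir, re_dir, task_name, output_dir):
    """Return the run_re.py invocation from this README as an argument list."""
    flags = [
        ("task_name", task_name),
        ("do_train", "true"),
        ("do_eval", "true"),
        ("do_predict", "true"),
        ("vocab_file", f"{bioalbert_dir}/vocab.txt"),
        ("bert_config_file", f"{bioalbert_dir}/bert_config.json"),
        ("init_checkpoint", f"{bioalbert_dir}/model.ckpt-1000000"),
        ("max_seq_length", "128"),
        ("train_batch_size", "32"),
        ("learning_rate", "2e-5"),
        ("num_train_epochs", "3.0"),
        ("do_lower_case", "false"),
        ("data_dir", re_dir),
        ("output_dir", output_dir),
    ]
    return ["python", "run_re.py"] + [f"--{name}={value}" for name, value in flags]


# Hypothetical fold names; list the subdirectories you actually downloaded.
for fold in ("1", "2"):
    command = build_re_command(
        "./BioALBERT_PUBMED_BASE",
        f"./datasets/RE/GAD/{fold}",
        "gad",
        f"./re_outputs_{fold}",
    )
    # Print a copy-pastable shell line; pass `command` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list keeps each flag as a single argument, which avoids quoting problems if a directory name contains spaces.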

QA

Please refer to the BioBERT repository.

Citation

Contact Information

If you have any questions, please submit a GitHub issue or contact Usman Naseem (usman.naseem@sydney.edu.au).
