usmaann / BioALBERT

Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT



Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT

This repository provides pre-trained BioALBERT models, biomedical language representation models trained on large domain-specific (biomedical) corpora and designed for biomedical text mining tasks. Please refer to our paper for more details.


We provide eight versions of pre-trained weights. Pre-training was based on the original ALBERT code, and training details are described in our paper (To be Published). Currently available versions of pre-trained weights are as follows:

  1. BioALBERT-Base v1.0 (PubMed) - based on ALBERT-base Model

  2. BioALBERT-Base v1.0 (PubMed + PMC) - based on ALBERT-base Model

  3. BioALBERT-Base v1.0 (PubMed + MIMIC-III) - based on ALBERT-base Model

  4. BioALBERT-Base v1.0 (PubMed + PMC + MIMIC-III) - based on ALBERT-base Model

  5. BioALBERT-Large v1.1 (PubMed) - based on ALBERT-Large Model

  6. BioALBERT-Large v1.1 (PubMed + PMC) - based on ALBERT-Large Model

  7. BioALBERT-Large v1.1 (PubMed + MIMIC-III) - based on ALBERT-Large Model

  8. BioALBERT-Large v1.1 (PubMed + PMC + MIMIC-III) - based on ALBERT-Large Model

Make sure to specify the version of the pre-trained weights used in your work.


The following sections introduce the installation and fine-tuning process of BioALBERT based on TensorFlow (Python version <= 3.7).

To fine-tune BioALBERT, you need to download the BioALBERT pre-trained weights. After downloading them, install BioALBERT using requirements.txt as follows:

git clone
cd BioALBERT; pip install -r requirements.txt

Note that this repository is based on the ALBERT repository by Google. See requirements.txt for other details.

Quick Links

Paper: "Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT" by Usman Naseem, Adam G. Dunn, Matloob Khushi, and Jinman Kim.


We provide a pre-processed version of benchmark datasets for each task as follows:

Named Entity Recognition (NER)

Relation Extraction (RE)

Question Answering (BioASQ)

  • BioASQ 4b
  • BioASQ 5b
  • BioASQ 6b

Open each link and download the datasets you need. For the BioASQ datasets, please refer to the BioBERT repository.

Fine-tuning BioALBERT

After downloading one set of pre-trained weights, unzip it to any directory you want; we will denote it as $BIOALBERT_DIR. For example, when using BioALBERT-Base v1.0 (PubMed), set the BIOALBERT_DIR environment variable to the directory that holds the unzipped weights.



Each dataset contains four files: dev.tsv, test.tsv, train_dev.tsv, and train.tsv. Download a dataset from NER and put these files into a directory called $NER_DIR. Also, set $OUTPUT_DIR as the directory for NER outputs. For example, when fine-tuning on the BC2GM dataset:

$ export NER_DIR=./datasets/NER/BC2GM
$ export OUTPUT_DIR=./NER_outputs
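
Before launching a run, it can help to confirm that the directory really holds the four files listed above. The following is a minimal sketch, not part of this repository: `check_dataset_files` is a hypothetical helper name, and the fallback path simply repeats the BC2GM example.

```shell
# Hypothetical helper (not part of the repository): report which of the
# expected files are missing from a dataset directory.
# Usage: check_dataset_files DIR FILE [FILE ...]
check_dataset_files() {
  dir="$1"
  shift
  missing=0
  for name in "$@"; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $dir/$name"
      missing=1
    fi
  done
  # Returns 0 when every file is present, 1 otherwise.
  return "$missing"
}

# Check the NER layout described above; falls back to the BC2GM example path
# when NER_DIR is not set.
check_dataset_files "${NER_DIR:-./datasets/NER/BC2GM}" \
  dev.tsv test.tsv train_dev.tsv train.tsv \
  || echo "Download the dataset files before fine-tuning."
```

The same helper works for other layouts by changing the list of file names passed after the directory.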

The following command runs the fine-tuning code on NER with default arguments.

$ mkdir -p $OUTPUT_DIR
$ python --do_train=true --do_eval=true --vocab_file=$BIOALBERT_DIR/vocab.txt --bert_config_file=$BIOALBERT_DIR/bert_config.json --init_checkpoint=$BIOALBERT_DIR/model.ckpt-1000000 --num_train_epochs=10.0 --data_dir=$NER_DIR --output_dir=$OUTPUT_DIR
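
The command above reads vocab.txt, bert_config.json, and a checkpoint with the prefix model.ckpt-1000000 from $BIOALBERT_DIR. A rough pre-flight check could look like the sketch below. `check_weights_dir` is a hypothetical name, and the check assumes the checkpoint is stored as one or more files whose names start with that prefix, which this README does not state explicitly.

```shell
# Hypothetical pre-flight check (not part of the repository): verify that a
# weights directory contains the files referenced by the fine-tuning flags.
check_weights_dir() {
  dir="$1"
  missing=0
  for name in vocab.txt bert_config.json; do
    [ -f "$dir/$name" ] || { echo "missing: $dir/$name"; missing=1; }
  done
  # Assumption: --init_checkpoint names a prefix, so accept any file that
  # starts with it.
  if ! ls "$dir"/model.ckpt-1000000* >/dev/null 2>&1; then
    echo "missing: $dir/model.ckpt-1000000*"
    missing=1
  fi
  return "$missing"
}

check_weights_dir "${BIOALBERT_DIR:-.}" \
  || echo "Check BIOALBERT_DIR before fine-tuning."
```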


Each dataset contains three files: dev.tsv, test.tsv, and train.tsv. Let $RE_DIR denote the folder of a single RE dataset, $TASK_NAME the task name (two options: gad, euadr), and $OUTPUT_DIR the RE output directory. Taking GAD as an example:

$ export RE_DIR=./datasets/RE/GAD/1
$ export TASK_NAME=gad
$ export OUTPUT_DIR=./re_outputs_1

The following command runs the fine-tuning code on RE with default arguments.

$ python --task_name=$TASK_NAME --do_train=true --do_eval=true --do_predict=true --vocab_file=$BIOALBERT_DIR/vocab.txt --bert_config_file=$BIOALBERT_DIR/bert_config.json --init_checkpoint=$BIOALBERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --do_lower_case=false --data_dir=$RE_DIR --output_dir=$OUTPUT_DIR
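
The GAD example pairs ./datasets/RE/GAD/1 with ./re_outputs_1, which suggests that the trailing number identifies one split of the dataset. Assuming the other splits sit in sibling numbered directories (an assumption, not something this README states), the sketch below prints the RE_DIR and OUTPUT_DIR pair for each subdirectory it finds. `list_re_folds` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the repository): print one
# RE_DIR / OUTPUT_DIR pair per subdirectory of a RE dataset root.
list_re_folds() {
  root="$1"
  for fold_dir in "$root"/*/; do
    [ -d "$fold_dir" ] || continue   # skip when the glob matched nothing
    fold=$(basename "$fold_dir")
    echo "RE_DIR=$root/$fold OUTPUT_DIR=./re_outputs_$fold"
  done
}

list_re_folds ./datasets/RE/GAD
```

Each printed pair can then be exported and passed to the RE fine-tuning command in turn.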


For question answering (BioASQ), please refer to the BioBERT repository.


Contact Information

If you have any questions, please submit a GitHub issue or contact Usman Naseem.

