BIOS NER

Overview
Repo Contents
System Requirements
Installation Guide
Demo
License
Issues
Citation

Overview

The NER model used in BIOS is a standard BERT sequence labeling classifier. Because the training data is generated by automated annotation rather than manual labeling, our approach belongs to the distantly supervised NER (DS-NER) category.
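
To illustrate the idea of automated annotation, here is a minimal sketch of dictionary-based distant supervision: tokens are tagged with BIO labels by greedy longest-match lookup in a term lexicon. This is an assumption about the general technique, not the code in preprocess; the function `annotate`, the toy `LEXICON`, and the entity types are hypothetical.

```python
from typing import Dict, List, Tuple

def annotate(tokens: List[str],
             lexicon: Dict[Tuple[str, ...], str],
             max_len: int = 5) -> List[str]:
    """Tag tokens with BIO labels by greedy longest-match lookup in a lexicon."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so "type 2 diabetes" wins over "diabetes".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in lexicon:
                etype = lexicon[key]
                labels[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + etype
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels

# Hypothetical toy lexicon; a real term list is far larger.
LEXICON = {
    ("type", "2", "diabetes"): "Disease",
    ("diabetes",): "Disease",
    ("metformin",): "Drug",
}

tokens = "Metformin is used to treat type 2 diabetes".split()
labels = annotate(tokens, LEXICON)
for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```

Labels produced this way are noisy (terms missing from the lexicon stay "O"), which is the usual trade-off of distant supervision.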

Repo Contents

pretrain: scripts for downloading pretrained models
example: small dataset to demo the code
preprocess: prepare training data using automated annotation
train: training code
predict: prediction code
utils: utility code

System Requirements

Hardware Requirements

For optimal performance, we recommend a computer with the following specs:

RAM: 16+ GB
CPU: 4+ cores, 3.3+ GHz/core
GPU Memory: 40 GB
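
A quick way to compare a machine against the CPU and RAM recommendations is sketched below, using only the Python standard library. The helper names `system_summary` and `check_recommended` are hypothetical and not part of this repo. GPU memory is not checked here because that requires a GPU library.

```python
import os

def system_summary() -> dict:
    """Collect CPU core count and physical RAM (GiB) where the OS exposes them."""
    cores = os.cpu_count() or 0
    try:
        # Available on Linux and most Unix systems; absent on Windows.
        ram_gb = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1024 ** 3
    except (AttributeError, ValueError, OSError):
        ram_gb = None
    return {"cpu_cores": cores, "ram_gb": ram_gb}

def check_recommended(summary: dict, min_cores: int = 4, min_ram_gb: float = 16.0) -> list:
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    if summary["cpu_cores"] < min_cores:
        problems.append(f"only {summary['cpu_cores']} CPU cores, {min_cores}+ recommended")
    if summary["ram_gb"] is None:
        problems.append("RAM size could not be determined")
    elif summary["ram_gb"] < min_ram_gb:
        problems.append(f"only {summary['ram_gb']:.1f} GiB RAM, {min_ram_gb:.0f}+ recommended")
    return problems

summary = system_summary()
print(summary)
for problem in check_recommended(summary):
    print("warning:", problem)
```

Note that a machine sold as "16 GB" may report slightly less usable memory, so treat a warning here as a hint rather than a hard failure.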

Software Requirements

The package has been tested on a Linux 20.04 operating system. Python 3.6 is recommended. For Python 3.7, see issue 14559.

Install requirements

pip install -r requirements.txt

Installation Guide

Download cleanterms5.txt (password: d9ol) and put it under ./example/cleanterms/.
Download pretrained models: cd pretrain && sh download_pubmedbert.sh
The download takes a few minutes to complete.

Demo

Preprocess

Prepare train/dev/test datasets.
Make sure cleanterms5.txt has been downloaded, and then:

1. cd preprocess
2. bash run.sh

Expected result: train/dev/test.txt under ../example/data/ner_train
Expected time: about 2-3 minutes.
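
To confirm that preprocessing produced its outputs before moving on to training, a small check like the following can help. It is a sketch that assumes the three files are named train.txt, dev.txt, and test.txt and that it is run from the preprocess directory; `missing_outputs` is a hypothetical helper, not part of this repo.

```python
from pathlib import Path

def missing_outputs(data_dir, names=("train.txt", "dev.txt", "test.txt")):
    """Return the expected files that are absent or empty under data_dir."""
    base = Path(data_dir)
    return [
        name for name in names
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]

# Path taken from the expected result above, relative to the preprocess directory.
missing = missing_outputs("../example/data/ner_train")
if missing:
    print("missing or empty:", ", ".join(missing))
else:
    print("all preprocessing outputs are present")
```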

Train

Make sure the pretrained model has been downloaded, and then:

1. cd train
2. sh train.sh

Expected result: fine-tuned model under ../example/data/ner_train/output
Expected time: about 5 minutes.
The evaluation results may be poor because the demo training dataset is small.
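
For reference, NER evaluation is commonly reported as entity-level precision, recall, and F1 with exact span matching. The sketch below shows that calculation; it is an assumption about the usual convention, not necessarily the metric computed by train.sh, and `entity_prf` and the example spans are hypothetical.

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall, and F1 over (start, end, type) spans.

    A predicted entity counts as correct only if its boundaries and type
    both match a gold entity exactly.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical spans: the second prediction has a wrong left boundary.
gold = [(0, 1, "Drug"), (5, 8, "Disease")]
pred = [(0, 1, "Drug"), (6, 8, "Disease")]
precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

With a very small training set, boundary errors like the one above are common, which is one reason demo scores tend to be low.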

Predict

Configure the fine-tuned model path (password: igg3) in predict.sh, and then:

1. cd predict
2. bash predict.sh

Expected result: NER predictions under ../example/data/ner_predict
Expected time: about 1 minute.
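
Token-level predictions usually need to be merged into entity mentions before further use. The following sketch does this for BIO tags; it assumes the predictions are available as parallel token and label lists, which may not match the actual file format written by predict.sh, and `bio_to_spans` is a hypothetical helper.

```python
def bio_to_spans(tokens, labels):
    """Merge BIO labels into (text, type, start, end) spans; end is exclusive."""
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the last token.
    for i, lab in enumerate(list(labels) + ["O"]):
        continues = lab.startswith("I-") and etype == lab[2:]
        if start is not None and not continues:
            spans.append((" ".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
        elif lab.startswith("I-") and start is None:
            # Stray I- tag without a matching B-: treat it as an entity start.
            start, etype = i, lab[2:]
    return spans

tokens = "Metformin is used to treat type 2 diabetes".split()
labels = ["B-Drug", "O", "O", "O", "O", "B-Disease", "I-Disease", "I-Disease"]
spans = bio_to_spans(tokens, labels)
for text, etype, start, end in spans:
    print(f"{etype}\t{start}-{end}\t{text}")
```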

Instructions for use

Generate your own cleanterms.txt from UMLS according to your own rules.
Prepare your training texts and their STA results using the STA model.
Use your datasets to train and predict.

Citation

@misc{https://doi.org/10.48550/arxiv.2203.09975,
  doi = {10.48550/ARXIV.2203.09975},
  url = {https://arxiv.org/abs/2203.09975},
  author = {Yu, Sheng and Yuan, Zheng and Xia, Jun and Luo, Shengxuan and Ying, Huaiyuan and Zeng, Sihang and Ren, Jingyi and Yuan, Hongyi and Zhao, Zhengyun and Lin, Yucong and Lu, Keming and Wang, Jing and Xie, Yutao and Shum, Heung-Yeung},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {BIOS: An Algorithmically Generated Biomedical Knowledge Graph},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
