dimbage / BioGPT

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

BioGPT

This repository contains the implementation of BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu.

Requirements and Installation

  • PyTorch version == 1.12.0
  • Python version == 3.10
  • fairseq version == 0.12.0:
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace
cd ..
  • Moses
git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=${PWD}/mosesdecoder
  • fastBPE
pip install fastBPE
git clone https://github.com/glample/fastBPE.git
export FASTBPE=${PWD}/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
  • sacremoses
pip install sacremoses
  • sklearn
pip install scikit-learn

Remember to set the environment variables MOSES and FASTBPE to the path of Moses and fastBPE respetively, as they will be required later.

Getting Started

Pre-trained models

We provide our pre-trained BioGPT model checkpoint along with fine-tuned checkpoints for downstream tasks here. Download it and extract it the root of this project.

wget https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints.tgz
tar -zxvf checkpoints.tgz

Note:

If you encounter connection issue during downloading, it may be caused by the large model file size (22GB). We splitted the large files into pieces with each being 1GB. You can donwload them and use cat to concatenate them together:

# download all the splitted files from above.
for i in `seq 0 21`; do wget `printf "https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints.tgz.%02d" $i`; done
# cat them together and decompress
cat checkpoints.tgz.* | tar -zxvf -

It contains following folders:

  • Pre-trained-BioGPT: The pre-trained BioGPT model checkpoint
  • RE-BC5CDR-BioGPT: The fine-tuned checkpoint for relation extraction task on BC5CDR
  • RE-DTI-BioGPT: The fine-tuned checkpoint for relation extraction task on KD-DTI
  • RE-DDI-BioGPT: The fine-tuned checkpoint for relation extraction task on DDI
  • DC-HoC-BioGPT: The fine-tuned checkpoint for document classification task on HoC
  • QA-PubMedQA-BioGPT: The fine-tuned checkpoint for question answering task on PubMedQA

Example Usage

Use pre-trained BioGPT model in your code:

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        "checkpoints/Pre-trained-BioGPT", 
        "checkpoint.pt", 
        "data",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        min_len=100,
        max_len_b=1024)
m.cuda()
src_tokens = m.encode("COVID-19 is")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)

Use fine-tuned BioGPT model on KD-DTI for drug-target-interaction in your code:

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel
m = TransformerLanguageModel.from_pretrained(
        "checkpoints/RE-DTI-BioGPT", 
        "checkpoint_avg.pt", 
        "data/KD-DTI/relis-bin",
        tokenizer='moses', 
        bpe='fastbpe', 
        bpe_codes="data/bpecodes",
        max_len_b=1024,
        beam=1)
m.cuda()
src_text="" # input text, e.g., a PubMed abstract
src_tokens = m.encode(src_text)
generate = m.generate([src_tokens], beam=args.beam)[0]
output = m.decode(generate[0]["tokens"])
print(output)

For more downstream tasks, please see below.

Downstream tasks

See corresponding folder in examples:

License

BioGPT is MIT-licensed. The license applies to the pre-trained models as well.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

About

License:MIT License


Languages

Language:Python 98.8%Language:Shell 1.2%