This repository contains the source code for the KDD 2023 paper "Pre-training Antibody Language Models for Antigen-Specific Computational Antibody Design". If you have questions, don't hesitate to open an issue or contact me at im_kai@hust.edu.cn or Lijun Wu at lijuwu@microsoft.com. We are happy to hear from you!
Schematic illustration of the ABGNN framework
AbBERT is the pre-trained antibody model. Its "soft" prediction is fed into the sequence GNN.
- pytorch==1.12.0
- fairseq==0.10.2
- numpy==1.23.3
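A quick way to confirm that an environment matches the pins above is to compare them against the installed package metadata. The following is a minimal sketch, not part of the repository: the helper name `check_pins` is hypothetical, and it assumes PyTorch was installed under the distribution name `torch`.

```python
from importlib import metadata

# Pins copied from the list above. "torch" is assumed to be the installed
# distribution name for PyTorch.
PINNED = {"torch": "1.12.0", "fairseq": "0.10.2", "numpy": "1.23.3"}


def check_pins(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu113" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins().items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means all three packages are present at the pinned versions.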
We collected all paired and unpaired data from the OAS database using the provided scripts and extracted the antibody sequences along with their CDR tags. We then randomly split the dataset into three subsets: 1000 for validation, 1000 for testing, and the remainder for training. After processing, we obtained the files seq.train.tokens, seq.valid.tokens, and seq.test.tokens, together with the corresponding tag.train.tokens, tag.valid.tokens, and tag.test.tokens. Finally, we preprocess these files into fairseq binary files using the following script:
bash pretrain-preprocess.sh
The processed fairseq tokens can be downloaded at abgnn/fairseq-oas-50m
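The random split described above (1000 validation, 1000 test, remainder for training) can be sketched as follows. This is an illustrative sketch rather than the repository's actual script: the function names, the fixed seed, and the one-example-per-line file layout are all assumptions.

```python
import os
import random


def split_indices(n, n_valid=1000, n_test=1000, seed=0):
    """Shuffle range(n) and cut it into train/valid/test index lists."""
    if n < n_valid + n_test:
        raise ValueError("dataset is smaller than the held-out subsets")
    order = list(range(n))
    random.Random(seed).shuffle(order)  # seeded for reproducibility (assumption)
    return {
        "valid": order[:n_valid],
        "test": order[n_valid:n_valid + n_test],
        "train": order[n_valid + n_test:],
    }


def write_split(seqs, tags, out_dir=".", **split_kwargs):
    """Write seq.<split>.tokens and tag.<split>.tokens, one example per line.

    seqs[i] and tags[i] must describe the same antibody so that the
    sequence and tag files stay aligned line by line.
    """
    if len(seqs) != len(tags):
        raise ValueError("sequences and tags must have the same length")
    for name, indices in split_indices(len(seqs), **split_kwargs).items():
        seq_path = os.path.join(out_dir, f"seq.{name}.tokens")
        tag_path = os.path.join(out_dir, f"tag.{name}.tokens")
        with open(seq_path, "w") as f_seq, open(tag_path, "w") as f_tag:
            for i in indices:
                f_seq.write(seqs[i] + "\n")
                f_tag.write(tags[i] + "\n")
```

Using the same index lists for both files keeps each sequence paired with its CDR tags in every subset.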
To pre-train AbBERT, run:
bash pretrain-abbert.sh
The pre-trained model checkpoints can be downloaded at this link
For experiment 1, we follow the preprocessing scripts in MEAN and convert the data to jsonl files, similar to experiment 2. For experiment 2, we directly use the data from HSRN. For experiment 3, we follow the setting in RefineGNN.
Notably, in experiment 3, we first fine-tune on the dataset abgnn/finetune/exp3-sabdab and then use the saved model for further fine-tuning with the script covid-optimize.sh. Since the dataset contains only antibodies, we have to use a version without antigen encoding.
The processed fine-tuning dataset in jsonl format can be downloaded at abgnn/finetune, where the 10-fold validation split from MEAN is also provided in abgnn/finetune/mean_exp1_for_abgnn.
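Since the fine-tuning data is distributed as jsonl (one JSON object per line), a small reader can help when inspecting a downloaded file. The sketch below is illustrative only: `read_jsonl` and `summarize_keys` are hypothetical helpers, and no particular field names are assumed beyond each line being a JSON object.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from err


def summarize_keys(path):
    """Count how often each top-level key appears across all records."""
    counts = {}
    for record in read_jsonl(path):
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Calling `summarize_keys` on a downloaded file shows which fields the records actually contain, without guessing the schema in advance.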
The fine-tuning scripts are as follows:
# for exp1
bash finetune-exp1.sh
# for exp2
bash finetune-exp2.sh
# for exp3
# exp3 additionally requires pytorch_lightning, matplotlib, and igfold
bash finetune-exp3.sh
bash covid-optimize.sh
For inference, run:
python inference.py \
--cdr_type ${CDR} \
--cktpath ${model_ckpt_path}/checkpoint_best.pt \
--data_path ${dataset_path}
This work is released under the MIT License.
If you find this code useful in your research, please consider citing:
@inproceedings{gao2023pre,
title={Pre-training Antibody Language Models for Antigen-Specific Computational Antibody Design},
author={Gao, Kaiyuan and Wu, Lijun and Zhu, Jinhua and Peng, Tianbo and Xia, Yingce and He, Liang and Xie, Shufang and Qin, Tao and Liu, Haiguang and He, Kun and others},
booktitle={Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages={506--517},
year={2023}
}