
PRoBERTa

Ananthan Nambiar, Maeve Heflin, Simon Liu, Sergei Maslov, Mark Hopkins, Anna Ritz

Notes

  • Links to Google Drive folders:
      ◦ BPE model
      ◦ pretraining data
      ◦ protein family data
      ◦ conservative PPI data
      ◦ aggressive PPI data
      ◦ pretrained weights
      ◦ protein family fine-tuned weights
      ◦ PPI conservative fine-tuned (20%) weights
      ◦ PPI conservative fine-tuned (100%) weights
      ◦ PPI aggressive fine-tuned (20%) weights
      ◦ PPI aggressive fine-tuned (100%) weights

Requirements and Installation

Install the SentencePiece tokenizer:

pip3 install sentencepiece

Build fairseq from the linked repository source:

git clone https://github.com/imonlius/fairseq.git
cd fairseq
pip3 install --editable . --no-binary cffi
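
A quick import check (a minimal sketch) confirms that both dependencies are usable from Python:

# Verify that both dependencies import correctly after installation.
import sentencepiece
import fairseq

print("sentencepiece and fairseq imported successfully")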

tokenizer.py:

Train a tokenizer and tokenize data for protein family and interaction fine-tuning (the general SentencePiece pattern is sketched after the list below).

Example Usage:

python3 tokenizer.py
  • Variables to change:
      ◦ path: Path to the protein family data. This should be a .tab file with "Sequence" and "Protein families" among its columns.
      ◦ int_path: Path to the protein interaction data. This should be a JSON file with 'from', 'to', and 'link' fields for each interaction.
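
For reference, BPE training and tokenization with SentencePiece follow this general pattern (a hedged sketch, not the exact contents of tokenizer.py; the file names and vocab_size are illustrative assumptions):

import sentencepiece as spm

# Train a BPE model on raw sequences, one per line ("seqs.txt" is illustrative).
spm.SentencePieceTrainer.train(
    input="seqs.txt", model_prefix="protein_bpe",
    vocab_size=10000, model_type="bpe")

# Tokenize a sequence into BPE pieces for downstream fairseq preprocessing.
sp = spm.SentencePieceProcessor(model_file="protein_bpe.model")
print(" ".join(sp.encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", out_type=str)))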

pRoBERTa_pretrain.sh:

Pre-train the RoBERTa model.

Example Usage:

bash pRoBERTa_pretrain.sh pretrain 4 pretrained_model \
        pretraining/split_binarized/ \
        768 5 125000 3125 0.0025 32 64 3
  • Arguments:
      ◦ PREFIX: Prefix for the model output files (example: pretrain)
      ◦ NUM_GPUS: Number of GPUs to use during pretraining (example: 4)
      ◦ OUTPUT_DIR: Output directory (example: pretrained_model)
      ◦ DATA_DIR: Binarized input data directory (example: pretraining/split_binarized/)
      ◦ ENCODER_EMBED_DIM: Dimension of the embeddings generated by the encoders (example: 768)
      ◦ ENCODER_LAYERS: Number of encoder layers in the model (example: 5)
      ◦ TOTAL_UPDATES: Total (maximum) number of updates during training (example: 125000)
      ◦ WARMUP_UPDATES: Number of learning-rate warm-up updates during training (example: 3125)
      ◦ PEAK_LEARNING_RATE: Peak learning rate for training (example: 0.0025)
      ◦ MAX_SENTENCES: Maximum number of sequences in each batch (example: 32)
      ◦ UPDATE_FREQ: Update the model every UPDATE_FREQ batches, i.e. gradient accumulation (example: 64; the effective batch size is worked out in the sketch after this list)
      ◦ PATIENCE: Stop training early if validation performance does not improve for PATIENCE consecutive validation runs (example: 3)
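
With the example values above, the effective batch size per parameter update is the product of the three batching arguments (a small illustrative calculation):

# Effective batch size for the example pretraining run.
num_gpus, max_sentences, update_freq = 4, 32, 64
print(num_gpus * max_sentences * update_freq)  # 8192 sequences per update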

pRoBERTa_finetune_ppi.sh:

Fine-tune the RoBERTa model for the protein interaction prediction task.

Example Usage:

bash pRoBERTa_finetune_ppi.sh ppi 4 ppi_prediction \
        ppi_prediction/split_binarized/robustness_minisplits/0.80/ \
        768 5 12500 312 0.0025 32 64 2 3 \
        pretraining/checkpoint_best.pt \
        no
  • Arguments:
      ◦ PREFIX: Prefix for the model output files (example: ppi)
      ◦ NUM_GPUS: Number of GPUs to use for fine-tuning (example: 4)
      ◦ OUTPUT_DIR: Model output directory (example: ppi_prediction)
      ◦ DATA_DIR: Binarized input data directory (example: ppi_prediction/split_binarized/robustness_minisplits/0.80/)
      ◦ ENCODER_EMBED_DIM: Dimension of the embeddings generated by the encoders (example: 768)
      ◦ ENCODER_LAYERS: Number of encoder layers in the model (example: 5)
      ◦ TOTAL_UPDATES: Total (maximum) number of updates during training (example: 12500)
      ◦ WARMUP_UPDATES: Number of learning-rate warm-up updates during training (example: 312)
      ◦ PEAK_LEARNING_RATE: Peak learning rate for training (example: 0.0025)
      ◦ MAX_SENTENCES: Maximum number of sequences in each batch (example: 32)
      ◦ UPDATE_FREQ: Update the model every UPDATE_FREQ batches (example: 64)
      ◦ NUM_CLASSES: Number of classes for the classification head (example: 2)
      ◦ PATIENCE: Stop training early if validation performance does not improve for PATIENCE consecutive validation runs (example: 3)
      ◦ PRETRAIN_CHECKPOINT: Path to the pretrained model checkpoint (example: pretraining/checkpoint_best.pt)
      ◦ RESUME_TRAINING: Whether to resume training from previous fine-tuned model checkpoints (example: no)

pRoBERTa_finetune_pfamclass.sh:

Fine-tune the RoBERTa model for the protein family classification task.

Example Usage:

bash pRoBERTa_finetune_pfamclass.sh family 4 family_classification \
        family_classification/split_binarized/robustness_minisplits/1.00 \
        768 5 12500 312 0.0025 32 64 4083 3 \
        pretraining/checkpoint_best.pt \
        no
  • Arguments:
      ◦ PREFIX: Prefix for the model output files (example: family)
      ◦ NUM_GPUS: Number of GPUs to use for fine-tuning (example: 4)
      ◦ OUTPUT_DIR: Model output directory (example: family_classification)
      ◦ DATA_DIR: Binarized input data directory (example: family_classification/split_binarized/robustness_minisplits/1.00)
      ◦ ENCODER_EMBED_DIM: Dimension of the embeddings generated by the encoders (example: 768)
      ◦ ENCODER_LAYERS: Number of encoder layers in the model (example: 5)
      ◦ TOTAL_UPDATES: Total (maximum) number of updates during training (example: 12500)
      ◦ WARMUP_UPDATES: Number of learning-rate warm-up updates during training (example: 312)
      ◦ PEAK_LEARNING_RATE: Peak learning rate for training (example: 0.0025)
      ◦ MAX_SENTENCES: Maximum number of sequences in each batch (example: 32)
      ◦ UPDATE_FREQ: Update the model every UPDATE_FREQ batches (example: 64)
      ◦ NUM_CLASSES: Number of classes for the classification head, i.e. the number of distinct protein families (example: 4083; see the sketch after this list)
      ◦ PATIENCE: Stop training early if validation performance does not improve for PATIENCE consecutive validation runs (example: 3)
      ◦ PRETRAIN_CHECKPOINT: Path to the pretrained model checkpoint (example: pretraining/checkpoint_best.pt)
      ◦ RESUME_TRAINING: Whether to resume training from previous fine-tuned model checkpoints (example: no)
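
NUM_CLASSES should equal the number of distinct family labels in the data; a hedged sketch for checking this, assuming the .family label files produced by the preprocessing steps further below:

# Count distinct family labels across the split label files.
from glob import glob

families = set()
for path in glob("family_classification/split_tokenized/family/*.family"):
    with open(path) as f:
        families.update(line.strip() for line in f)
print(len(families))  # should equal NUM_CLASSES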

Clustering/protein_family_clustering_loop.py:

Cluster proteins using k-means and compute the normalized mutual information (NMI) between the clusters and the protein families (the standard k-means + NMI pattern is sketched after the list below). Before running this, make sure to download roberta.base and the relevant checkpoints.

Example Usage:

python3 protein_family_clustering_loop.py
  • Variables to change:
      ◦ tokenized_data_filepath: Input data file path. The file must contain tokenized protein sequences in a 'Tokenized Sequence' column and each protein's family in a 'Protein families' column; any other columns are ignored.
      ◦ roberta_weights: Path to the model weights; choose pretrained or fine-tuned weights depending on which model you are evaluating.
      ◦ EMBEDDING_SIZE: Must match the PRoBERTa model's embedding dimension.
      ◦ USE_NULL_MODEL: Whether to use random cluster assignment instead of k-means clustering.
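
The clustering and scoring step follows the usual scikit-learn pattern (a minimal sketch, assuming embeddings have already been extracted into an (n_proteins, EMBEDDING_SIZE) matrix; the file names are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Illustrative inputs: one embedding row and one family label per protein.
X = np.load("embeddings.npy")
labels = [line.strip() for line in open("labels.txt")]

# Cluster into as many groups as there are families, then score with NMI.
pred = KMeans(n_clusters=len(set(labels)), random_state=0).fit_predict(X)
print("NMI:", normalized_mutual_info_score(labels, pred))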

pRoBERTa_evaluate_family_batch.py:

Predict protein families using the fine-tuned RoBERTa model (a minimal single-example prediction sketch follows the argument list below).

Example Usage:

python3 pRoBERTa_evaluate_family_batch.py family_classification/split_tokenized/full/Finetune_fam_data.split.test.10 \
	family_classification/split_binarized/robustness_minisplits/1.00/ \
	predictions.tsv \
	family_classification/checkpoints/ \
	protein_family_classification 256
  • Arguments:
      ◦ DATA: Path to the input examples to predict. This should be a CSV with the columns, in order: tokenized sequence, true family label (example: family_classification/split_tokenized/full/Finetune_fam_data.split.test.10)
      ◦ BINARIZED_DATA: Path to the binarized family data (example: family_classification/split_binarized/robustness_minisplits/1.00/)
      ◦ OUTPUT: Path to the output file for model predictions (example: predictions.tsv)
      ◦ MODEL_FOLDER: Model checkpoints folder; the checkpoint_best.pt file in this folder is used (example: family_classification/checkpoints/)
      ◦ CLASSIFICATION_HEAD_NAME: Name of the trained classification head (example: protein_family_classification)
      ◦ BATCH_SIZE: Batch size for prediction (example: 256)
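
Under the hood, prediction with a fine-tuned fairseq checkpoint follows this pattern (a hedged sketch; the tokenized input is a placeholder, and the paths match the example above):

from fairseq.models.roberta import RobertaModel

# Load the fine-tuned checkpoint; the binarized data directory supplies
# the input and label dictionaries.
roberta = RobertaModel.from_pretrained(
    "family_classification/checkpoints/",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="family_classification/split_binarized/robustness_minisplits/1.00/")
roberta.eval()

tokens = roberta.encode("MK TAY IAK QRQ")  # placeholder BPE-tokenized sequence
pred = roberta.predict("protein_family_classification", tokens).argmax(dim=-1).item()

# Map the class index back to a family name (standard fairseq idiom).
label_dict = roberta.task.label_dictionary
print(label_dict.string([pred + label_dict.nspecial]))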

pRoBERTa_evaluate_ppi_batch.py:

Predict protein-protein interactions (PPI) using the fine-tuned RoBERTa model (a paired-sequence prediction sketch follows the argument list below).

Example Usage:

python3 pRoBERTa_evaluate_ppi_batch.py ppi_prediction/split_tokenized/full/Finetune_interact_tokenized.split.test.10 \
	ppi_prediction/split_binarized/robustness_minisplits/1.00/ \
	predictions.tsv \
	ppi_prediction/checkpoints/ \
	protein_interaction_prediction 256
  • Arguments:
      ◦ DATA: Path to the input examples to predict. This should be a CSV with the columns, in order: tokenized from sequence, tokenized to sequence, true label (example: ppi_prediction/split_tokenized/full/Finetune_interact_tokenized.split.test.10)
      ◦ BINARIZED_DATA: Path to the binarized PPI data (example: ppi_prediction/split_binarized/robustness_minisplits/1.00/)
      ◦ OUTPUT: Path to the output file for model predictions (example: predictions.tsv)
      ◦ MODEL_FOLDER: Model checkpoints folder; the checkpoint_best.pt file in this folder is used (example: ppi_prediction/checkpoints/)
      ◦ CLASSIFICATION_HEAD_NAME: Name of the trained classification head (example: protein_interaction_prediction)
      ◦ BATCH_SIZE: Batch size for prediction (example: 256)
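
The only difference from family prediction is that the two tokenized sequences are encoded as a sentence pair (a hedged sketch; the sequences are placeholders):

from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained(
    "ppi_prediction/checkpoints/",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="ppi_prediction/split_binarized/robustness_minisplits/1.00/")
roberta.eval()

# encode() accepts additional sentences, so the "from" and "to" sequences
# are passed together (placeholder BPE-tokenized sequences).
tokens = roberta.encode("MK TAY IAK", "QRQ ISF VKS")
pred = roberta.predict("protein_interaction_prediction", tokens).argmax(dim=-1).item()
print(pred)  # predicted interaction class index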

shuffle_and_split_pretrain.sh:

Shuffle and split the pretraining data file into training, validation, and test data files (an equivalent Python sketch follows the argument list below).

Example Usage:

bash shuffle_and_split_pretrain.sh pretraining/tokenized_seqs_v1.txt \
	pretraining/split_tokenized/ \
	tokenized_seqs_v1
  • Arguments:
      ◦ INPUT: Input file; each line should be one example (example: pretraining/tokenized_seqs_v1.txt)
      ◦ OUTPUT: Output directory (example: pretraining/split_tokenized/)
      ◦ PREFIX: Prefix for the output files (example: tokenized_seqs_v1)
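
An equivalent of the shuffle-and-split step in Python (a minimal sketch, assuming the 80/10/10 ratio implied by the .split.train.80/.split.valid.10/.split.test.10 file suffixes):

import random

# Shuffle the examples, then write 80/10/10 train/valid/test splits.
lines = open("pretraining/tokenized_seqs_v1.txt").readlines()
random.seed(0)
random.shuffle(lines)

n = len(lines)
splits = {"split.train.80": lines[:int(0.8 * n)],
          "split.valid.10": lines[int(0.8 * n):int(0.9 * n)],
          "split.test.10":  lines[int(0.9 * n):]}
for suffix, chunk in splits.items():
    with open(f"pretraining/split_tokenized/tokenized_seqs_v1.{suffix}", "w") as f:
        f.writelines(chunk)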

shuffle_and_split.sh:

Shuffle and split a fine-tuning data file into training, validation, and test data files.

Example Usage:

bash shuffle_and_split.sh family_classification/Finetune_fam_data.csv \
	family_classification/split_tokenized/full/ \
	Finetune_fam_data
  • Arguments:
      ◦ INPUT: Input file; each line should be one example (example: family_classification/Finetune_fam_data.csv)
      ◦ OUTPUT: Output directory (example: family_classification/split_tokenized/full/)
      ◦ PREFIX: Prefix for the output files (example: Finetune_fam_data)

percentage_splits.sh:

Generate output files containing a given percentage of the input data file (a sketch of the idea follows the argument list below).

Example Usage:

bash percentage_splits.sh family_classification/split_tokenized/full/Finetune_fam_data.split.train.80 \
	family_classification/split_tokenized/full/robustness_split \
	Finetune_fam_data
  • Arguments:
      ◦ INPUT: Input file (example: family_classification/split_tokenized/full/Finetune_fam_data.split.train.80)
      ◦ OUTPUT: Output directory (example: family_classification/split_tokenized/full/robustness_split)
      ◦ PREFIX: Prefix for the output files (example: Finetune_fam_data)
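
The idea behind the robustness splits, sketched in Python (the 20%-step percentage grid is an assumption based on the 0.80 and 1.00 directories referenced above):

# Write fixed-percentage subsets of an already-shuffled training file.
src = "family_classification/split_tokenized/full/Finetune_fam_data.split.train.80"
lines = open(src).readlines()

for frac in (0.20, 0.40, 0.60, 0.80, 1.00):  # assumed percentage grid
    out = ("family_classification/split_tokenized/full/robustness_split/"
           f"Finetune_fam_data.{frac:.2f}")
    with open(out, "w") as f:
        f.writelines(lines[:int(frac * len(lines))])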

Preprocess/binarize pretraining data:

fairseq-preprocess \
	--only-source \
	--trainpref tokenized_seqs_v1.split.train.80 \
	--validpref tokenized_seqs_v1.split.valid.10 \
	--testpref tokenized_seqs_v1.split.test.10 \
	--destdir pretraining/split_binarized \
	--workers 60

Preprocess/binarize family classification finetuning data:

# Split data into sequence and family files
for f in family_classification/split_tokenized/full/Finetune*; do
	cut -f1 -d',' "$f" > family_classification/split_tokenized/sequence/$(basename "$f").sequence
	cut -f2 -d',' "$f" > family_classification/split_tokenized/family/$(basename "$f").family
done

# Replace all spaces in family names with underscores
for f in family_classification/split_tokenized/family/*.family; do
	sed -i 's/ /_/g' "$f"
done

# Generate family label dictionary file
awk '{print $0,0}' family_classification/split_tokenized/family/*.family | sort | uniq > \
	family_classification/split_tokenized/family/families.txt

# Binarize sequences, reusing the BPE dictionary from pretraining
fairseq-preprocess \
	--only-source \
	--trainpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.train.80.sequence \
	--validpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.valid.10.sequence \
	--testpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.test.10.sequence \
	--destdir family_classification/split_binarized/input0 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

# Binarize labels against the family label dictionary
fairseq-preprocess \
	--only-source \
	--trainpref family_classification/split_tokenized/family/Finetune_fam_data.split.train.80.family \
	--validpref family_classification/split_tokenized/family/Finetune_fam_data.split.valid.10.family \
	--testpref family_classification/split_tokenized/family/Finetune_fam_data.split.test.10.family \
	--destdir family_classification/split_binarized/label \
	--workers 60 \
	--srcdict family_classification/split_tokenized/family/families.txt

Preprocess/binarize PPI data:

# Split data into from sequence, to sequence, and label files
for f in ppi_prediction/split_tokenized/full/Finetune*; do
	cut -f1 -d',' "$f" > ppi_prediction/split_tokenized/from/$(basename "$f").from
	cut -f2 -d',' "$f" > ppi_prediction/split_tokenized/to/$(basename "$f").to
	cut -f3 -d',' "$f" > ppi_prediction/split_tokenized/label/$(basename "$f").label
done

# Binarize "from" sequences, reusing the BPE dictionary from pretraining
fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.train.80.from \
	--validpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.valid.10.from \
	--testpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.test.10.from \
	--destdir ppi_prediction/split_binarized/input0 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

# Binarize "to" sequences
fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.train.80.to \
	--validpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.valid.10.to \
	--testpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.test.10.to \
	--destdir ppi_prediction/split_binarized/input1 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

# Binarize labels (no --srcdict, so fairseq builds the label dictionary itself)
fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.train.80.label \
	--validpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.valid.10.label \
	--testpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.test.10.label \
	--destdir ppi_prediction/split_binarized/label \
	--workers 60
