RBP-TSTL is a two-stage transfer learning framework for genome-scale prediction of RNA-binding proteins

Introduction

RNA-binding proteins (RBPs) are crucial in the post-transcriptional control of RNAs and play vital roles in a myriad of biological processes, such as RNA localization and gene regulation. Therefore, computational methods that are capable of accurately identifying RBPs are highly desirable and have important implications for biomedical and biotechnological applications. Here we propose a two-stage deep transfer learning-based framework, termed RBP-TSTL, for accurate prediction of RBPs. In the first stage, the knowledge from the self-supervised pre-trained model was utilized for feature embeddings to represent the protein sequence, while in the second stage, a customized deep learning model was initialized based on an annotated pre-training RBP dataset before being fine-tuned on each corresponding target species dataset. This two-stage transfer learning framework enables the RBP-TSTL model to be trained effectively and improves its prediction performance. Extensive performance comparison between the RBP-TSTL models trained using the features generated by the self-supervised pre-trained model and other models trained using hand-crafted encoding features demonstrated the effectiveness of the proposed two-stage knowledge transfer strategy based on the self-supervised pre-trained models. Using the best-performing RBP-TSTL models, we further conducted genome-scale RBP predictions for Homo sapiens, Arabidopsis thaliana, Escherichia coli, and Salmonella and established a computational compendium containing all the predicted putative RBP candidates. We anticipate that the proposed RBP-TSTL approach will be explored as a useful tool for the characterization of RNA-binding proteins and exploration of their sequence-structure-function relationships.

Code details

model_inference.py identifies RBPs using a trained model.
generate_embeddings.py generates the embeddings of the protein sequences.
train.py re-trains the customized deep learning model from scratch.
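
For orientation, the sketch below illustrates the kind of preparation and pooling that an embedding step of this sort typically involves. It is not taken from generate_embeddings.py: the function names `prepare_for_prott5` and `mean_pool` are hypothetical, the residue mapping follows the usage notes published for ProtT5 models (rare residues U, Z, O, and B replaced by X, residues separated by spaces), and the averaging step is only an assumption suggested by the output file name features_mean.csv.

```python
import re
from typing import List


def prepare_for_prott5(sequence: str) -> str:
    """Upper-case a sequence, map rare residues U/Z/O/B to X, and
    separate residues with spaces (the input form ProtT5 tokenizers expect)."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.strip().upper())
    return " ".join(cleaned)


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length per-protein vector.

    Assumption: one vector per residue, all of the same dimension.
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]
```

Averaging over residues gives every protein a vector of the same length regardless of sequence length, which is what lets a fixed-size classifier consume the features.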

Dependency

python 3.8
torch 1.7.1
cuda 11.0
scikit_learn 0.22.2
SentencePiece
transformers

Dataset

The dataset for re-training the RBP-TSTL model can be downloaded from the RBPs datasets, which includes these files:

FASTA files of the sequences: pretrain_accending_trP2392_trN38582_VaP292_VaN4881_TeP298_TeN4889.fasta, 9606_accending_trP1170_trN8485_VaP126_VaN942_TeP178_TeN1202.fasta, 3701_accending_trP437_trN5574_VaP43_VaN695_TeP87_TeN1071.fasta, 561_accending_trP351_trN2819_VaP38_VaN313_TeP52_TeN378.fasta and 590_accending_trP206_trN1107_VaP22_VaN123_TeP31_TeN142.fasta. In the FASTA file names, the taxonomy ID indicates the species: 9606 is Homo sapiens, 3701 Arabidopsis thaliana, 561 Escherichia coli, and 590 Salmonella. "accending" means that the sequences in the file are named in ascending order (seq_0, seq_1, seq_2, ...). trP and trN give the number of positive and negative samples in the training set, VaP and VaN those in the validation set, and TeP and TeN those in the testing set.
The CSV files contain the labels for the protein sequences, indicating whether each sequence is an RBP or a non-RBP. Their file names follow the same convention as the FASTA files.
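
The naming convention above can be decoded programmatically. The following is a minimal sketch, not part of the repository; the function name `parse_dataset_name` and the returned dictionary layout are hypothetical.

```python
import re
from typing import Dict, Union

SPLIT_KEYS = ("trP", "trN", "VaP", "VaN", "TeP", "TeN")


def parse_dataset_name(file_name: str) -> Dict[str, Union[str, int]]:
    """Extract the dataset prefix (taxonomy ID or 'pretrain') and the six
    split sizes from a dataset file name."""
    stem = file_name.rsplit(".", 1)[0]       # drop the extension
    parts = stem.split("_")
    info: Dict[str, Union[str, int]] = {"dataset": parts[0]}
    for part in parts[1:]:
        match = re.fullmatch(r"(trP|trN|VaP|VaN|TeP|TeN)(\d+)", part)
        if match:                            # other tokens are ignored
            info[match.group(1)] = int(match.group(2))
    missing = [key for key in SPLIT_KEYS if key not in info]
    if missing:
        raise ValueError(f"missing split counts: {missing}")
    return info


info = parse_dataset_name(
    "9606_accending_trP1170_trN8485_VaP126_VaN942_TeP178_TeN1202.fasta"
)
print(info["dataset"], info["trP"] + info["trN"])  # species ID, training-set size
```

The same function works on the label CSV names, because tokens that do not match a split key (such as "accending", "pep", or "label") are skipped.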

Installation Guide

Install from GitHub

git clone https://github.com/Xinxinatg/RBP-TSTL
cd RBP-TSTL
pip install -r requirements.txt

Steps for re-training the model for genome-scale prediction of RBPs:

Download the RBPs datasets and the embeddings generated by ProtT5.

Run the code

Initialize the customized deep learning model on the annotated pre-training dataset

python train.py     --pro_label_dir 'pretrain_accending_trP2392_trN38582_VaP292_VaN4881_TeP298_TeN4889_pep_label.csv' \
                    --rep_dir 'prot_t5_xl_uniref50_pretrain.csv' \
                    --batch_size 2048 \
                    --epoch 250

Fine-tune the customized deep learning model on the dataset of the target species (taking Homo sapiens as an example)

python train.py     --pro_label_dir '9606_accending_trP1170_trN8485_VaP126_VaN942_TeP178_TeN1202_pep_label.csv' \
                    --rep_dir 'prot_t5_xl_uniref50_9606.csv' \
                    --batch_size 2048 \
                    --load_model_dir pretrained_model.pl \
                    --epoch 250
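
To repeat the two commands above for every target species, the command lines can be assembled in a small helper. This is a sketch under stated assumptions, not a script shipped with the repository: `build_train_command` is a hypothetical name, the label file names for 3701, 561, and 590 are assumed to follow the `<FASTA stem>_pep_label.csv` pattern seen for the pre-training and 9606 files, and the embedding file names are assumed to follow `prot_t5_xl_uniref50_<dataset>.csv`. Only flags that appear in the commands above are used.

```python
import shlex
from typing import List, Optional

# FASTA stems as listed in the Dataset section.
FASTA_STEMS = {
    "pretrain": "pretrain_accending_trP2392_trN38582_VaP292_VaN4881_TeP298_TeN4889",
    "9606": "9606_accending_trP1170_trN8485_VaP126_VaN942_TeP178_TeN1202",
    "3701": "3701_accending_trP437_trN5574_VaP43_VaN695_TeP87_TeN1071",
    "561": "561_accending_trP351_trN2819_VaP38_VaN313_TeP52_TeN378",
    "590": "590_accending_trP206_trN1107_VaP22_VaN123_TeP31_TeN142",
}


def build_train_command(dataset: str, load_model: Optional[str] = None,
                        batch_size: int = 2048, epochs: int = 250) -> List[str]:
    """Return the argv list for one train.py run.

    Without load_model this is the pre-training stage; with it, fine-tuning.
    """
    cmd = [
        "python", "train.py",
        # Assumed naming pattern for the label and embedding files.
        "--pro_label_dir", FASTA_STEMS[dataset] + "_pep_label.csv",
        "--rep_dir", f"prot_t5_xl_uniref50_{dataset}.csv",
        "--batch_size", str(batch_size),
    ]
    if load_model is not None:
        cmd += ["--load_model_dir", load_model]
    cmd += ["--epoch", str(epochs)]
    return cmd


# Stage 1: initialize on the pre-training set; stage 2: fine-tune per species.
print(shlex.join(build_train_command("pretrain")))
for taxid in ("9606", "3701", "561", "590"):
    print(shlex.join(build_train_command(taxid, load_model="pretrained_model.pl")))
```

The printed lines can be pasted into a shell, or each list can be passed to `subprocess.run`. Check the file names against the downloaded data before running, since the names for three of the species are inferred rather than documented here.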

Steps for identifying potential RBPs in the four species using the trained models:

Download the trained models

Run the code
- Generate embeddings of the candidate protein sequences using ProtT5. The output is a CSV file named "features_mean.csv":
```
python generate_embeddings.py [fasta file of the sequences of potential RBPs]
```
- Load the trained model and print the confidence level for each sequence, in the order in which the sequences appear in the FASTA file (taking Homo sapiens as an example):
```
python model_inference.py       --species '9606' \
                                --rep_dir 'features_mean.csv' \
                                --model_dir '9606_model.pl'
```
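
Because the confidence levels follow the order of the sequences in the FASTA file, they can be matched back to sequence identifiers by position. The sketch below assumes the scores have already been collected into a list of floats; the helper names `read_fasta_ids` and `pair_scores` and the 0.5 cut-off are hypothetical and not part of model_inference.py.

```python
from typing import Iterable, List, Sequence, Tuple


def read_fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return the identifier (first token of each '>' header) in file order."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def pair_scores(ids: Sequence[str], scores: Sequence[float],
                threshold: float = 0.5) -> List[Tuple[str, float, bool]]:
    """Attach each score to its sequence ID and flag scores >= threshold."""
    if len(ids) != len(scores):
        raise ValueError("number of scores does not match number of sequences")
    return [(seq_id, score, score >= threshold)
            for seq_id, score in zip(ids, scores)]


# Usage with in-memory FASTA lines; an open file handle works the same way.
fasta_lines = [">seq_0 example protein\n", "MKVLA\n", ">seq_1\n", "GGSAA\n"]
for seq_id, score, is_rbp in pair_scores(read_fasta_ids(fasta_lines), [0.91, 0.12]):
    print(seq_id, score, is_rbp)
```

The length check guards against the most likely failure mode of position-based matching: a FASTA file that differs from the one used to generate the embeddings.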

Download of genome-scale prediction results of RBPs in the four species:

Alternatively, the genome-scale prediction results of RBPs in the four species can be downloaded directly from Google Drive.

Update:

As per issue 1, there are a few duplicate entries in the pre-training datasets.

Reference

Peng, Xinxin, et al. "RBP-TSTL is a two-stage transfer learning framework for genome-scale prediction of RNA-binding proteins." Briefings in Bioinformatics (2022).
