DeepRNA - Easy & Pretrained Deep Learning for RNA Predictions
Implemented in TensorFlow 2.x with the Keras API
Features
- Architecture inspired by the winning models in the mRNA OpenVaccine competition (2020)
- Pretrained DeepRNA models, easy to use, Keras-style
- Applicable to general RNA prediction problems
- Built-in Dataset which can handle RNA of mixed lengths simultaneously (powered by Spektral's graph loader with a little upgrade ;)
- Built-in Self-Supervised AutoEncoder pretraining
- Companion feature extraction, from basic to advanced, for any RNA dataset with minimal assumptions (see the full tutorial below)
- Advanced techniques like pseudo-labeling and uncertainty handling can be done relatively easily (see the 2nd tutorial below)
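To make the pseudo-labeling and uncertainty-handling item above more concrete, here is a minimal, dependency-free sketch. It is not the method used in the tutorial. It assumes you already have predictions from several models on unlabeled RNAs, and it treats disagreement between models as a rough uncertainty estimate. The function name `select_pseudo_labels`, the `max_std` threshold, and the input layout are all hypothetical.

```python
from statistics import mean, pstdev


def select_pseudo_labels(ensemble_preds, max_std=0.1):
    """Return {sample_index: pseudo_label} for samples the models agree on.

    ensemble_preds: one list of scalar predictions per model, all of the
    same length (hypothetical layout, chosen only for this illustration).
    """
    pseudo = {}
    n_samples = len(ensemble_preds[0])
    for i in range(n_samples):
        votes = [preds[i] for preds in ensemble_preds]
        # Spread across models is used as a crude uncertainty estimate.
        if pstdev(votes) <= max_std:
            pseudo[i] = mean(votes)
    return pseudo


# Two hypothetical models predicting a target for three unlabeled RNAs.
preds_model_a = [0.50, 0.10, 0.90]
preds_model_b = [0.52, 0.60, 0.88]
pseudo_labels = select_pseudo_labels([preds_model_a, preds_model_b], max_std=0.05)
print(sorted(pseudo_labels))  # indices of the samples kept as pseudo-labeled
```

Samples that pass the filter can then be added to the training set with their averaged prediction as the label. Samples the models disagree on are left out.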
Step-by-Step Tutorials on Kaggle
Why Kaggle? Because Kaggle is almost like a "free" Colab "Pro", with the added bonus of free permanent storage that can easily be transferred to any working notebook. The Kaggle working environment is amazing!
- A tutorial on "Preprocessing RNA strings for Deep Learning Models in a General 'Graph' Setting"
- Quick and advanced tutorials on "Finetune Pretrained State-of-the-Art DeepRNA Model to General Prediction Problems Made Easy"
Benchmark Results
As of Feb 2022, the predictions produced by our tutorial achieve the top scores among OpenVaccine's public notebooks (see Benchmark on this page). Note that in Kaggle benchmarks, techniques such as multi-model and k-fold ensembling are standard.
Quick Start in 4 Steps
Step 1. Clone the repo to your working directory
git clone https://ratthachat@github.com/ratthachat/deep-rna.git
cp -rf ./deep-rna/deep_rna ./
Step 2. Prepare your RNA dataset using the default option as suggested in this tutorial.
After this step, you will have a list of RNA ids, rna_id_list, and directories containing RNA node and edge features, i.e. NODE_DIR and EDGE_DIR.
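As a rough illustration of what "node and edge features" can mean for an RNA, the sketch below turns a sequence and its secondary structure into a one-hot node matrix and an adjacency matrix. It assumes the structure is given in dot-bracket notation, as in the OpenVaccine data. The feature set actually produced by the tutorial and the file layout inside NODE_DIR and EDGE_DIR are not described here, so the helper names `one_hot_nodes`, `base_pairs` and `adjacency` are hypothetical.

```python
NUCLEOTIDES = "ACGU"


def one_hot_nodes(sequence):
    """One row per nucleotide, one column per letter in NUCLEOTIDES."""
    return [[1.0 if base == letter else 0.0 for letter in NUCLEOTIDES]
            for base in sequence]


def base_pairs(structure):
    """Return sorted (i, j) index pairs of matching brackets in dot-bracket notation."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def adjacency(structure):
    """Symmetric 0/1 matrix with backbone edges and base-pair edges."""
    n = len(structure)
    adj = [[0] * n for _ in range(n)]
    for i in range(n - 1):  # consecutive nucleotides along the backbone
        adj[i][i + 1] = adj[i + 1][i] = 1
    for i, j in base_pairs(structure):  # paired nucleotides
        adj[i][j] = adj[j][i] = 1
    return adj


sequence = "GGAAACC"
structure = "((...))"
node_features = one_hot_nodes(sequence)
edge_matrix = adjacency(structure)
print(base_pairs(structure))  # [(0, 6), (1, 5)]
```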
Step 3. Make and load your dataset
from deep_rna.dataset import RNADataset
from deep_rna.spektral.data import BatchLoader
rna_dataset = RNADataset(rna_id_list,
                         node_dir=NODE_DIR,
                         edge_dir=EDGE_DIR,
                         manhattan_edge_feature=True)
batch_loader = BatchLoader(rna_dataset, batch_size=128, mask=True, shuffle=True, epochs=1)  # set epochs=None to load indefinitely
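To show how RNAs of different lengths can share one batch, here is a small conceptual sketch of zero-padding with a mask column. Upstream Spektral documents mask=True as appending a binary indicator as the last node feature (1 for real nodes, 0 for padding). The modified loader bundled with this repo may differ in detail, so treat the hypothetical `pad_with_mask` helper as an illustration, not as the loader's actual code. It uses plain Python lists to stay dependency-free.

```python
def pad_with_mask(node_feature_list):
    """Zero-pad per-RNA node features to a common length and append a mask column.

    node_feature_list: one list of feature rows per RNA; rows within an RNA
    have the same width, but RNAs may have different numbers of rows.
    """
    max_len = max(len(nodes) for nodes in node_feature_list)
    n_features = len(node_feature_list[0][0])
    batch = []
    for nodes in node_feature_list:
        real = [list(row) + [1.0] for row in nodes]  # mask = 1 for real nodes
        n_missing = max_len - len(nodes)
        fake = [[0.0] * (n_features + 1) for _ in range(n_missing)]  # mask = 0
        batch.append(real + fake)
    return batch


short_rna = [[1.0, 0.0], [0.0, 1.0]]             # 2 nucleotides
long_rna = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # 3 nucleotides
batch = pad_with_mask([short_rna, long_rna])
print(len(batch[0]), len(batch[1]))  # both RNAs now have the same length
```

Downstream layers can read the last column to ignore padded positions.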
Step 4. Load the pretrained model and get the RNA embedding vectors so that you can add them into your ML pipeline!
from deep_rna.models import RNAPretrainedModel
model = RNAPretrainedModel(weights='openvaccine', include_top=False)
for x in batch_loader.load():
    embed = model.predict(x)
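One possible way to feed the embeddings into a downstream pipeline is to pool them into a single fixed-size vector per RNA. The sketch below assumes the embedding for one RNA is a list of per-nucleotide vectors together with a 0/1 mask marking real positions. The actual output shape of the pretrained model is not stated here, so both the layout and the hypothetical `masked_mean_pool` helper are assumptions.

```python
def masked_mean_pool(embeddings, mask):
    """Average per-nucleotide vectors over real (mask == 1) positions only."""
    n_real = sum(mask)
    if n_real == 0:
        raise ValueError("mask selects no positions")
    dim = len(embeddings[0])
    return [sum(vec[d] * m for vec, m in zip(embeddings, mask)) / n_real
            for d in range(dim)]


# Hypothetical embedding of one RNA: 2 real nucleotides plus 1 padded position.
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
rna_vector = masked_mean_pool(embeddings, mask)
print(rna_vector)  # [2.0, 3.0]
```

The pooled vectors have the same size for every RNA, so they can be passed to any model that expects fixed-length inputs.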
Acknowledgement
- My friends Akensert and Raman, who inspired this project.
- The amazing data scientists at Kaggle who contributed to mRNA modeling and whose work is the backbone of this project: Gilles Vandewiele, tito, Mrkmakr, Jiayang Gao and xhlulu
- The Spektral project, whose wonderful graph loader I borrowed and slightly modified to handle RNA sequences of arbitrary length.