Biotrainer

Biotrainer is an open-source tool that simplifies the training of machine-learning models for biological applications. It specializes in training models that predict features of proteins. Using biotrainer is as simple as providing your sequence and label data in the correct format, along with a configuration file.

Data standardization

Biotrainer defines a set of data standards for protocols, input files, and configuration, designed to make machine learning easier to use in biology. This standardization is also expected to improve communication between scientific disciplines and to help keep an overview of the rapidly developing field of protein prediction.

Available protocols

The protocol defines how the input data is interpreted and which kind of prediction task is applied. The following protocols are already implemented:

D=embedding dimension (e.g. 1024)
B=batch dimension (e.g. 30)
L=sequence dimension (e.g. 350)
C=number of classes (e.g. 13)

- residue_to_class --> Predict a class C for each residue encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxLxC
- residues_to_class --> Predict a single class C per sequence from all of its residues, each encoded in D dimensions, in a sequence of length L. Input BxLxD --> output BxC
- residues_to_value --> Predict a single value V per sequence from all of its residues, each encoded in D dimensions, in a sequence of length L. Input BxLxD --> output Bx1
- sequence_to_class --> Predict a class C for each sequence encoded in a fixed dimension D. Input BxD --> output BxC
- sequence_to_value --> Predict a value V for each sequence encoded in a fixed dimension D. Input BxD --> output Bx1
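
The shape conventions in the list above can be written down as a small lookup. The following is only an illustrative sketch: `protocol_shapes` is a hypothetical helper and not part of biotrainer, and it simply restates the input and output shapes given for each protocol.

```python
# Hypothetical helper (not part of biotrainer): restates the input/output
# shapes listed for each protocol, using the B, L, D, C notation from above.
def protocol_shapes(protocol, B, L, D, C):
    """Return (input_shape, output_shape) as tuples for the given protocol."""
    per_residue_input = (B, L, D)   # one D-dimensional embedding per residue
    per_sequence_input = (B, D)     # one D-dimensional embedding per sequence
    shapes = {
        "residue_to_class": (per_residue_input, (B, L, C)),
        "residues_to_class": (per_residue_input, (B, C)),
        "residues_to_value": (per_residue_input, (B, 1)),
        "sequence_to_class": (per_sequence_input, (B, C)),
        "sequence_to_value": (per_sequence_input, (B, 1)),
    }
    if protocol not in shapes:
        raise ValueError(f"Unknown protocol: {protocol}")
    return shapes[protocol]


# Example with the sample dimensions from the legend (B=30, L=350, D=1024, C=13).
for name in ("residue_to_class", "residues_to_class", "sequence_to_value"):
    input_shape, output_shape = protocol_shapes(name, B=30, L=350, D=1024, C=13)
    print(name, input_shape, "-->", output_shape)
```

The naming follows the shapes: protocols starting with residue or residues consume per-residue embeddings (BxLxD), while protocols starting with sequence consume one fixed-size embedding per sequence (BxD).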

Input file standardization

For every protocol, we defined a standard for how the input data must be provided. You can find detailed information for each protocol here.

Below, we show an example of what the sequence and label files must look like for the residue_to_class protocol:

sequences.fasta

>Seq1
SEQWENCE

labels.fasta

>Seq1 SET=train VALIDATION=False
DVCDVVDD
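
Before training, it can help to check that the two files agree. The sketch below is not biotrainer's own parser: `parse_fasta` and `check_residue_labels` are hypothetical helpers, and the rule that each label string must be as long as its sequence is an assumption derived from residue_to_class predicting one class per residue (as in the example above, where both lines have 8 characters).

```python
# Hypothetical helpers (not part of biotrainer) for a quick consistency check
# of residue_to_class input files.
def parse_fasta(text):
    """Parse FASTA text into {sequence_id: (attributes, body)}.

    Header fields of the form KEY=value (e.g. SET=train) are collected
    into the attributes dictionary as strings.
    """
    records = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current_id = fields[0]
            attributes = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
            records[current_id] = (attributes, "")
        elif current_id is not None:
            attributes, body = records[current_id]
            records[current_id] = (attributes, body + line)
    return records


def check_residue_labels(sequences_text, labels_text):
    """Return a list of human-readable problems (empty if none were found)."""
    sequences = parse_fasta(sequences_text)
    labels = parse_fasta(labels_text)
    problems = []
    for seq_id, (_, sequence) in sequences.items():
        if seq_id not in labels:
            problems.append(f"{seq_id}: no labels found")
            continue
        _, label_string = labels[seq_id]
        # Assumption: one label character per residue.
        if len(label_string) != len(sequence):
            problems.append(
                f"{seq_id}: {len(sequence)} residues but {len(label_string)} labels"
            )
    return problems


sequences_text = ">Seq1\nSEQWENCE\n"
labels_text = ">Seq1 SET=train VALIDATION=False\nDVCDVVDD\n"
print(parse_fasta(labels_text)["Seq1"][0])
print(check_residue_labels(sequences_text, labels_text))
```

To check real files, read them with `open(path).read()` and pass the contents to `check_residue_labels`.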

Configuration file

To run biotrainer, you need to provide a configuration file in YAML format along with your sequence and label data. Here you can find an example file for the residue_to_class protocol. All configuration options are listed here.

Example configuration for residue_to_class:

protocol: residue_to_class
sequence_file: sequences.fasta # Specify your sequence file
labels_file: labels.fasta # Specify your label file
model_choice: CNN # Model architecture 
optimizer_choice: adam # Model optimizer
learning_rate: 1e-3 # Optimizer learning rate
loss_choice: cross_entropy_loss # Loss function 
use_class_weights: True # Balance class weights by using class sample size in the given dataset
num_epochs: 200 # Maximum number of epochs
batch_size: 128 # Batch size
embedder_name: Rostlab/prot_t5_xl_uniref50 # Embedder to use
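
If you prefer to generate such a file from a script, a flat configuration like the one above can be rendered with a few lines of Python. This is a minimal sketch under stated assumptions: `render_flat_yaml` is a hypothetical helper, not part of biotrainer, and it only handles simple `key: value` pairs without nesting, quoting, or comments. For anything more complex, a dedicated YAML library is the safer choice.

```python
# Hypothetical helper (not part of biotrainer): renders a flat dictionary as
# "key: value" lines. No nesting, quoting, or escaping is handled.
def render_flat_yaml(options):
    """Return the options as YAML-style text, one "key: value" line per entry."""
    lines = []
    for key, value in options.items():
        if isinstance(value, (dict, list, tuple, set)):
            raise TypeError(f"Only flat scalar values are supported, got {key!r}")
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


# Same keys and values as the example configuration above.
example_config = {
    "protocol": "residue_to_class",
    "sequence_file": "sequences.fasta",
    "labels_file": "labels.fasta",
    "model_choice": "CNN",
    "optimizer_choice": "adam",
    "learning_rate": "1e-3",  # kept as text so it is written exactly as above
    "loss_choice": "cross_entropy_loss",
    "use_class_weights": True,
    "num_epochs": 200,
    "batch_size": 128,
    "embedder_name": "Rostlab/prot_t5_xl_uniref50",
}

print(render_flat_yaml(example_config), end="")
```

Writing the returned text to a file such as `config.yml` gives a configuration with the same keys and values as the example, without the comments.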

(Bio-)Embeddings

To convert sequence data into more meaningful model input, embeddings generated by protein language models (pLMs) have become widely used in recent years. Biotrainer therefore calculates embeddings automatically, at the per-sequence or per-residue level depending on the protocol. Take a look at the embeddings options to find out about all available embedding methods. You can also provide your own embeddings file, generated with your own embedder, independent of the provided calculation pipeline. Please refer to the data standardization document and the relevant examples to learn how to do this. Pre-calculated embeddings can be used for training via the embeddings_file parameter, as described in the configuration options.

Installation

Make sure you have poetry installed:

curl -sSL https://install.python-poetry.org/ | python3 -

Install dependencies and biotrainer via poetry:

# In the base directory:
poetry install
# Adding jupyter notebook (if needed):
poetry add jupyter

Running

cd examples/residue_to_class
poetry run biotrainer config.yml

You can also use the provided run-biotrainer.py file for development and debugging (you might want to set up your IDE to directly execute run-biotrainer.py with the provided virtual environment):

# residue_to_class
poetry run python3 run-biotrainer.py examples/residue_to_class/config.yml
# sequence_to_class
poetry run python3 run-biotrainer.py examples/sequence_to_class/config.yml

Docker

# Build
docker build -t biotrainer .
# Run
docker run --rm \
    -v "$(pwd)/examples/docker":/mnt \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    biotrainer:latest /mnt/config.yml

After the run, the output can be found in the directory of the provided configuration file.

Citation

If you are using biotrainer for your work, please add a citation:

@inproceedings{
sanchez2022standards,
title={Standards, tooling and benchmarks to probe representation learning on proteins},
author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
year={2022},
url={https://openreview.net/forum?id=adODyN-eeJ8}
}
