J-SNACKKB / autoeval

Module to auto evaluate FLIP datasets via bio-trainer

AutoEval

This repository contains the AutoEval module, which automatically evaluates FLIP datasets, using bio-trainer to train the models and bio-embeddings to embed the proteins.

Together with the scripts, this repository contains a bank of optimal or baseline configurations (in the configsbank folder) for each of the datasets available in FLIP. These configuration files are general versions for each dataset, and the script modifies them. The parameters most likely to change (embedder_name and model_choice) can be set through input parameters. A different configuration file can also be supplied through the input parameters, as explained below.
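
As a rough illustration of how such overrides might be applied, the sketch below merges optional command-line values into a configuration held as a dictionary. The function name apply_overrides and the sample base configuration are hypothetical; only the keys embedder_name and model_choice come from the text above, and this is not the actual AutoEval code.

```python
from copy import deepcopy
from typing import Optional


def apply_overrides(config: dict, embedder: Optional[str] = None,
                    model: Optional[str] = None) -> dict:
    """Return a copy of config with the optional overrides applied."""
    updated = deepcopy(config)
    if embedder is not None:
        updated["embedder_name"] = embedder
    if model is not None:
        updated["model_choice"] = model
    return updated


# Hypothetical base configuration, standing in for a file from configsbank.
base = {
    "protocol": "residues_to_class",
    "embedder_name": "one_hot_encoding",
    "model_choice": "LightAttention",
}
custom = apply_overrides(base, embedder="prottrans_t5_xl_u50")
print(custom["embedder_name"], custom["model_choice"])
```

Working on a copy keeps the general configuration untouched, so the same base file can be reused for several runs.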

All FLIP datasets are also available here in FASTA format, for cases where embeddings other than those available in bio-embeddings are needed and the FASTA files are therefore required. When different embeddings or modifications to the data are not required, AutoEval automatically converts the FLIP CSV format to FASTA (or directly reads the datasets that are already in FASTA).
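
The CSV-to-FASTA conversion could look roughly like the sketch below. It assumes that a FLIP CSV file has the columns sequence, target, set and validation, and that the FASTA header carries the annotations as TARGET=, SET= and VALIDATION= attributes. The function name flip_csv_to_fasta and the SequenceN identifiers are hypothetical, and the real AutoEval conversion may differ.

```python
import csv
import io


def flip_csv_to_fasta(csv_text: str) -> str:
    """Convert FLIP-style CSV text into FASTA text with annotated headers."""
    records = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        # Assumed column names: sequence, target, set, validation.
        # An empty validation cell is treated as False.
        validation = row.get("validation") or "False"
        header = (
            f">Sequence{index} TARGET={row['target']} "
            f"SET={row['set']} VALIDATION={validation}"
        )
        records.append(f"{header}\n{row['sequence']}")
    return "\n".join(records) + "\n"


# Tiny made-up example, not real FLIP data.
example_csv = "sequence,target,set,validation\nMKV,0.5,train,\nMAL,1.2,test,\n"
fasta = flip_csv_to_fasta(example_csv)
print(fasta)
```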

How to run AutoEval

AutoEval can be executed:

  • via Poetry:
# Make sure you have poetry installed
curl -sSL https://install.python-poetry.org/ | python3 - --version 1.1.13

# Install dependencies and AutoEval via poetry
poetry install

# Run
poetry run python3 run-autoeval.py split_abbreviation protocol /path/to/working_directory [--embedder embedder_name] [--embeddingsfile embeddings_path] \
    [--model model_name] [--config config_name] \
    [--minsize min_size] [--maxsize max_size] \
    [--mask]

Example:

poetry run python3 run-autoeval.py scl_1 residues_to_class ./scl_1 --embedder prottrans_t5_xl_u50
  • via Command Line:
python run-autoeval.py split_abbreviation protocol /path/to/working_directory [--embedder embedder_name] [--embeddingsfile embeddings_path] \
    [--model model_name] [--config config_name] \
    [--minsize min_size] [--maxsize max_size] \
    [--mask]

Example:

python run-autoeval.py scl_1 residues_to_class ./scl_1 --embedder prottrans_t5_xl_u50
  • via Docker:
-

The available input parameters are:

Parameter               Usage
split                   Name of the split, given as one of the abbreviations in the table below.
protocol                Protocol to use, from those available in bio-trainer.
working_dir             Path to the folder where the required files and results are saved.
-e / --embedder         Embedder to use, if different from the one in the config file. It should be one of those available in bio-embeddings.
-f / --embeddingsfile   Path to the file containing precomputed embeddings.
-m / --model            Model to use, if different from the one in the config file. It should be one of those available in bio-trainer.
-c / --config           Config file to use instead of the one provided in configsbank for the indicated split.
-mins / --minsize       Use only proteins with more than minsize residues.
-maxs / --maxsize       Use only proteins with fewer than maxsize residues.
-mask / --mask          If set, use the masks in the file mask.fasta from the working directory to filter the residues.
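
For readers who want to see how these parameters fit together, here is a minimal argparse sketch that mirrors the table above, plus a small length filter following the minsize/maxsize descriptions. It is an assumption-laden illustration, not the actual run-autoeval.py: the names build_parser and keep_by_length are hypothetical, and the defaults are guesses.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the positional and optional parameters listed above."""
    parser = argparse.ArgumentParser(prog="run-autoeval.py")
    parser.add_argument("split", help="Split abbreviation, e.g. scl_1")
    parser.add_argument("protocol", help="bio-trainer protocol, e.g. residues_to_class")
    parser.add_argument("working_dir", help="Folder for required files and results")
    parser.add_argument("-e", "--embedder", default=None)
    parser.add_argument("-f", "--embeddingsfile", default=None)
    parser.add_argument("-m", "--model", default=None)
    parser.add_argument("-c", "--config", default=None)
    parser.add_argument("-mins", "--minsize", type=int, default=None)
    parser.add_argument("-maxs", "--maxsize", type=int, default=None)
    parser.add_argument("-mask", "--mask", action="store_true")
    return parser


def keep_by_length(sequence: str, minsize: Optional[int] = None,
                   maxsize: Optional[int] = None) -> bool:
    """Keep proteins with more than minsize and fewer than maxsize residues."""
    if minsize is not None and len(sequence) <= minsize:
        return False
    if maxsize is not None and len(sequence) >= maxsize:
        return False
    return True


args = build_parser().parse_args(
    ["scl_1", "residues_to_class", "./scl_1",
     "--embedder", "prottrans_t5_xl_u50", "--minsize", "50"]
)
print(args.split, args.protocol, args.embedder, args.minsize, args.mask)
```

Both bounds are treated as strict here, matching the wording "more than" and "less than" in the parameter descriptions.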

Recommended configurations per dataset

Dataset              Type of task       Recommended pLM Embeddings    Recommended model  Reference        Available in Configsbank
AAV                  sequence_to_value  -                             FNN                [Dallago 2021]   ⚠️
GB1                  sequence_to_value  -                             FNN                [Dallago 2021]   ⚠️
Meltome              sequence_to_value  -                             FNN                [Dallago 2021]   ⚠️
SCL                  residues_to_class  ProtT5 (ProtT5-XL-UniRef50)   LightAttention     [Stärk 2021]
Bind                 residue_to_class   ProtT5 (ProtT5-XL-UniRef50)   CNN                [Littmann 2021]
SAV                  sequence_to_class  ProtT5 (ProtT5-XL-U50)        FNN                [Marquet 2021]   ⚠️
Secondary Structure  residue_to_class   ProtT5 (ProtT5-XL-U50)        CNN                -
Conservation         residue_to_class   ProtT5 (ProtT5-XL-U50)        CNN                [Marquet 2021]

Availability semaphore:

  • : Available in configsbank, as close as possible to the best configuration in the reference.
  • ⚠️: The best configuration is not possible due to, e.g., features (temporarily) missing from biotrainer. The best possible alternative is the one provided.
  • : Not available in configsbank. Some cases can still be used at the user's own responsibility.

Available splits

Dataset              Split             Abbreviation         Split             Abbreviation
AAV                  des_mut           aav_1                mut_des           aav_2
                     one_vs_many       aav_3                two_vs_many       aav_4
                     seven_vs_many     aav_5                low_vs_high       aav_6
                     sampled           aav_7
Meltome              mixed_split       meltome_1            human             meltome_2
                     human_cell        meltome_3
GB1                  one_vs_rest       gb1_1                two_vs_rest       gb1_2
                     three_vs_rest     gb1_3                low_vs_high       gb1_4
                     sampled           gb_5
SCL                  mixed_soft        scl_1                mixed_hard        scl_2
                     human_soft        scl_3                human_hard        scl_4
                     balanced          scl_5                mixed_vs_human_2  scl_6
Bind                 one_vs_many       bind_1               two_vs_many       bind_2
                     from_publication  bind_3               one_vs_sm         bind_4
                     one_vs_mn         bind_5               one_vs_sn         bind_6
SAV                  mixed             sav_1                human             sav_2
                     only_savs         sav_3
Secondary Structure  sampled           secondary_structure
Conservation         sampled           conservation
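
The abbreviations in the table can be thought of as a simple lookup from abbreviation to dataset and split. The sketch below transcribes the table as written into a dictionary; the names SPLITS and resolve_split are hypothetical, and how AutoEval resolves abbreviations internally is not shown in this README.

```python
# Abbreviation -> (dataset, split), transcribed from the table above.
SPLITS = {
    "aav_1": ("AAV", "des_mut"), "aav_2": ("AAV", "mut_des"),
    "aav_3": ("AAV", "one_vs_many"), "aav_4": ("AAV", "two_vs_many"),
    "aav_5": ("AAV", "seven_vs_many"), "aav_6": ("AAV", "low_vs_high"),
    "aav_7": ("AAV", "sampled"),
    "meltome_1": ("Meltome", "mixed_split"), "meltome_2": ("Meltome", "human"),
    "meltome_3": ("Meltome", "human_cell"),
    "gb1_1": ("GB1", "one_vs_rest"), "gb1_2": ("GB1", "two_vs_rest"),
    "gb1_3": ("GB1", "three_vs_rest"), "gb1_4": ("GB1", "low_vs_high"),
    "gb_5": ("GB1", "sampled"),  # abbreviation kept exactly as listed
    "scl_1": ("SCL", "mixed_soft"), "scl_2": ("SCL", "mixed_hard"),
    "scl_3": ("SCL", "human_soft"), "scl_4": ("SCL", "human_hard"),
    "scl_5": ("SCL", "balanced"), "scl_6": ("SCL", "mixed_vs_human_2"),
    "bind_1": ("Bind", "one_vs_many"), "bind_2": ("Bind", "two_vs_many"),
    "bind_3": ("Bind", "from_publication"), "bind_4": ("Bind", "one_vs_sm"),
    "bind_5": ("Bind", "one_vs_mn"), "bind_6": ("Bind", "one_vs_sn"),
    "sav_1": ("SAV", "mixed"), "sav_2": ("SAV", "human"),
    "sav_3": ("SAV", "only_savs"),
    "secondary_structure": ("Secondary Structure", "sampled"),
    "conservation": ("Conservation", "sampled"),
}


def resolve_split(abbreviation: str) -> tuple:
    """Return (dataset, split) for an abbreviation, or raise ValueError."""
    try:
        return SPLITS[abbreviation]
    except KeyError:
        raise ValueError(f"Unknown split abbreviation: {abbreviation!r}") from None


dataset, split = resolve_split("scl_1")
print(dataset, split)
```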

License:Academic Free License v3.0

