AutoEval

This repository contains AutoEval, a module for fast and easy evaluation of FLIP benchmarking tasks. It uses biotrainer to train the task-specific models, and bio-embeddings or custom embedders to embed proteins.

A typical invocation looks like this:

python run-autoeval.py scl_mixed_soft residues_to_class ./results --embedder Rostlab/prot_t5_xl_uniref50

where

scl_mixed_soft indicates the task and the split to be evaluated,
residues_to_class is the protocol used for the task,
./results is the output directory,
and --embedder Rostlab/prot_t5_xl_uniref50 is the embedder from bio-embeddings to be used.

The different options are summarized below.

Installation and running

Make sure you have poetry installed:

curl -sSL https://install.python-poetry.org/ | python3 - --version 1.4.2

Install dependencies and biotrainer via poetry:

# In the base directory:
poetry install
# Optional: Add bio-embeddings to compute embeddings
poetry install --extras "bio-embeddings"
# You can also install all extras at once
poetry install --all-extras

To run AutoEval:

with Poetry:

# Option 1:
poetry run autoeval DATASET_SPLIT PROTOCOL WORKING_DIR [...]

# Option 2:
autoeval DATASET_SPLIT PROTOCOL WORKING_DIR [...]

The provided run-autoeval.py script can also be used.

with Docker:

# Build
docker build -t autoeval .
# Run
docker run --rm \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    autoeval:latest /mnt/config.yml

Options

Parameter	Usage
`split`	Name of the split, e.g. `aav_des_mut`. The different options are listed at the end of this file.
`protocol`	Task-specific training protocol to use from the available ones in biotrainer: `residue_to_class`, `residues_to_class`, `sequence_to_class` and `sequence_to_value`.
`working_dir`	Path to the working directory.
`-e` / `--embedder`	Embedder to use if different from the one in the default configuration. It can be from the ones available in bio-embeddings, e.g. `esm1b`; or a custom embedder (see details here).
`-f` / `--embeddingsfile`	Path to the file containing precomputed embeddings if available.
`-m` / `--model`	Model to use if different from the one in the default configuration. It should be one of those available in biotrainer, e.g. `FNN` or `CNN`.
`-c` / `--config`	Config file different from the provided one in configsbank for the indicated `split`.
`-mins` / `--minsize`	Only use proteins of at least the given minimum length.
`-maxs` / `--maxsize`	Only use proteins of at most the given maximum length.
`-mask` / `--mask`	If set, use the masks in the file `mask.fasta` from the split to filter the residues. It also accepts a path to a different masks file.
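As an illustration of how the positional arguments and flags in the table above fit together, the following sketch assembles an `autoeval` command line from Python. The helper `build_autoeval_command` and its keyword names are hypothetical and not part of AutoEval; only the flags and protocol names are taken from the table. `--mask` is left out because it can be used with or without a value.

```python
import shlex

# Protocol names as listed in the options table.
PROTOCOLS = {
    "residue_to_class",
    "residues_to_class",
    "sequence_to_class",
    "sequence_to_value",
}


def build_autoeval_command(split, protocol, working_dir, embedder=None,
                           embeddings_file=None, model=None, config=None,
                           min_size=None, max_size=None):
    """Hypothetical helper: return the argument list for one autoeval run."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    cmd = ["autoeval", split, protocol, working_dir]
    optional = [
        ("--embedder", embedder),
        ("--embeddingsfile", embeddings_file),
        ("--model", model),
        ("--config", config),
        ("--minsize", min_size),
        ("--maxsize", max_size),
    ]
    # Only append the flags that were actually given.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_autoeval_command(
    "scl_mixed_soft", "residues_to_class", "./results",
    embedder="Rostlab/prot_t5_xl_uniref50", min_size=50,
)
# Print the command as a single shell-quoted line.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where `autoeval` is installed.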

Default configurations

For every task, the configuration from the original reference is used by default (defined in the configsbank folder). A different configuration can be used by changing the input arguments of AutoEval or by copying and modifying the given one. The default can be overridden using --config NEW_CONFIG.yml.

Dataset	Type of task	Recommended pLM Embeddings	Recommended model	Reference	Available in Configsbank
`AAV`	sequence_to_value	-	FNN	[Dallago 2021]	⚠️
`GB1`	sequence_to_value	-	FNN	[Dallago 2021]	⚠️
`Meltome`	sequence_to_value	-	FNN	[Dallago 2021]	⚠️
`SCL`	residues_to_class	ProtT5 (ProtT5-XL-UniRef50)	LightAttention	[Stärk 2021]	✅
`Bind`	residue_to_class	ProtT5 (ProtT5-XL-UniRef50)	CNN	[Littmann 2021]	✅
`SAV`	sequence_to_class	ProtT5 (ProtT5-XL-UniRef50)	FNN	[Marquet 2021]	⚠️
`Secondary Structure`	residue_to_class	ProtT5 (ProtT5-XL-UniRef50)	CNN	-	✅
`Conservation`	residue_to_class	ProtT5 (ProtT5-XL-UniRef50)	CNN	[Marquet 2021]	✅

Availability semaphore:

✅: The configuration available in configsbank is as close as possible to the best configuration in the reference.
⚠️: The best configuration is not possible due to, e.g., a lack of features in biotrainer. The best possible alternative is the one available.
❌: Not available in configsbank. Some cases can still be used at the user's own responsibility.
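The defaults table can also be read as a lookup from dataset to protocol and recommended model, which helps when choosing the PROTOCOL argument for a given split. The sketch below transcribes the table into a dictionary; the names `DEFAULTS` and `defaults_for` are hypothetical, and the lowercase dataset prefixes are assumed to match those used in the split names.

```python
# Dataset prefix -> (protocol, recommended model), transcribed from the table.
DEFAULTS = {
    "aav": ("sequence_to_value", "FNN"),
    "gb1": ("sequence_to_value", "FNN"),
    "meltome": ("sequence_to_value", "FNN"),
    "scl": ("residues_to_class", "LightAttention"),
    "bind": ("residue_to_class", "CNN"),
    "sav": ("sequence_to_class", "FNN"),
    "secondary_structure": ("residue_to_class", "CNN"),
    "conservation": ("residue_to_class", "CNN"),
}


def defaults_for(split_name):
    """Hypothetical helper: return (protocol, model) for a name like 'scl_mixed_soft'."""
    for dataset, defaults in DEFAULTS.items():
        if split_name.startswith(dataset + "_"):
            return defaults
    raise KeyError(f"no default configuration known for {split_name!r}")


protocol, model = defaults_for("scl_mixed_soft")
print(protocol, model)
```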

Available splits

To reference the split to be evaluated, use the pattern dataset_split. For example, the split seven_vs_many from the dataset aav must be referenced as aav_seven_vs_many.

Dataset	Splits
`AAV` (`aav_*`)	`des_mut`, `mut_des`, `one_vs_many`, `two_vs_many`, `seven_vs_many`, `low_vs_high`, `sampled`
`Meltome` (`meltome_*`)	`mixed_split`, `human`, `human_cell`
`GB1` (`gb1_*`)	`one_vs_rest`, `two_vs_rest`, `three_vs_rest`, `low_vs_high`, `sampled`
`SCL` (`scl_*`)	`mixed_soft`, `mixed_hard`, `human_soft`, `human_hard`, `balanced`, `mixed_vs_human_2`
`Bind` (`bind_*`)	`one_vs_many`, `two_vs_many`, `from_publication`, `one_vs_sm`, `one_vs_mn`, `one_vs_sn`
`SAV` (`sav_*`)	`mixed`, `human`, `only_savs`
`Secondary Structure` (`secondary_structure_*`)	`sampled`
`Conservation` (`conservation_*`)	`sampled`
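The dataset_split naming pattern can be checked before launching a run. The following is a minimal sketch that transcribes the table above into a dictionary and builds or parses split identifiers; `SPLITS`, `split_id` and `parse_split_id` are hypothetical names, not part of AutoEval.

```python
# Dataset prefix -> available splits, transcribed from the table above.
SPLITS = {
    "aav": ["des_mut", "mut_des", "one_vs_many", "two_vs_many",
            "seven_vs_many", "low_vs_high", "sampled"],
    "meltome": ["mixed_split", "human", "human_cell"],
    "gb1": ["one_vs_rest", "two_vs_rest", "three_vs_rest",
            "low_vs_high", "sampled"],
    "scl": ["mixed_soft", "mixed_hard", "human_soft", "human_hard",
            "balanced", "mixed_vs_human_2"],
    "bind": ["one_vs_many", "two_vs_many", "from_publication",
             "one_vs_sm", "one_vs_mn", "one_vs_sn"],
    "sav": ["mixed", "human", "only_savs"],
    "secondary_structure": ["sampled"],
    "conservation": ["sampled"],
}


def split_id(dataset, split):
    """Build the dataset_split identifier, rejecting unknown combinations."""
    if dataset not in SPLITS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if split not in SPLITS[dataset]:
        raise ValueError(f"unknown split {split!r} for dataset {dataset!r}")
    return f"{dataset}_{split}"


def parse_split_id(identifier):
    """Split an identifier back into (dataset, split)."""
    # Dataset names may contain underscores, so match on known prefixes
    # instead of splitting on the first underscore.
    for dataset, splits in SPLITS.items():
        prefix = dataset + "_"
        if identifier.startswith(prefix) and identifier[len(prefix):] in splits:
            return dataset, identifier[len(prefix):]
    raise ValueError(f"not a known dataset_split identifier: {identifier!r}")


print(split_id("aav", "seven_vs_many"))
print(parse_split_id("secondary_structure_sampled"))
```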
