Metagenomic-DeepFRI

About The Project

Do you have thousands of protein sequences with unknown structures, but still want to know their molecular function, biological process, cellular component and enzyme commission predicted by DeepFRI Graph Convolutional Network?

This is the right project for this task! Pipeline in a nutshell:

Search for similar target protein sequences using MMseqs2.
Align target protein contact map to fit your query protein with unknown structure.
Run predictions on query sequence combined with aligned target contact map or sequence alone if no alignment was found.

Built With

Installation

1. Install environment and DeepFRI

Clone repo locally

git clone https://github.com/bioinf-mcb/Metagenomic-DeepFRI
cd Metagenomic-DeepFRI

Setup conda environment

conda env create --name deepfri --file environment.yml
conda activate deepfri

Install mDeepFRI

pip install .

Verify installation

pytest
mDeepFRI --help

Usage

1. Download models

Run command:

mDeepFRI get-models --output path/to/weights/folder

2. Prepare database

Upload structure (.pdb or .mmcif) files to a folder in your system.
Run command:

mDeepFRI build-db --input path/to/folder/with/strucures --output path/to/database -t threads

Tip: building a database from AF2Swissprot (~550k predicted structures) on 32 CPU cores took ~30 min.

Use parameter -max_len to define maximal length of the protein. Due to initial DeepFRI training set default value is set to 1000.

Main feature of this project is its ability to generate query contact map on the fly using results from mmseqs2 target database search for similar protein sequences with known structures. Later in the metagenomic_deepfri.py contact map alignment is performed to use it as input to DeepFRI GCN. (implemented in CPP_lib/load_contact_maps.h)

The command will search for structure files, process them and store protein sequence and atoms positions inside database/seq_atom_db. It will also create a mmseqs2 database within database/.

You can also use --input DIR_1 FILE_2 ... argument list to parse structures from multiple sources. Accepted formats are: .pdb, .cif, .ent both raw and compressed with .gz

Protein ID is used as a filename. A new protein whose ID already exists in the database will be skipped. Use --overwrite flag to overwrite existing sequences and atoms positions.

3. Predict protein function

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/database/folder/from/previous/step -w /path/to/deepfri/weights/folder -o /output_path

Attention: Single instance of DeepFRI on GPU requires 10GB VRAM.

Other available parameters can be found upon command mDeepFRI --help.

Results

Finished folder will contain:

query_files/* - directory containing all input query files.
mmseqs2_search_results.m8
alignments.json - results of alignment search implemented in utils.search_alignments.py
metadata* - files with some useful info
results* - multiple files from DeepFRI. Organized by model type (GCN or CNN) and its mode (mf, bp, cc, ec) for the total of 8 files. Sometimes results from one model can be missing which means that all query proteins sequences were aligned correctly or none of them were aligned.
```
mf = molecular_function
bp = biological_process
cc = cellular_component
ec = enzyme_commission
```

GPU / CPU utilization

If CUDA is installed on your machine, metaDeepFRI will automatically use it for prediction, no additional installations are needed. If not, the model will use CPUs. If argument threads is provided, the prediction will run on multiple CPU cores.

Citations

If you use this software please cite:

Gligorijević et al. "Structure-based protein function prediction using graph convolutional networks" Nat. Comms. (2021). https://doi.org/10.1038/s41467-021-23303-9
Steinegger & Söding "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets" Nat. Biotechnol. (2017) https://doi.org/10.1038/nbt.3988
Maranga et al. "Comprehensive Functional Annotation of Metagenomes and Microbial Genomes Using a Deep Learning-Based Method" mSystems (2023) https://doi.org/10.1128/msystems.01178-22
Daily "Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments" BMC Bioinform. (2016) https://doi.org/10.1186/s12859-016-0930-z

Contributing

If you have a suggestion that would make this project better, please send an e-mail or fork the repo and create a pull request.

Contact

Piotr Kucharski - soliareofastorauj@gmail.com
Valentyn Bezshapkin - valentyn.bezshapkin@micro.biol.ethz.ch

SoliareofAstora / Metagenomic-DeepFRI