SoliareofAstora / Metagenomic-DeepFRI

Pipeline for searching and aligning contact maps for proteins, then running DeepFri's GCN.

Metagenomic-DeepFRI

About The Project

Do you have thousands of protein sequences with unknown structures, but still want to predict their molecular function, biological process, cellular component and enzyme commission with the DeepFRI Graph Convolutional Network?

This project is built for that task. The pipeline in a nutshell:

  1. Search for similar target protein sequences using MMseqs2.
  2. Align the target protein's contact map to fit your query protein of unknown structure.
  3. Run predictions on the query sequence combined with the aligned target contact map, or on the sequence alone if no alignment was found.
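
The branching in step 3 can be sketched as follows. This is only an illustration, not the project's code: `choose_model` and the `alignments` dictionary are hypothetical names, and the sketch assumes that the GCN handles queries with an aligned contact map while the sequence-only CNN handles the rest.

```python
from typing import Dict

def choose_model(query_id: str, alignments: Dict[str, str]) -> str:
    """Return "GCN" when an aligned target structure exists, else "CNN"."""
    # Assumption: a query with an aligned contact map goes to the GCN,
    # a query without one falls back to the sequence-only CNN.
    return "GCN" if query_id in alignments else "CNN"

# Hypothetical outcome of steps 1 and 2: query ID -> best aligned target ID.
alignments = {"query_1": "target_A"}

plan = {q: choose_model(q, alignments) for q in ["query_1", "query_2"]}
print(plan)
```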

Installation

1. Install environment and DeepFRI

  1. Clone repo locally
     git clone https://github.com/bioinf-mcb/Metagenomic-DeepFRI
     cd Metagenomic-DeepFRI
  2. Setup conda environment
     conda env create --name deepfri --file environment.yml
     conda activate deepfri
  3. Install mDeepFRI
     pip install .
  4. Verify installation
     pytest
     mDeepFRI --help

Usage

1. Download models

Run command:

mDeepFRI get-models --output path/to/weights/folder

2. Prepare database

  1. Place structure files (.pdb or .mmcif) in a folder on your system.
  2. Run command:
     mDeepFRI build-db --input path/to/folder/with/structures --output path/to/database -t threads

Tip: building a database from AF2Swissprot (~550k predicted structures) on 32 CPU cores took ~30 min.

Use the parameter -max_len to set the maximum protein length. Because of the initial DeepFRI training set, the default value is 1000.

The main feature of this project is its ability to generate a query contact map on the fly, using the results of an MMseqs2 search of the target database for similar protein sequences with known structures. Contact map alignment is then performed in metagenomic_deepfri.py (implemented in CPP_lib/load_contact_maps.h), and the aligned map is used as input to the DeepFRI GCN.
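
To make the idea of contact map alignment concrete, here is a minimal sketch in pure Python. It is not the implementation in CPP_lib/load_contact_maps.h: the function name `project_contact_map` is hypothetical, and it assumes the alignment is given as two gapped strings of equal length, with query residues that have no aligned target residue keeping only their self-contact.

```python
from typing import Dict, List

def project_contact_map(query_aln: str, target_aln: str,
                        target_cmap: List[List[int]]) -> List[List[int]]:
    """Transfer target contacts to query residues through a gapped alignment."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")

    # Map query residue index -> target residue index for aligned columns.
    q_to_t: Dict[int, int] = {}
    qi = ti = 0
    for q_char, t_char in zip(query_aln, target_aln):
        if q_char != "-" and t_char != "-":
            q_to_t[qi] = ti
        if q_char != "-":
            qi += 1
        if t_char != "-":
            ti += 1

    n = qi  # number of query residues
    cmap = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, t_i in q_to_t.items():
        for j, t_j in q_to_t.items():
            if target_cmap[t_i][t_j]:
                cmap[i][j] = 1
    return cmap

# Target "ACGD" has a contact between its first and last residue.
target_cmap = [[1, 0, 0, 1],
               [0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 0, 0, 1]]
query_cmap = project_contact_map("AC-DE", "ACGD-", target_cmap)
```

In this example the query residue D inherits the contact of the target's D, while the unaligned query residue E ends up with no contacts other than itself.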

The command searches for structure files, processes them, and stores each protein's sequence and atom positions inside database/seq_atom_db. It also creates an MMseqs2 database within database/.
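
Stored atom positions are what make it possible to compute a contact map later. The sketch below shows the general technique, not the project's code: it assumes one C-alpha coordinate per residue and a 10 Å distance cutoff, and `contact_map` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]

def contact_map(ca_coords: Sequence[Coord], threshold: float = 10.0) -> List[List[int]]:
    """Binary contact map: 1 where two residues are closer than the threshold."""
    # Assumption: one C-alpha coordinate per residue, cutoff in angstroms.
    n = len(ca_coords)
    return [[1 if math.dist(ca_coords[i], ca_coords[j]) < threshold else 0
             for j in range(n)]
            for i in range(n)]

# Three residues on a line: the first two are neighbours, the third is far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (30.0, 0.0, 0.0)]
cmap = contact_map(coords)
```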

You can also pass a list of arguments, --input DIR_1 FILE_2 ..., to parse structures from multiple sources. Accepted formats are .pdb, .cif and .ent, either raw or compressed with .gz.
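
If you want to check in advance which files in a set of inputs match the accepted formats, a small helper along these lines can do it. This is a sketch and not part of mDeepFRI: `is_structure_file` and `collect_structures` are hypothetical names, and searching directories recursively is an assumption about how the inputs are laid out.

```python
from pathlib import Path
from typing import Iterable, List

STRUCTURE_SUFFIXES = (".pdb", ".cif", ".ent")

def is_structure_file(path: Path) -> bool:
    """True for .pdb, .cif or .ent files, optionally compressed with .gz."""
    name = path.name.lower()
    if name.endswith(".gz"):
        name = name[:-3]
    return name.endswith(STRUCTURE_SUFFIXES)

def collect_structures(inputs: Iterable[str]) -> List[Path]:
    """Gather structure files from a mix of directories and single files."""
    found: List[Path] = []
    for item in inputs:
        path = Path(item)
        if path.is_dir():
            # Assumption: directories are searched recursively.
            found.extend(f for f in sorted(path.rglob("*"))
                         if f.is_file() and is_structure_file(f))
        elif path.is_file() and is_structure_file(path):
            found.append(path)
    return found
```

For example, `collect_structures(["DIR_1", "FILE_2"])` returns the matching paths from both sources and silently ignores anything else.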

Protein ID is used as a filename. A new protein whose ID already exists in the database is skipped. Use the --overwrite flag to overwrite existing sequences and atom positions.
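
The skip-or-overwrite behaviour described above amounts to the following logic. The sketch uses a plain dictionary as a stand-in for the database, and `add_proteins` is a hypothetical name, so treat it as an illustration rather than the actual storage code.

```python
from typing import Dict, List, Tuple

def add_proteins(db: Dict[str, str], new: List[Tuple[str, str]],
                 overwrite: bool = False) -> List[str]:
    """Insert (protein_id, sequence) pairs into db and return the skipped IDs."""
    skipped: List[str] = []
    for protein_id, sequence in new:
        if protein_id in db and not overwrite:
            skipped.append(protein_id)  # ID already present: keep the old entry
            continue
        db[protein_id] = sequence
    return skipped

db = {"P1": "MKV"}
skipped = add_proteins(db, [("P1", "MAA"), ("P2", "MLL")])
# With overwrite=True the existing entry for P1 would be replaced instead.
```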

3. Predict protein function

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/database/folder/from/previous/step -w /path/to/deepfri/weights/folder -o /output_path
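
When predictions are launched from a script, it can help to build the command as an argument list instead of a single string. The sketch below only assembles the command shown above with the same four options; `predict_command` is a hypothetical helper, and the paths are placeholders.

```python
import shlex
from typing import List

def predict_command(sequences: str, database: str,
                    weights: str, output: str) -> List[str]:
    """Build the argument list for the predict-function command shown above."""
    return ["mDeepFRI", "predict-function",
            "-i", sequences, "-d", database,
            "-w", weights, "-o", output]

cmd = predict_command("queries.faa", "database", "weights", "results")
print(shlex.join(cmd))  # shell-quoted form, useful for logging
```

The list can then be passed to `subprocess.run(cmd, check=True)` once mDeepFRI is installed.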

Attention: A single instance of DeepFRI on a GPU requires 10 GB of VRAM.

Other available parameters are listed by running mDeepFRI --help.

Results

The output folder will contain:

  1. query_files/* - directory containing all input query files.
  2. mmseqs2_search_results.m8
  3. alignments.json - results of alignment search implemented in utils.search_alignments.py
  4. metadata* - files with some useful info
  5. results* - multiple files from DeepFRI, organized by model type (GCN or CNN) and mode (mf, bp, cc, ec), for a total of 8 files. Results from one model can be missing. This means that either all query protein sequences were aligned or none of them were.
    mf = molecular_function
    bp = biological_process
    cc = cellular_component
    ec = enzyme_commission
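
The search results file can be inspected with a few lines of Python. This sketch assumes that mmseqs2_search_results.m8 uses the default MMseqs2 tabular layout (12 tab-separated columns, BLAST-style); if the pipeline writes custom columns, `M8_COLUMNS` must be adjusted. `best_hits` is a hypothetical helper that keeps the highest bit score per query.

```python
import csv
import io
from typing import Dict, IO

# Assumption: default MMseqs2 tabular output columns.
M8_COLUMNS = ["query", "target", "pident", "alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend", "evalue", "bits"]

def best_hits(handle: IO[str]) -> Dict[str, dict]:
    """Return the hit with the highest bit score for every query."""
    best: Dict[str, dict] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # ignore empty lines
        hit = dict(zip(M8_COLUMNS, row))
        hit["bits"] = float(hit["bits"])
        current = best.get(hit["query"])
        if current is None or hit["bits"] > current["bits"]:
            best[hit["query"]] = hit
    return best

# Made-up rows for illustration only.
example_text = (
    "q1\ttA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\ttB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-55\t200\n"
    "q2\ttC\t40.0\t80\t48\t0\t1\t80\t5\t84\t1e-10\t60\n"
)
hits = best_hits(io.StringIO(example_text))
```

For a real run, replace the `io.StringIO` object with `open("mmseqs2_search_results.m8")`.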
    

GPU / CPU utilization

If CUDA is installed on your machine, mDeepFRI automatically uses it for prediction, and no additional installation is needed. Otherwise, the model runs on the CPU. If the threads argument is provided, prediction runs on multiple CPU cores.

Citations

If you use this software please cite:

Contributing

If you have a suggestion that would make this project better, please send an e-mail or fork the repo and create a pull request.

Contact

Piotr Kucharski - soliareofastorauj@gmail.com
Valentyn Bezshapkin - valentyn.bezshapkin@micro.biol.ethz.ch

License: BSD 3-Clause "New" or "Revised" License


Languages

Python 86.2%, C++ 13.0%, CMake 0.7%