francescopatane96 / eNERVE

eNERVE is a dynamic, high-throughput, standalone in silico reverse vaccinology pipeline for discovering protein vaccine candidates (PVCs) in eukaryotic organisms from entire proteomes (FASTA files).

eNERVE v1.0 - Eucaryotic New Enhanced Reverse Vaccinology Environment

eNERVE is a dynamic, high-throughput, standalone in silico pipeline that uses a reverse vaccinology approach to discover protein vaccine candidates (PVCs) in eukaryotic organisms.

The tool identifies candidate antigens (immunogens) and their epitopes with machine learning (ML) and alignment analysis methods, using a tree-based approach. It takes entire proteomes as input: FASTA files downloaded from UniProt and other databases, or generated from proteomic experiments.
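To make the input format concrete, here is a minimal sketch of reading a proteome FASTA file into memory. It is not eNERVE's own parser: the function name `read_fasta` and the two demo records (identifiers and sequences) are made up for illustration.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"record_{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so append them to the current record.
            records[current_id] += line.upper()
    return records


# Made-up records, only to show the expected layout of the input.
example = """>sp|P00001|DEMO1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>sp|P00002|DEMO2 hypothetical protein
MSLLTEVETP
""".splitlines()

proteome = read_fasta(example)
for protein_id, sequence in proteome.items():
    print(protein_id, len(sequence))
```

With a real file, the same function can be called on an open file handle, for example `read_fasta(open("proteome.fasta"))`.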

The pipeline produces three kinds of output: a CSV file listing every candidate protein with its score and the predicted value of each computed characteristic (P_adhesin, P_antigen, location, etc.); a CSV file with the discarded proteins; and, for every selected protein, a set of files (CSVs, PNGs) in dedicated folders containing its predicted linear epitopes.
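As an illustration of how the candidates CSV could be post-processed, the sketch below ranks proteins by score using only the Python standard library. The rows are made up, and the header names (`id`, `score`, `p_ad`, `p_loc_out`) are assumptions; check the header of your own output file before reusing it.

```python
import csv
import io

# Made-up rows standing in for the candidates CSV; real header names may differ.
SAMPLE_CSV = """id,score,p_ad,p_loc_out
protA,0.91,0.88,0.95
protB,0.42,0.30,0.65
protC,0.77,0.81,0.70
"""


def top_candidates(handle, min_score):
    """Return (id, score) pairs with score >= min_score, best first."""
    rows = csv.DictReader(handle)
    kept = [
        (row["id"], float(row["score"]))
        for row in rows
        if float(row["score"]) >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ranked = top_candidates(io.StringIO(SAMPLE_CSV), min_score=0.5)
print(ranked)
```

To use it on a real output file, pass an open file handle instead of the `io.StringIO` object.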

eNERVE is flexible: you can select which modules to run and which cutoffs to apply during proteome analysis and protein classification (see the Usage section of this README for more information).

eNERVE is designed to assist experimental research in vaccine discovery and formulation for eukaryotic targets. It draws on data available to the scientific community to build machine learning models that make antigen discovery easier and cheaper, supporting both whole-protein and epitope-based vaccines.

🦠 Are you searching for a bacterial vaccine discovery pipeline? Please visit the NERVE (bNERVE) repository.

🔴 Before running the pipeline, please read the Installation section and the Usage and Examples sections.


💻 Pipeline architecture and data flow

  1. Input: proteins or entire proteomes, downloaded from databases or obtained from proteomic experiments, are passed to the pipeline in FASTA format;
  2. Quality control module and generation of protein instances: proteins that do not pass QC are discarded (discarded_sequences.fasta); descriptors for every protein in the input proteome are computed with the iFeature library;
  3. Subcellular location module: Random Forest-based model (scikit-learn) that predicts the probability of a protein being 'outer' (training dataset from UniProt);
  4. Adhesin and adhesin-like predictor module: feed-forward neural network (TensorFlow/Keras), trained on a dataset obtained from the literature and InterPro;
  5. Autoimmunity and allergenicity module: alignment-based method (NCBI BLAST+). Input proteins are aligned against the human and mouse proteomes to measure their level of similarity. In addition, the input proteome is screened against a list of autoimmune and allergenic peptides;
  6. Transmembrane topology prediction module: TMHMM (hidden Markov model). Predicts, for each amino acid, the probability of being 'i' (inside), 'o' (outside) or 'M' (membrane). For more information, please visit the tmhmm repository;
  7. Razor module: virtual scissors that cut out the outer segments of a protein and rejoin them. This is useful to retrieve the outer segments of proteins with many transmembrane domains and to discard the domains themselves;
  8. Conservation module: if a second proteome (proteome2) is also supplied, the two proteomes are compared;
  9. Selection and scoring module: external proteins are saved in the "vaccine_candidates" file, while internal proteins are saved in the "discarded_proteins" file. Internal proteins with P_ad > padlimit are nevertheless retained and saved in the candidates file;
  10. Linear epitope predictor module: epitopepredict library. For every protein, the module predicts its linear epitopes and promiscuous epitopes, considering only the 'supertype alleles' defined by Sette et al.;
  11. Output generation module: a CSV file is generated containing every protein instance with all predictions as columns, such as score, p_ad, p_loc_out, transmembrane domains, epitopes, length, instability index, etc.
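The razor and selection steps above can be sketched in a few lines. This is a simplified illustration, not the pipeline's actual code: the function names `razor` and `is_candidate` are hypothetical, the topology is assumed to be one 'i'/'o'/'M' label per residue, the use of `>=` for the loop length and for the localization threshold is an assumption, and the default cutoffs are taken from the `-rl`, `-locl` and `-adl` options described in the Usage section.

```python
import re


def razor(sequence: str, topology: str, min_loop_length: int = 9) -> str:
    """Join the outer ('o') segments that are at least min_loop_length long."""
    if len(sequence) != len(topology):
        raise ValueError("sequence and topology must have the same length")
    pieces = []
    for match in re.finditer(r"o+", topology):
        # Keep only outer loops long enough to be worth retaining (assumed >=).
        if match.end() - match.start() >= min_loop_length:
            pieces.append(sequence[match.start():match.end()])
    return "".join(pieces)


def is_candidate(p_loc_out: float, p_ad: float,
                 loclimit: float = 0.60, adhlimit: float = 0.80) -> bool:
    """Keep external proteins, plus internal ones with a high adhesin probability."""
    is_external = p_loc_out >= loclimit  # assumed definition of 'external'
    return is_external or p_ad > adhlimit


# Made-up protein: two transmembrane helices separated by a 10-residue outer loop.
sequence = "AAAAA" "LLLLL" "DEFGHKNPQR" "VVVVV" "STY" "CC"
topology = "i" * 5 + "M" * 5 + "o" * 10 + "M" * 5 + "o" * 3 + "i" * 2

print(razor(sequence, topology))
print(is_candidate(p_loc_out=0.20, p_ad=0.90))
```

In this example the 3-residue outer stretch is dropped because it is shorter than the minimum loop length, and the second call shows an internal protein being retained because of its adhesin probability.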

🔥🔥🔥Installation section🔥🔥🔥

Instructions for stand-alone usage with Docker and Docker Hub (preferred method)

eNERVE can be run as a stand-alone version with Docker and Docker Compose on Linux systems. This method avoids dependency-related issues.

  1. Install Docker, following its installation instructions and the post-installation procedure;

  2. Open the terminal and type:

sudo docker pull francescopatane/enerve:v1.0

  3. Create a directory (e.g. on the Desktop) called 'output';

  4. Put your input FASTA files in the newly created directory;

  5. Run the Docker image, mounting that directory as a volume so that input from the local machine and output from the container are shared:

sudo docker run --rm -it -v /path/to/output_directory:/workdir francescopatane/enerve:v1.0

  6. Run the eNERVE pipeline with:

python3 nerve.py -wd /workdir -p1 [filename.fasta] [other options]

  7. At the end of the computation, you will find the output files in the 'output' directory on your local machine (in this example, on the Desktop).

🏠 Instructions for local installation (tested with Python 3.10.6 and 3.7.14 on Linux/Ubuntu/Unix environments). This method is unstable, so proper execution is not guaranteed.

  1. Open the terminal and move to the location where you want to save eNERVE:
cd /home/ubuntu/Desktop/
  2. Clone the repository on your machine:
git clone https://github.com/francescopatane96/eNERVE.git
  3. Move to the directory:
cd eNERVE
  4. Clone the DeepFRI repository:
git clone https://github.com/francescopatane96/DeepFRI.git
  5. Download the pre-trained models from flatironinstitute, then uncompress the tar.gz file into the DeepFRI directory:
tar xvzf newest_trained_models.tar.gz -C ./DeepFRI
  6. Clone the iFeature repository:
git clone https://github.com/francescopatane96/iFeature.git
  7. Install NCBI BLAST+:
sudo apt-get install ncbi-blast+
  8. Finally, install the remaining dependencies:
pip install -r requirements.txt

🤖 Instructions for creating a virtual environment with Python venv (Linux/Unix/macOS). This method is unstable, so proper execution is not guaranteed.

  1. Open a terminal, or a terminal inside an IDE (PyCharm or Visual Studio Code);
  2. Move to the destination folder:
cd /path/destination/

and clone the repository:

git clone https://github.com/francescopatane96/eNERVE.git
  3. Move to the root directory:
cd eNERVE
  4. Install the python3-venv package by typing:
sudo apt install python3.10-venv
  5. Create a virtual environment with the Python venv module (to avoid dependency conflicts):
python3 -m venv enerve
  6. Activate your new virtual environment:
source enerve/bin/activate

After activation, the terminal shows the virtual environment name in parentheses, e.g. (enerve);

  7. Now install the dependencies needed for the pipeline (from point 4 of the previous section onward).

🔴 We recommend using conda for creating a virtual environment. This method is unstable, so proper execution is not guaranteed.

  1. From the terminal, type:
conda create --name enerve python=3.10
  2. Clone the repository;

  3. Activate the environment:

conda activate enerve
  4. Install the dependencies as in the previous sections.

🖥️🖥️🖥️Usage Section🖥️🖥️🖥️

🦮 Usage

usage: nerve.py [-h] [-locl] [-adl] [-a] [-ai] [-tp] [-ev] [-ml]
                [-mm] [-m] -p1 [-p2] [-rz] [-ig] [-rl] [-s]
                [-ss] [-tdl] [-ang] [-wd] [-nd] [-id] [-dfd]
                [-ep] [-m1l] [-m2l] [-m1ovr] [-m2ovr] [-prt]
                
                where:
                -h (help), [];
                -locl (loclimit, localization prediction threshold for the outer class), [float, default=0.60];
                -adl (adhlimit, retrieve internal proteins whose adhesin probability is > adl), [float, default=0.80];
                -a (Protein functional annotation with DeepFRI), [True, False, default=True];
                -ai (human autoimmunity and allergenicity module), [True, False, default=True];
                -tp (topology), [tmhmm];
                -ev (e-value for blastp), [float, default=1e-10];
                -ml (minlength, minimum length required for shared peptides to be extracted in the comparison analysis versus human and/or mouse), [int, default=9];
                -mm (mismatch, maximal number of non-compatible substitutions allowed in shared peptide alignment windows of minlength size in the immunity module), [int, default=1];
                -m (mouse autoimmunity and allergenicity module), [True, False, default=True];
                -p1 (proteome 1 fasta filename or path), [filename.fasta] --> 🔴required🔴;
                -p2 (proteome 2 fasta filename or path), [filename.fasta];
                -rz (razor module), [True, False, default=True];
                -ig (antigenlimit, cutoff value for antigen module), [float, default=0.80] --> 🔵Experimental/Beta🔵;
                -rl (min loop length considered in razor module), [int, default=9];
                -s (selection module), [True, False, default=True];
                -ss (substitution, maximal number of compatible substitutions allowed in shared peptides alignment windows of minlength size in immunity module), [int, default=3];
                -tdl (transmembrane doms limit), [int, default=0] --> 🔵For whole protein vaccines, use a number != 0. For epitope-based predictions, use 0🔵;
                -ang (antigen module), [True, False, default=False] --> 🔵Experimental/Beta🔵;
                -wd [path/to/workdir] --> 🟠recommended🟠;
                -nd [path/to/NERVEdir] --> 🟠recommended🟠;
                -id (iFeature directory), [path/to/ifeature_dir, default=./iFeature];
                -dfd (DeepFri directory), [path/to/deepfri_dir, default=./DeepFRI];
                -ep (epitope prediction module), [True, False, default=True];
                -m1l (mhc1 ligands length), [9,10,11, default=9];
                -m2l (mhc2 ligands length), [9,11,13,15, default=11];
                -m1ovr (mhc1 ligands max overlap), [1,2, default=1];
                -m2ovr (mhc2 ligands max overlap), [1,2, default=1];
                -prt (epitope binders percentile), [float, default=0.80]

⚠️ Remember that the required and essential parameters are [-wd], [-p1] and [-nd] when using a local installation. The dockerized version needs only [-wd] in addition to the required [-p1]. By default, every module is active and will be run. To customize the run and deactivate single modules, pass the module's parameter followed by False (e.g. -a False). ⚠️
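When the pipeline is launched from a script, the command line can be assembled programmatically. The sketch below is a hypothetical helper (`build_command` is not part of eNERVE) that turns keyword options into the flags listed above. The script path `code/Nerve.py` is the one used in the Examples section and is an assumption for other setups.

```python
import shlex


def build_command(proteome1, script="code/Nerve.py", **options):
    """Build the argument list for a pipeline run from keyword options."""
    command = ["python3", script, "-p1", proteome1]
    for flag, value in options.items():
        # Each keyword becomes '-flag value', e.g. a=False -> '-a False'.
        command.extend([f"-{flag}", str(value)])
    return command


command = build_command(
    "proteome.fasta",
    wd="/workdir",   # working directory
    a=False,         # switch off the DeepFRI annotation module
    m1l=10,          # MHC-I ligand length
    m2l=15,          # MHC-II ligand length
)

# Quote each part so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed directly to `subprocess.run(command, check=True)` to start the analysis.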


⛑️ Examples section

Run eNERVE in a dockerized environment with all modules

Run eNERVE in a local environment or in a virtual one (no docker) with all modules

Type on the command line:

python3 code/Nerve.py -p1 proteome.fasta

To run eNERVE without the annotation (-a) module, with an MHC-I ligand length of 10, an MHC-II ligand length of 15 and an epitope percentile of 0.80:

python3 code/Nerve.py -p1 proteome.fasta -a False -prt 0.80 -m1l 10 -m2l 15

If you want to create a working directory in which to save outputs, please specify [-wd] (e.g. -wd /workdir):

  1. Type:
cd eNERVE
  2. Create your working directory (where you will put the FASTA file to be analyzed and where outputs will be saved);

  3. In the terminal, type and run:

python3 code/Nerve.py -arg1 -arg2 [other options]

⚠️ REMEMBER TO SPECIFY -wd (WORKING_DIR) if desired, and -p1 (optionally also -p2). Place your FASTA inputs in the working directory if it already exists. ⚠️

  4. Output files will be saved in the working directory at the end of the computation.

🆘🆘🆘Help and Contacts section🆘🆘🆘

📲 References and contacts

eNERVE was developed by Francesco Patanè during his master's thesis and internship, under the supervision of Prof. Francesco Filippini, at the Synthetic Biology and Biotechnology unit (SynBio) of the University of Padova.

Special thanks to Francesco Costa, Nicola Gulmini and Andrea Conte for their help and collaboration in this project.

This pipeline also relies on packages and libraries created by others (iFeature, epitopepredict, tmhmm, ncbi-blast+, TensorFlow and many others). Thanks to dansondergaard, dmnfarrell and superzchen for their useful tools, used here to predict transmembrane protein topology and linear epitopes, and thanks to the whole open source community (in particular, the machine learning one).

Have you encountered any problems installing or using the pipeline, or do you have any suggestions for improving eNERVE? Please contact me at the following addresses:

francesco.patane@live.it
francesco.patane.1@studenti.unipd.it

or open an issue

1. Vivona S, Bernante F, Filippini F. NERVE: new enhanced reverse vaccinology environment. BMC Biotechnol. 2006 Jul 18;6:35. doi: 10.1186/1472-6750-6-35. PMID: 16848907; PMCID: PMC1570458.

License:GNU General Public License v3.0

