liuzi919 / DMPfold

Extending genome-scale de novo protein modelling coverage using iterative deep learning-based prediction of structural constraints

DMPfold

Extending genome-scale de novo protein modelling coverage using iterative deep learning-based prediction of structural constraints

See the pre-print for more.

Installation

DMPfold depends on a number of external programs, so installation can be a little fiddly, but we have aimed to make it as straightforward as possible. These instructions should work on a Linux system:

  • Make sure you have Python 3 with PyTorch 0.4 or later, NumPy and SciPy installed. GPU support for PyTorch is optional: it won't speed things up much because running the network isn't a time-consuming step. DMPfold has been tested on Python 3.6 and 3.7.
  • Install HH-suite and the uniclust30 database, unless you are getting your alignments from elsewhere.
  • Install FreeContact.
  • Install CCMpred.
  • Install MODELLER, which requires a license key. Only the Python package is required so this can be installed with conda install modeller -c salilab.
  • Install CNS. Change the nrestraints = 20000 line in cns_solve_1.3/modules/nmr/readdata to a larger number, e.g. nrestraints = 30000, to allow DMPfold to run on larger structures.
  • Download and patch the required CNS scripts by changing into the cnsfiles directory and running sh installscripts.sh.
  • Install the legacy BLAST software, in particular formatdb, blastpgp and makemat. We may update this to BLAST+ in the future.
  • Other software is pre-compiled and included here (PSIPRED, PSICOV and various utility scripts, with the code in src). These binaries should run as they are, but may need to be recompiled using the makefile if issues arise. Some other standard programs, such as the csh shell, are assumed to be installed.
  • Change lines 10/13-15/18/21/24 in seq2maps.csh, lines 11/14/17/20 in aln2maps.csh, lines 4/7 in bin/runpsipredandsolvwithdb and lines 10/13 in run_dmpfold.sh to point to the installed locations of the above software. You can also set the number of cores to use in seq2maps.csh and aln2maps.csh.
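
The CNS change described in the list above can also be scripted. The following is a minimal sketch, not part of DMPfold: the function names raise_nrestraints and patch_file are hypothetical, and it assumes the setting appears in the file as text of the form nrestraints = 20000.

```python
import re


def raise_nrestraints(text, new_limit=30000):
    """Return text with every 'nrestraints = N' raised to at least new_limit.

    Assumption: the setting is written as 'nrestraints = <integer>'.
    Values already at or above new_limit are left unchanged.
    """
    pattern = re.compile(r"(nrestraints\s*=\s*)(\d+)")

    def bump(match):
        current = int(match.group(2))
        return match.group(1) + str(max(current, new_limit))

    return pattern.sub(bump, text)


def patch_file(path, new_limit=30000):
    """Rewrite the file at path in place with the raised limit."""
    with open(path) as handle:
        original = handle.read()
    with open(path, "w") as handle:
        handle.write(raise_nrestraints(original, new_limit))
```

For example, patch_file("cns_solve_1.3/modules/nmr/readdata") would apply the change to the file named above. Keep a backup copy of the file before patching it.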

You may need to set ulimit -s unlimited to get seq2maps.csh to work. Check the continuous integration setup script and logs for additional tips and a step-by-step installation on Ubuntu.
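
Before a first run it can help to check the environment. The sketch below is an illustration only: the executable names in TOOLS are assumptions about how the programs are named on your system, and because the DMPfold scripts use the paths you configured rather than PATH, a tool reported as missing here may still be installed elsewhere.

```python
import importlib.util
import resource
import shutil

# Python packages named in the installation notes.
PYTHON_MODULES = ["torch", "numpy", "scipy"]

# Assumed executable names; adjust them to match your installation.
TOOLS = ["csh", "hhblits", "formatdb", "blastpgp", "makemat"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_tools(names):
    """Return the executable names that are not on PATH."""
    return [n for n in names if shutil.which(n) is None]


def stack_is_unlimited():
    """Report whether the soft stack limit matches 'ulimit -s unlimited'."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return soft == resource.RLIM_INFINITY


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(PYTHON_MODULES))
    print("Tools not on PATH:", missing_tools(TOOLS))
    print("Stack limit unlimited:", stack_is_unlimited())
```

The stack check only reports the limit of the Python process itself, so run it from the same shell session in which you plan to run seq2maps.csh.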

Usage

Here we give an example of running DMPfold on CASP12 target T0864. First you need to generate the .21c and .map files. This can be done in one of two ways:

  • From a single sequence: csh seq2maps.csh T0864.fasta to run HHblits, PSIPRED, SOLVPRED, PSICOV, FreeContact, CCMpred and alnstats.
  • From an alignment: csh aln2maps.csh T0864.aln to run PSIPRED, SOLVPRED, PSICOV, FreeContact, CCMpred and alnstats. The file T0864.aln has one sequence per line with the ungapped target sequence as the first line.
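
If your alignment comes from another tool, it has to be converted to the one-sequence-per-line format described above. The following is a minimal sketch under stated assumptions: the input is an A3M-style alignment in which lowercase letters mark insertions relative to the target, the target is the first record, and the function name a3m_to_aln is hypothetical rather than part of DMPfold.

```python
def a3m_to_aln(a3m_text):
    """Convert A3M-style text to a list of aligned sequences, target first.

    Header lines (starting with '>') are dropped and lowercase insertion
    characters are removed, so every sequence ends up with the same
    length as the target sequence.
    """
    sequences = []
    current = None
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if current is not None:
                sequences.append(current)
            current = ""
        elif current is not None:
            # Sequences may span several lines; drop lowercase insertions.
            current += "".join(ch for ch in line if not ch.islower())
    if current is not None:
        sequences.append(current)

    if not sequences:
        raise ValueError("no sequences found")
    target = sequences[0]
    if "-" in target:
        raise ValueError("target sequence (first record) must be ungapped")
    if any(len(seq) != len(target) for seq in sequences):
        raise ValueError("aligned sequences differ in length from the target")
    return sequences
```

Writing "\n".join(a3m_to_aln(text)) followed by a newline to T0864.aln gives a file in the layout that aln2maps.csh expects according to the description above.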

Then run sh run_dmpfold.sh T0864.fasta T0864.21c T0864.map ./T0864 to run DMPfold, where the last argument is an output directory that will be created. The final model is written as final_1.pdb. Further structures, final_2.pdb to final_5.pdb, are generated only if they are significantly different. To change the run length, append the number of iterations and the number of models per iteration: sh run_dmpfold.sh T0864.fasta T0864.21c T0864.map ./T0864 5 20 runs 5 iterations with 20 models per iteration (the defaults are 3 and 50).
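
When running many targets it can be convenient to build this command and collect the results from a script. The sketch below is illustrative only: the helper names dmpfold_command and final_models are hypothetical, and it assumes the iteration and model counts are always passed together, as in the example above.

```python
import os


def dmpfold_command(fasta, c21, map_file, out_dir, iterations=None, models=None):
    """Build the argument list for run_dmpfold.sh.

    With iterations and models left as None, the script's own defaults
    (3 iterations, 50 models per iteration) apply.
    """
    cmd = ["sh", "run_dmpfold.sh", fasta, c21, map_file, out_dir]
    if iterations is not None or models is not None:
        # Assumption: the two optional arguments are positional and
        # are given together, as in the README example.
        if iterations is None or models is None:
            raise ValueError("give both iterations and models, or neither")
        cmd += [str(iterations), str(models)]
    return cmd


def final_models(out_dir):
    """Return the paths of final_1.pdb to final_5.pdb that exist in out_dir."""
    candidates = [os.path.join(out_dir, "final_%d.pdb" % i) for i in range(1, 6)]
    return [path for path in candidates if os.path.isfile(path)]
```

The list returned by dmpfold_command can be passed to subprocess.run(cmd, check=True) from the DMPfold directory, after which final_models("./T0864") lists whichever final structures were produced.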

Data

Models for the 1,475 Pfam families modelled in the paper can be downloaded here. Additional models for the remainder of the dark Pfam families can be downloaded here (some were not modelled due to small sequence alignments). Alignments for the Pfam families without available templates can be downloaded here. The format is one sequence per line with the ungapped target sequence as the first line.

The directory pfam in this repository contains text files with the lists from Figure 4A of the paper, target sequences for modelled families and data for modelled families (sequence length, effective sequence count, distogram satisfaction scores, estimated TM-score and probability TM-score >= 0.5).

License:GNU General Public License v3.0

