PREFMD2: Protein REFinement via Molecular Dynamics (version 2)

Improvement of the global structure of protein 3D models via molecular dynamics (MD) and structural averaging (as seen in CASP14).

The PREFMD2 pipeline has two modes:

Single initial model mode: the default mode. MD sampling is conducted starting from an initial 3D model supplied by the user. This mode is the one implemented in the publicly available Feig lab webserver: https://feig.bch.msu.edu/web/services/prefmd/.
Multiple initial models mode: MD sampling is conducted from a user-supplied initial 3D model and from four additional conformations. These conformations are obtained by hybridizing (through the Iterative hybridize protocol of Rosetta) the user-supplied model with multiple homology models of the same protein (generated by MODELLER). This mode is computationally more expensive than the default one, but usually produces more accurate results when the target protein has templates available in the PDB.

For a detailed description and comparison of the two modes see [1].

For the Feig lab's refinement protocol used in CASP13, please see: https://github.com/feiglab/prefmd

1. Installation prerequisites and dependencies

The PREFMD2 pipeline runs on Linux systems. To use it on your machine, you need Python 3.6+ and a series of dependencies. Note: the multiple initial models mode requires extra dependencies that are not needed for the single initial model mode.

1.1 Python dependencies

Make sure to install these Python libraries.

OpenMM
- Website: http://openmm.org/
- Note: by default PREFMD2 will use the CUDA platform. You can change the OpenMM platform that PREFMD2 will use by setting the $PREFMD2_OPENMM_PLATFORM environmental variable.
- Role in the pipeline: running MD simulations.
mdtraj
- Website: https://github.com/mdtraj/mdtraj
- Role in the pipeline: parsing and extracting data from MD trajectory files.
scikit-learn
- Website: https://scikit-learn.org/stable/
- Role in the pipeline: clustering of MD snapshots.
MODELLER (optional, used only in multiple initial models mode)
- Website: https://salilab.org/modeller/
- Role in the pipeline: performing template-based 3D modeling in the multiple initial models mode.
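The OpenMM note above says that the platform defaults to CUDA and can be changed through $PREFMD2_OPENMM_PLATFORM. The following is a minimal sketch of how such a lookup could work, not the actual PREFMD2 code. The function names are hypothetical, and treating an empty variable like an unset one is an assumption. OpenMM itself ships platforms named Reference, CPU, CUDA and OpenCL.

```python
import os

DEFAULT_PLATFORM = "CUDA"  # default stated in this README


def resolve_platform_name(environ=None):
    """Return the OpenMM platform name, falling back to CUDA.

    Hypothetical helper: PREFMD2's own lookup may differ in detail.
    """
    environ = os.environ if environ is None else environ
    # Assumption: an empty value is treated the same as an unset variable.
    return environ.get("PREFMD2_OPENMM_PLATFORM") or DEFAULT_PLATFORM


def get_openmm_platform(name):
    """Look up the OpenMM platform object (requires OpenMM to be installed)."""
    # Older OpenMM releases expose this as simtk.openmm.Platform instead.
    from openmm import Platform
    return Platform.getPlatformByName(name)
```

For example, exporting PREFMD2_OPENMM_PLATFORM=CPU before launching a job would make resolve_platform_name() return "CPU" in this sketch.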

1.2 Required third-party programs

Make sure to install these dependencies and to set the required environmental variables (as explained in the Configuration sections). The .bashrc files in the default directory of this repository give an example of what your environmental variables should look like.

1.2.a Basic requirements

CHARMM
- Obtain from: http://charmm.chemistry.harvard.edu
- Configuration: once you have installed CHARMM, set the following environmental variables:
  - CHARMMEXEC: path to the executable file of CHARMM.
- Role in the pipeline: it is a dependency for locPREFMD (see below) and is used to prepare input files for the MD runs.
MMTSB
- Obtain from: https://github.com/mmtsb/toolset
- Configuration: once you have compiled the toolset, make sure that you have set the following environmental variables (you should already have set them during the MMTSB installation process, but they are repeated here as a double check):
  - MMTSBDIR: top directory of the locally-installed Git repository.
  - CHARMMDATA: path to $MMTSBDIR/data/charmm.
  - Add $MMTSBDIR/bin and $MMTSBDIR/perl to your $PATH.
- Role in the pipeline: contains scripts necessary to manipulate PDB files and it is a dependency for locPREFMD (see below).
locPREFMD
- Obtain from: https://github.com/feiglab/locprefmd
- Configuration: follow the installation instructions in the GitHub link and make sure that you have set the following environmental variables (note that you should already have set them during the locPREFMD installation):
  - LOCPREFMD: path to the locPREFMD Git repository after checking out.
  - MOLPROBITY: path to the top of the MolProbity tree.
- Role in the pipeline: used to perform initial stereochemical refinement on the input model and on the averaged models.
mdconv
- Obtain from: https://github.com/feiglab/mdconv
- Configuration: download the source code, compile and:
  - Add the directory with the mdconv executable to your $PATH.
- Role in the pipeline: modifies the trajectory files generated in the production MD runs.
TMscore
- Obtain from: https://zhanglab.ccmb.med.umich.edu/TM-score/
- Configuration: download the source code, compile and:
  - Add the directory with the TMscore executable to your $PATH.
- Role in the pipeline: in the scoring phase, it compares the structures extracted from the MD trajectories to the initial model.
RWplus
- Obtain from: https://zhanglab.ccmb.med.umich.edu/RW/
- Configuration: download the calRWplus program. Then set the following environmental variable:
  - RWPLUS_HOME: path to the RWplus home directory (where the calRWplus executable is located).
- Role in the pipeline: scores the structures extracted from the MD trajectories in order to filter them before the averaging stage.
Scwrl4
- Obtain from: http://dunbrack.fccc.edu/SCWRL3.php/
- Configuration: download and install the Scwrl4 program. Then:
  - Add the directory with the scwrl4 executable to your $PATH.
- Role in the pipeline: repacks the side chains of the averaged models.
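Before launching a job, it can be handy to confirm that the variables and executables listed in this section are actually visible from your shell. The script below is a small sketch of such a check and is not part of PREFMD2. The function name is hypothetical, and the executable names are spelled as in this section, so adjust the case if your binaries are named differently.

```python
import os
import shutil

# Environmental variables named in section 1.2.a.
REQUIRED_VARIABLES = [
    "CHARMMEXEC", "MMTSBDIR", "CHARMMDATA",
    "LOCPREFMD", "MOLPROBITY", "RWPLUS_HOME",
]
# Executables that section 1.2.a asks you to put on $PATH.
REQUIRED_EXECUTABLES = ["mdconv", "TMscore", "scwrl4"]


def find_missing(environ=None, which=shutil.which):
    """Return (missing_variables, missing_executables) as two lists."""
    environ = os.environ if environ is None else environ
    missing_vars = [v for v in REQUIRED_VARIABLES if not environ.get(v)]
    missing_exes = [e for e in REQUIRED_EXECUTABLES if which(e) is None]
    return missing_vars, missing_exes


if __name__ == "__main__":
    variables, executables = find_missing()
    print("Unset variables:", ", ".join(variables) or "none")
    print("Executables not on PATH:", ", ".join(executables) or "none")
```

The check only tests that each variable is set and each program can be found; it does not verify that the paths point to working installations.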

1.2.b Optional dependencies (only used in multiple initial models mode)

HHsuite
- Obtain from: https://github.com/soedinglab/hh-suite
- Also make sure to obtain:
  - A Uniclust30 database (to be used by hhblits): https://uniclust.mmseqs.com/
  - A PDB70 database (to be used by hhsearch when scanning for templates): http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/
- Configuration: install the suite and the required databases, then define the following environmental variables:
  - HHSUITE_SEQ_DB: database path for the Uniclust30 database. This and the following variable ($HHSUITE_PDB_DB) should be the same paths that you supply as the -d argument when using the hhblits or hhsearch programs.
  - HHSUITE_PDB_DB: database path for the PDB70 database.
  - Add the directory with the HHsuite executables to your $PATH.
- Role in the pipeline: identifies templates for the input protein. The templates will be used to build homology models of the protein using MODELLER.
TMalign
- Obtain from: https://zhanglab.ccmb.med.umich.edu/TM-align/
- Configuration: download the source code, compile and:
  - Add the directory with the TMalign executable to your $PATH.
- Role in the pipeline: compares the initial model 3D structure with the templates identified by the HHsuite programs.
Rosetta software suite
- Obtain from: https://www.rosettacommons.org/software
- Configuration: once you have installed Rosetta, set the following environmental variables:
  - ROSETTA_HOME: path of the home directory of Rosetta (this is the directory where the demos, documentation, main and tools directories of the Rosetta suite are located).
  - ROSETTA_EXTENSION (optional): extension of the Rosetta binary files. If you do not specify it, PREFMD2 will assume that your Rosetta binaries have the linuxgccrelease extension. Depending on how you obtained the Rosetta binaries, you may need to change it. For example, if you are using pre-compiled binaries on Linux, you should set this to static.linuxgccrelease.
- Role in the pipeline: runs a modified version of the Iterative hybridize protocol in order to hybridize the initial user-supplied 3D model with the template-based models built by MODELLER.
GNU parallel
- Obtain from: https://www.gnu.org/software/parallel/
- Note: you can probably install this program using the package manager of your Linux distribution.
- Configuration: the directory where the parallel executable file is located must be in your $PATH.
- Role in the pipeline: used to parallelize the Iterative hybridize protocol of Rosetta.
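To make the effect of ROSETTA_EXTENSION concrete, here is a short sketch of how a binary file name could be derived from it. This is an illustration only: the helper is hypothetical, rosetta_scripts is used merely as an example of a Rosetta application name, and the way PREFMD2 actually assembles its Rosetta commands may differ.

```python
import os

DEFAULT_ROSETTA_EXTENSION = "linuxgccrelease"  # default stated in this README


def rosetta_binary_name(app, environ=None):
    """Return '<app>.<extension>' using ROSETTA_EXTENSION when it is set.

    Hypothetical helper for illustration; not taken from PREFMD2.
    """
    environ = os.environ if environ is None else environ
    # Assumption: an empty value is treated the same as an unset variable.
    extension = environ.get("ROSETTA_EXTENSION") or DEFAULT_ROSETTA_EXTENSION
    return f"{app}.{extension}"
```

With the variable unset, the sketch yields rosetta_scripts.linuxgccrelease; with ROSETTA_EXTENSION=static.linuxgccrelease it yields rosetta_scripts.static.linuxgccrelease, which matches the naming of pre-compiled Linux binaries mentioned above.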

2. Obtaining and configuring PREFMD2

2.1 Getting PREFMD2

Once you have installed the required PREFMD2 dependencies, clone the PREFMD2 GitHub repository on your system. Run:

git clone https://github.com/feiglab/prefmd2.git

Then set the following environmental variable:

PREFMD2_HOME: this should be the path of the PREFMD2 directory that you cloned on your system.

2.2 Preparing force field files

For the MD runs in the main sampling stage and the preceding equilibration, PREFMD2 uses a modified version of the CHARMM36m force field. The files for this force field are provided in this repository.

For the relaxation of averaged structures and model quality assessment steps, the original version of CHARMM36m is used. The files for this force field are NOT provided in this repository. In order to use PREFMD2, you will need to provide your own CHARMM36m force field files, which are available from the CHARMM distribution (in the toppar directory). Note that CHARMM provides protein force field files separately from water and ions. In order to use PREFMD2 you will need to specify the following three environmental variables:

$PREFMD2_FF_PARAMETER: path to the parameter file of the selected force field, for example $HOME/apps/charmm/toppar/par_all36_prot.prm (assuming that your CHARMM installation is in $HOME/apps/charmm).
$PREFMD2_FF_TOPOLOGY: path to the topology file of the selected force field, for example $HOME/apps/charmm/toppar/top_all36_prot.rtf.
$PREFMD2_FF_WATER_IONS: path to the water and ions parameter file of the selected force field, for example $HOME/apps/charmm/toppar/toppar_water_ions.str.

Although in principle any force field can be used in PREFMD2, we recommend the CHARMM36m force field.
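Since the three force field variables must point to files that are not shipped with this repository, a quick check before the first run can save a failed job. The following sketch is not part of PREFMD2; the function name and the wording of the messages are made up for illustration.

```python
import os

# The three variables described in section 2.2.
FF_VARIABLES = (
    "PREFMD2_FF_PARAMETER",
    "PREFMD2_FF_TOPOLOGY",
    "PREFMD2_FF_WATER_IONS",
)


def check_force_field_files(environ=None):
    """Return a dict mapping each problematic variable to a short message.

    An empty dict means all three variables are set and point to files.
    """
    environ = os.environ if environ is None else environ
    problems = {}
    for name in FF_VARIABLES:
        path = environ.get(name)
        if not path:
            problems[name] = "not set"
            continue
        # Expand ~ and $VARS in case the value was stored unexpanded.
        resolved = os.path.expandvars(os.path.expanduser(path))
        if not os.path.isfile(resolved):
            problems[name] = f"file not found: {path}"
    return problems
```

Calling check_force_field_files() with no arguments inspects the current environment. The sketch only confirms that the files exist, not that they contain CHARMM36m parameters.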

3. How to use PREFMD2

3.1 Basic usage

Prepare an input protein structure in PDB format. Then run:

python $PREFMD2_HOME/scripts/prefmd2.py -t my_refinement_job -i input.pdb

This will run the default single initial model mode. -t is the name of the refinement job (the prefmd2.py script will create a directory named my_refinement_job and write all its output files in it) and -i is the path of the PDB file of the 3D model that you want to refine. With the default options, a refinement job for a ~120 amino acid protein typically takes about 24 hours on a single GPU. Once a job is completed, prefmd2.py will output 5 final models [1]. They can be found in the final directory inside the output directory (in the example above, the my_refinement_job directory).
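If you prefer to launch jobs from a Python script (for example, to refine several models in a loop), the command above can be assembled programmatically. The helpers below are a sketch with hypothetical names; they only use the script path, the -t and -i options, and the final directory described in this section, and they assume the job directory is created in the chosen working directory.

```python
import os
import sys


def build_prefmd2_command(job_name, input_pdb, prefmd2_home=None, extra_args=()):
    """Return the argument list for a PREFMD2 run (hypothetical helper)."""
    home = prefmd2_home or os.environ["PREFMD2_HOME"]
    script = os.path.join(home, "scripts", "prefmd2.py")
    # Reuse the current interpreter instead of whatever 'python' is on PATH.
    return [sys.executable, script, "-t", job_name, "-i", input_pdb, *extra_args]


def final_models_dir(job_name, working_dir="."):
    """Return the directory expected to hold the final models of a job."""
    # Assumption: the job directory is created inside the working directory.
    return os.path.join(working_dir, job_name, "final")
```

The returned list can be passed to subprocess.run(command, check=True) from the standard library; extra_args is a place to append options such as those listed in section 3.2.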

3.2 List of options

-d/--dir: working directory. The pipeline will be executed here and the output directory will be written in it.
-v/--verbose: set verbose mode.
--cpus: number of CPUs to be used (default: 8).
--gpus: ids of the GPUs to use in the job (default: 0). Examples: 1 (only GPU 1 will be used), 0:1 (use GPUs 0 and 1), 0:1:3 (use GPUs 0, 1 and 3). When multiple MD runs can be run in parallel (e.g. the 5 MD production runs), each GPU will be used for an MD run. This option only takes effect if your OpenMM platform uses GPU acceleration.
--hybrid: perform the multiple initial models mode.
--extensive: use longer MD production runs.
--force: overwrite a previous output directory if needed.
--stage: name of the stage of the refinement pipeline to be run. By default it is 'all' (the whole pipeline will be executed).
-j/--json: file path of the JSON file in a PREFMD2 output directory. It must be supplied when resuming a previous job to execute a specific stage using the --stage argument.
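The colon-separated format accepted by --gpus can be illustrated with a short parser. This is a sketch under stated assumptions rather than the code used by prefmd2.py: both function names are hypothetical, and the round-robin assignment of MD runs to GPUs is only one plausible way to spread parallel runs over the selected devices.

```python
def parse_gpu_ids(spec="0"):
    """Turn a string such as '0:1:3' into a list of GPU ids, e.g. [0, 1, 3]."""
    gpu_ids = []
    for token in spec.split(":"):
        gpu_id = int(token)  # raises ValueError for non-numeric tokens
        if gpu_id < 0:
            raise ValueError(f"GPU ids must be non-negative, got {gpu_id}")
        if gpu_id not in gpu_ids:  # ignore repeated ids
            gpu_ids.append(gpu_id)
    return gpu_ids


def assign_runs_to_gpus(n_runs, gpu_ids):
    """Assign each of n_runs MD runs to a GPU in round-robin order.

    Assumption for illustration: PREFMD2's real scheduling may differ.
    """
    return [gpu_ids[i % len(gpu_ids)] for i in range(n_runs)]
```

Under this sketch, the 5 MD production runs with --gpus 0:1 would be spread over the two devices in alternation, so runs on the same GPU would have to execute one after another.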

4. Release log

1/21/2020: set up the repository.

5. References

[1] Heo L, Arbour CF, Janson G, Feig M. Improved Sampling Strategies for Protein Model Refinement Based on Molecular Dynamics Simulation. J Chem Theory Comput (2021) Feb 9. PMID: 33562962.

6. Contact

mfeiglab@gmail.com
