plmc infers undirected graphical models to describe coevolution and covariation in families of biological sequences. With a multiple sequence alignment as an input, plmc can quantify inferred coupling strengths between all pairs of positions (couplingsfile output) or infer a generative model of the sequences for predicting the effects of mutations or designing new sequences (paramfile output).
plmc [options] alignmentfile
plmc -c couplingsfile alignmentfile
plmc -o paramfile -c couplingsfile alignmentfile
plmc [-h | --help]
Required input:
alignmentfile Multiple sequence alignment in FASTA format
Options, output:
-c --couplings couplingsfile Save coupling scores to file (text)
-o --output paramfile Save estimated parameters to file (binary)
Options, alignment processing:
-s --scale <value> Sequence weights: neighborhood weight [s > 0]
-t --theta <value> Sequence weights: neighborhood divergence [0 < t < 1]
Options, Maximum a posteriori estimation (L-BFGS, default):
-lh --lambdah <value> Set L2 lambda for fields (h_i)
-le --lambdae <value> Set L2 lambda for couplings (e_ij)
-lg --lambdag <value> Set group L1 lambda for couplings (e_ij)
Options, general:
--fast Fast weights and stochastic gradient descent
-a --alphabet alphabet Alternative alphabet (default 21: -ACDEFGHIKLMNPQRSTVWY)
-f --focus identifier Select only uppercase, non-gapped sites from a focus sequence
-g --gapignore Exclude first alphabet character from potential calculations
-m --maxiter Maximum number of iterations
-n --ncores [<number>|max] Maximum number of threads to use in OpenMP
-h --help Usage
plmc requires no external libraries, but can optionally be accelerated with OpenMP for multicore parallelism.
Multicore. To compile with gcc
and OpenMP:
make all-openmp
On macOS, OpenMP requires that an external version of GCC be installed rather than clang
. Precompiled binaries of GCC are available here or can be downloaded through package managers like homebrew or macports.
Single core, Linux. To compile with gcc
:
make all
Single core, Mac OS X. To compile with clang
:
make all-mac
Single precision. All of the above targets compile to double precision (64 bit), but reducing the precision to single (32 bit) increases speed and decreases memory requirements by approximately a factor of two. The fastest compile settings are then:
make all-openmp32
Coupling scores. The couplingsfile
is a flat text file containing scores quantifying the inferred strength of the coupling between every pair of positions. It has 6 columns: RES_I FOCUS_AI RES_J FOCUS_AJ 0 SCORE
, where SCORE
is the coupling score between positions RES_I
and RES_J
, FOCUS_AI
and FOCUS_AJ
are the letters in the focus sequence (optional, -
if no focus), 0
is a placeholder. The SCORE
values are APC-corrected Frobenius norm scores, but alternative scores can be computed from the raw parameter values.
Parameter estimates. The optional paramfile
specified with -o
, will store all inferred model parameters in binary. These can get large, as for proteins the model explicitly parameterizes all possible pairs of amino acids at all possible pairs of postions, which is about 106-108 numbers for families of lengths ~70-700. The MATLAB script scripts/read_params.m
unpacks this binary file into model parameters as well as associated metadata, such as inferred sequence weights.
Protein alignments. The example directory includes an alignment of the protein dihdyrofolate reductase (DHFR). To infer a model for this family, we can type the following in the base directory:
bin/plmc -o example/protein/DHFR.params -le 16.0 -lh 0.01 -m 100 -g -f DYR_ECOLI example/protein/DHFR.a2m
The numeric options set a strong L2 regularization for the couplings, λe = 16.0, a weak L2 regularization for the sites, λh = 0.01, and the maximum number of iterations at 100. The focus -f
option tells plmc to only model columns that are upper-case in the E. coli sequence DYR_ECOLI (NOTE: for focus mode, the alignment should only contain columns that are coding in the focus sequence, otherwise the offsets for the output sequence will be incorrect.) The -g
gap-ignoring option ignores gaps by modeling only the coding portions of each sequence. To read the binary paramfile DHFR.eij
and visualize the couplings, we can type the following in MATLAB from the scripts
directory:
params = read_params('../example/protein/DHFR.params');
plot_coupling_scores(params)
This computes and plots the same (APC-corrected) coupling strengths that would be output to couplingfile, revealing many strongly coupled pairs of positions in the long-term evolution of DHFR:
Each coupling strength summarizes a larger set of coupling parameters, since the model explicitly parameterizes each possible combination of amino acids at each pair of positions. To visualize these sequence-level parameters, we can use a JavaScript tool EVzoom that is designed to visualize plmc models. For efficiency, we export only the strongly coupled pairs (culled by an EM-based outlier-detection method) in a lightweight JSON format with
export_couplings_json(read_params('../example/protein/DHFR.params'), '../example/protein/DHFR.json')
EVzoom makes it possible to navigate the large number of sequence-level parameters present in undirected models of sequence families:
RNA alignments. An example RNA alignment is included for the SAM riboswitch. To infer the couplings with an RNA alphabet (.ACGU) type the following in the base directory:
bin/plmc -c example/RNA/RF00162.EC -o example/RNA/RF00162.params -a .ACGU -le 20.0 -lh 0.01 -m 50 example/RNA/RF00162.fasta
To plot the aggregated coupling scores type the following in MATLAB from the scripts
directory:
params = read_params('../example/RNA/RF00162.params');
plot_coupling_scores(params)
The default alphabet (-ACDEFGHIKLMNPQRSTVWY) can be overridden by the option -a ALPHABET
. All uppercase letters and non-letter ASCII characters are acceptable. Lowercase letters in the alignment file will be mapped to the corresponding uppercase letter, except for in focus mode -f
, in which columns that are lower case in the focus sequence will be discarded.
Physical systems. As an example, simulated draws from a 3-state, 1-dimensional Potts model are provided in the examples folder and encoded by the characters _, *, and ^. The following command would estimate the parameters by running to convergence with λe = 1.0, λh = 1.0 and sequence reweighting disabled:
bin/plmc -c example/potts/potts3.txt -a _*^ -t -1 -le 1.0 -lh 1.0 example/potts/potts3.a2m
A 1D Potts model will only have interactions between i -> i + 1, which should be evident in the coupling summary scores output to example/potts/potts3.txt
The use of pseudolikelihood approximations for approximate inference of spin models for sequence families was established in the compsci/bioinformatics literature by
Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S. I., & Langmead, C. J. (2011). Learning generative models for protein fold families. Proteins: Structure, Function, and Bioinformatics, 79(4), 1061-1078.
and in the biophysics literature by
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E, 87(1), 012707.
plmc
implements a joint optimization version of inference described in these and subsequent works (i.e. the 'symmetric' pseudolikelihood). If you'd like to use plmc
in your own work, please cite the following paper. Also, please let us know if you have any comments or questions!
Hopf, T. A., Ingraham, J. B., Poelwijk, F. J., Schärfe, C. P., Springer, M., Sander, C., & Marks, D. S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35(2), 128-135.
plmc was written by John Ingraham in Debora Marks' lab at Harvard Medical School
This code uses a C implementation of L-BFGS by Naoaki Okazaki and Nishimura and Matsumoto's Mersenne Twister in C, which are included in this repo.