PhyloAcc

This is a software to detect the shift pattern of DNA substitution rate of a genomic region and identify genomic elements accelerated in some specific species from a set of conserved elements. The underlying model assumes a latent discrete state (Z) of relative substitution rate of each branch on the phylogeny which can be neutral, conserved and accelerated. For each genomic element, it will start from a neutral rate at the common ancestor of the phylogeny, transit to conserved state and then reach an accelerated state in some lineages. Our method utilizes adaptive collapsed Gibbs sampling to obtain the pattern of substitution rate shifts (posterior distribution of Z) as well as relative substitution rates of conserved and accelerated state. To identify DNA elements with accelerating on specific branches, we compared marginal likelihood under three models: null model (M0) where all branches are neutral or conserved; accelerated model (M1) in which branches leading to target species are accelerated; and full model (M2), with no constraints on latent states. Then we used two Bayes factors: between M1 and M0 (BF1) and between M1 and M2 (BF2) as criteria to identify DNA elements accelerated exclusively in target lineages.

Getting Started

Some preliminary inputs which might be generated from other software are required: (1) a Phylogeny in mod format. The file can be output from phyloFit in PHAST package, with the transition rate matrix of bases and branch lengths. Our model assumes that those are the substitution rates under neutral evolution and will estimate conserved and accelerated rate relative to the neutral rate. In our study, we used genome-wide four-fold degenerate sites to estimate the rate matrix and branch lengths. (2) a multiple alignment file concatenating sequences of all input conserved elements in FASTA format, and (3) a bed file with the position of each individual element in the coordinate of concatenated alignment file (0-based).

We also need a parameter file, which contains the pathes for input files and output directory, information of species and parameters for MCMC. Please read README_PARAMETER.md for more detail.

After running the algorithm, our method will output the posterior of latent state (Z) for each branch (indicating neutral, conserved or accelerated) for each element under each model in the files "prefix_rate_postZ_[0-2].txt" and the marginal loglikelihoods for each element are in the file "prefix_elem_lik.txt". The format of output files are explained in README_OUTPUT.md.

Prerequisites

GCC: You might need latest GCC (verion 7) supporting openmp. If you are using Mac, you could use brew to (re)install gcc.

brew update ## update the formulae and Homebrew itself, if your brew is out-dated
brew install gcc

GNU Scientific Library (GSL): a numerical library for C and C++. PhyloAcc has been tested with version 2.4 of GSL.
Armadillo: C++ linear algebra library. You could install Armadillo following the steps on its website. For Linux, before installing Armadillo, you need to install CMAKE, LAPACK, BLAS (OPENBLAS) and ATLAS, along with the corresponding development/header files. PhyloAcc has been tested with verison 8.100.1. For Mac, you could use brew (tested and Recommended):

brew install homebrew/science/armadillo

Open MP: for parallel computing.
To use the R functions to plot, please install Rstudio with current version of R (>=3.3.2) and install seqinr, ggplot2, reshape2, ape packages.

Build on Linux or Mac

Run:

make

in PhyloAcc directory to generate the 'PhyloAcc' excutable.

Installation

Run:

sudo make install

to install in default path /usr/local/bin, and

sudo make uninstall

to uninstall.

For our extended version modeling GC-based gene conversion (gBGC) effect, please go to V2_GBGC/ and make & install under that directory.

Usage

Try this in PhyloAcc directory as a test, which will run simulated elements on ratite phylogeny:

mkdir Simulation_ratite/result_tmp
./PhyloAcc Simulation_ratite/param2-6-test.txt

or this after installation:

PhyloAcc Simulation_ratite/param2-6-test.txt

For testing propose, it will only run the first 10 elements of simulated data from Simulation_ratite/simu_500_200_diffr_2-6.* and output to Simulation/result_tmp/. To run the all elements, which will generate results in Simulation_ratite/result/, you could run:

./PhyloAcc Simulation_ratite/param2-6.txt

Again, it will output to Simulation/result_tmp/. To run your own data, please change the paths in your parameter file.

To run the model including gBGC,

mkdir Simulation/result_tmp
./PhyloAcc_gBGC paramGC-0.txt

under V2_GBGC/. It will output to V2_GBGC/Simulation/result_tmp/ (which will be the same as in V2_GBGC/Simulation/result/).

There are several R scripts available in R/ which read the output from PhyloAcc and generate plots in the main paper (e.g. "scaled" phylogenetic tree and sequence alignment for one element). Please read plot.html and run plot.Rmd for detail. R scripts to generate simulated DNA sequences are also in R/, please see Simulation.md for more detail.

Data

The species names and phylogenetic trees used in the PhyloAcc manuscript are in Data/. The simulated sequences and results are in Simulation_mammal/ and Simulation_ratite/.

tsackton / PhyloAcc