SMC++ is a program for estimating the size history of populations from whole genome sequence data.
Download a pre-compiled binary for the latest version of SMC++ from the releases page. Alternatively, build the software yourself by following the Build instructions.
Convert your VCF(s) to the SMC++ input format with vcf2smc:
$ smc++ vcf2smc my.data.vcf.gz out/chr1.smc.gz chr1 Pop1:S1,S2
This command will parse data for the contig chr1 for samples S1 and S2, which are members of population Pop1. You should run this once for each independent contig in your dataset, producing one SMC++ output file per contig.
Fit the model using estimate:
$ smc++ estimate --theta .00025 -o analysis/ out/example.chr*.smc.gz
Depending on sample size and your machine, the fitting procedure should take between a few minutes and a few hours. The fitted model will be stored in JSON format in analysis/model.final.json. For details on the format, see below.
Visualize the results using plot:
$ smc++ plot plot.pdf analysis/model.final.json
SMC++ requires the following libraries and executables in order to run:
- Python 3.0 or greater.
- A C++11 compiler (gcc 4.8 or later, for example).
- gmp, for some rational field computations.
- mpfr, for some extended precision calculations.
- gsl, the GNU Scientific Library.
On Ubuntu (or Debian) Linux, the library requirements may be installed using the command:
$ sudo apt-get install -y libgmp-dev libmpfr-dev libgsl0-dev
On OS X, the easiest way to install them is using Homebrew:
$ brew install mpfr gmp gsl
After installing the requirements, SMC++ may be installed by running:
$ python setup.py install
in the top-level directory.
Versions of Clang shipping with Mac OS X do not currently support
OpenMP. SMC++ will be much faster if it runs in multithreaded mode
using OpenMP. For this reason it is recommended that you use gcc
instead. In order to do so, set the CC and CXX environment
variables, e.g.:
$ export CC=gcc-5 CXX=g++-5
$ CC=gcc-5 CXX=g++-5 python setup.py install
SMC++ pulls in a fair number of Python dependencies. If you prefer to keep this separate from your main Python installation, or do not have root access on your system, you may wish to install SMC++ inside of a virtual environment. To do so, first create and activate the virtual environment:
$ virtualenv -p python3 <desired location>
$ source <desired location>/bin/activate
Then, install SMC++ as described above.
SMC++ comprises several subcommands which are accessed using the syntax:
$ smc++ <subcommand>
This subcommand converts (biallelic, diploid) VCF data to the format used by SMC++.
- An indexed VCF file.
- An output file. Appending the .gz extension will cause the output to be compressed; the estimate command can read from both compressed and uncompressed data sources.
- A contig name. Each call to vcf2smc processes a single contig. VCFs containing multiple contigs should be processed via multiple separate runs.
- A list of population(s) and samples. Each population has an id followed by a comma-separated list of sample IDs (column names in the VCF). Up to two populations are supported.
For example, to convert contig chr1 of vcf.gz using samples NA12878 and NA12879 of population CEU, saving to chr1.smc.gz, use:
$ smc++ vcf2smc vcf.gz chr1.smc.gz chr1 CEU:NA12878,NA12879
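Because each call handles a single contig, a multi-contig VCF needs one invocation per contig. The following is a minimal Python sketch that only builds the command lines (it does not run them), using the argument order shown above. The helper name vcf2smc_commands, the out/ directory, and the contig name chr2 are illustrative assumptions, not part of SMC++.

```python
# Sketch only: builds (but does not execute) one vcf2smc command per contig.
# vcf2smc_commands is a hypothetical helper, not part of SMC++.

def vcf2smc_commands(vcf, contigs, population, samples, out_dir="out"):
    """Return one argument list per contig, in the order
    <vcf> <output> <contig> <pop>:<sample>,<sample>,..."""
    pop_spec = f"{population}:{','.join(samples)}"
    return [
        ["smc++", "vcf2smc", vcf, f"{out_dir}/{contig}.smc.gz", contig, pop_spec]
        for contig in contigs
    ]


if __name__ == "__main__":
    # chr2 is a made-up contig name; substitute the contigs present in your VCF.
    for cmd in vcf2smc_commands("vcf.gz", ["chr1", "chr2"], "CEU", ["NA12878", "NA12879"]):
        print(" ".join(cmd))
```

Each returned list can be handed to subprocess.run(cmd, check=True) once smc++ is on your PATH.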
-d: SMC++ relies crucially on the notion of a pair of distinguished lineages (see the paper for details on this terminology). The identity of the distinguished lineages is set using the -d option, which specifies the sample(s) that will form the distinguished pair. -d accepts two sample IDs. The first allele will be taken from sample 1 and the second from sample 2. To form the distinguished pair using one haplotype from each of NA12878 and NA12879 in the above example:
$ smc++ vcf2smc -d NA12878 NA12879 vcf.gz chr1.smc.gz chr1 CEU:NA12878,NA12879
Note that "first" and "second" allele have no meaning for unphased data!
By varying -d over the same VCF, the user can create distinct data sets for estimation. This is useful for forming composite likelihoods. For example, the following command will create three data sets from contig chr1 of myvcf.gz, by varying the identity of the distinguished individual and treating the remaining two samples as "undistinguished":
$ for i in {7..9}; do smc++ vcf2smc -d NA1287$i NA1287$i myvcf.gz out.$i.txt chr1 CEU:NA12877,NA12878,NA12879; done
vcf2smc targets a common use-case but may not be sufficient for all
users. Those wishing to implement their own custom conversion to the SMC++
data format should see the input data format description below.
This command will fit a population size history to data. The basic usage is:
$ smc++ estimate -o out data.smc.gz
-o specifies the directory in which to store the final estimates as well as all intermediate files and debugging output.
--theta sets the population-scaled mutation rate, that is, 2 N_0 \mu, where \mu denotes the per-generation mutation rate and N_0 is the baseline diploid effective population size (see --N0, below). If --theta is not specified, Watterson's estimator will be used. It is recommended to set this using prior knowledge of \mu if at all possible.
--rho sets the population-scaled recombination rate, that is, 2 N_0 r, where r denotes the per-generation recombination rate. If not specified, this will be estimated from the data. The estimates should be fairly accurate if the recombination rate is not large compared to the mutation rate.
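To make the scaling concrete, here is a small Python sketch of the two quantities mentioned above. The values N_0 = 10,000 and \mu = 1.25e-8 are illustrative assumptions only; with them, 2 N_0 \mu comes out at about 0.00025, which happens to match the --theta value in the estimate example near the top of this document. The Watterson function uses the textbook per-site form S / (a_n L); how SMC++ computes it internally is not described here, and both function names are hypothetical.

```python
# Sketch only: textbook formulas, not SMC++'s internal code.

def scaled_mutation_rate(n0, mu):
    """Population-scaled mutation rate theta = 2 * N_0 * mu."""
    return 2.0 * n0 * mu


def watterson_theta(num_segregating, num_haplotypes, num_sites):
    """Textbook Watterson estimator per site: S / (a_n * L),
    where a_n = sum_{i=1}^{n-1} 1/i for n sampled haplotypes."""
    a_n = sum(1.0 / i for i in range(1, num_haplotypes))
    return num_segregating / (a_n * num_sites)


if __name__ == "__main__":
    # Assumed values, chosen purely for illustration.
    print(scaled_mutation_rate(n0=1e4, mu=1.25e-8))
    # 11 segregating sites among 4 haplotypes over 1000 bases (made-up numbers).
    print(watterson_theta(num_segregating=11, num_haplotypes=4, num_sites=1000))
```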
A number of other arguments concerning technical aspects of the fitting procedure
exist. To see them, pass the -h option to estimate.
This command plots fitted size histories. The basic usage is:
$ smc++ plot plot.png model1.json model2.json [...] modeln.json
where model*.json are fitted models produced by estimate.
-g sets the generation time (in years) used to scale the x-axis. If not given, the plot will be in coalescent units.
--logy plots the y-axis on a log scale.
-c produces a CSV-formatted table containing the data used to generate the plot.
This command fits two-population split models using marginal estimates produced by estimate.
The data files should be ASCII text and can optionally be gzipped. The format of each line of the data file is as follows:
<span> <d> <u1> <n1> [<u2> <n2>]
Explanation of each column:
- span gives the number of contiguous bases at which this observation occurred. Hence, it will generally be 1 for SNPs and greater than one for a stretch of nonsegregating sites.
- d gives the genotype (0, 1, or 2) of the distinguished individual. If the genotype of the distinguished individual is not known, this should be set to -1.
- The next column u1 is the total number of derived alleles found in the remainder of the (undistinguished) sample at the site(s).
- The final column n1 is the haploid sample size (number of non-missing observations) in the undistinguished portion of the sample.
- If two populations are to be analyzed, u2 and n2 are also specified for the second population.
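For those writing their own tooling, the column layout above can be read with a few lines of Python. This is a minimal sketch: the Record type and parse_record function are hypothetical names, and the sanity checks (span at least 1, d in {-1, 0, 1, 2}, 0 <= u <= n) are inferred from the column descriptions rather than taken from SMC++ itself.

```python
# Sketch only: a hypothetical reader for one line of the data format.
from typing import NamedTuple, Optional


class Record(NamedTuple):
    span: int                 # number of contiguous bases with this observation
    d: int                    # distinguished genotype (0, 1, 2) or -1 if unknown
    u1: int                   # derived alleles in the undistinguished sample
    n1: int                   # non-missing haploid observations
    u2: Optional[int] = None  # second population, if present
    n2: Optional[int] = None


def parse_record(line):
    """Parse '<span> <d> <u1> <n1> [<u2> <n2>]' into a Record."""
    fields = [int(tok) for tok in line.split()]
    if len(fields) not in (4, 6):
        raise ValueError(f"expected 4 or 6 columns, got {len(fields)}")
    rec = Record(*fields)
    # Checks below are assumptions derived from the column descriptions.
    if rec.span < 1:
        raise ValueError("span must be at least 1")
    if rec.d not in (-1, 0, 1, 2):
        raise ValueError("d must be -1, 0, 1 or 2")
    if not 0 <= rec.u1 <= rec.n1:
        raise ValueError("u1 must lie between 0 and n1")
    if rec.u2 is not None and not 0 <= rec.u2 <= rec.n2:
        raise ValueError("u2 must lie between 0 and n2")
    return rec


if __name__ == "__main__":
    print(parse_record("1 -1 0 2"))
```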
For example, consider the following set of genotypes at a set of 10 contiguous bases on three diploid individuals in one population:
dist.  ..1..N...2
       .....N...1
       2N....+...
The distinguished individual is row one. A . indicates that the
individual is homozygous for the ancestral allele, while an integer
indicates that the individual possesses (1,2) copies of the derived
allele. An N indicates a missing genotype at that position. Finally,
the + in column seven indicates that individual three possessed the
ancestral allele on one chromosome, and had a missing observation on the
other chromosome (this would be coded as 0/. in a VCF).
The SMC++ format for this input file is:
1 0 2 4
1 0 0 2
1 1 0 4
2 0 0 4
1 -1 0 2
1 0 0 3
2 0 0 4
1 2 1 4
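The encoding can be checked by hand with a short Python sketch that applies the column definitions to the genotype table and run-length encodes consecutive identical observations. The symbol table and the encode function are illustrative, not SMC++ code. Treating a partially missing genotype in the distinguished row as unknown (-1) is an assumption, since the example does not cover that case.

```python
# Sketch only: derives <span> <d> <u1> <n1> records from the example table.
from itertools import groupby

# symbol -> (derived alleles, non-missing haploid observations)
SYMBOLS = {".": (0, 2), "1": (1, 2), "2": (2, 2), "N": (0, 0), "+": (0, 1)}


def encode(distinguished, undistinguished):
    """Return a list of (span, d, u, n) tuples for one population."""
    sites = []
    for col, sym in enumerate(distinguished):
        derived, observed = SYMBOLS[sym]
        # Assumption: anything short of a full diploid call is coded as -1.
        d = derived if observed == 2 else -1
        counts = [SYMBOLS[row[col]] for row in undistinguished]
        u = sum(c[0] for c in counts)
        n = sum(c[1] for c in counts)
        sites.append((d, u, n))
    # Merge runs of identical observations into a single record with a span.
    return [(len(list(run)), *site) for site, run in groupby(sites)]


if __name__ == "__main__":
    for record in encode("..1..N...2", [".....N...1", "2N....+..."]):
        print(" ".join(str(x) for x in record))
```

Each printed line has the form span, d, u1, n1. Columns four and five, and likewise eight and nine, collapse into single records with a span of 2 because their observations are identical.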
Upon completion, SMC++ will write a JSON-formatted model file into the analysis directory. The file is human-readable and contains various parameters related to the fitting procedure.