EricArcher / ebvSim

GEO-BON Genetics Working Group simulation based evaluations of Essential Biodiversity Variables (EBVs)

ebvSim

Description

ebvSim is an R package for simulating SNP data and testing the performance of various genetic diversity metrics. It simulates replicate data sets for multiple scenarios. The package wraps the coalescent simulator fastsimcoal2, run through the strataG package, and then runs a few generations of the forward simulator rmetasim, initialized with allele frequencies from the genotypes output by fastsimcoal2.


Installation

1. Download and install fastsimcoal2 so that it can be executed on the command line from any location. This requires the executable to be in a folder that is on the execution PATH. To test this, try to run fastsimcoal2 (fsc26) from any folder. Here are sites with guidance for setting the path for different operating systems: Windows, Mac OSX, or LINUX. On Mac OS or Linux/UNIX systems, executables can be placed in the /usr/local/bin folder, which is usually on the PATH by default.
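The PATH test can also be run from inside R. This is a minimal sketch using base R's Sys.which(); it assumes the executable is named fsc26 as mentioned above, so adjust the name if your binary differs.

```r
# Look up the fastsimcoal2 executable on the PATH.
# "fsc26" is the executable name used in this README; change it if yours differs.
fsc.path <- Sys.which("fsc26")

if (nzchar(fsc.path)) {
  message("Found fastsimcoal2 at: ", fsc.path)
} else {
  message("fsc26 was not found on the PATH; check your PATH setting.")
}
```

An empty string means R cannot see the executable, in which case the simulations are unlikely to start.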

2. Install the ebvSim package from GitHub with:

devtools::install_github("ericarcher/ebvSim", dependencies = TRUE, force = TRUE)

This should also install strataG and rmetasim from their GitHub repositories. If these packages are not available after the ebvSim installation, install them from:

devtools::install_github("ericarcher/strataG", dependencies = TRUE, force = TRUE)
devtools::install_github("stranda/rmetasim", dependencies = TRUE, force = TRUE)

3. By default, rmetasim can simulate a maximum of 1001 loci. To raise this limit, change the constant that sets it and recompile rmetasim. Instructions for this can be found here.


Select simulation scenarios

The code is designed to run multiple replicates of a set of demographic scenarios. The scenarios are defined by rows in data frames that are included in the package. The first set of scenarios is called trial.1, and has the following columns:

  • num.pops: the number of populations.
  • Ne: the effective population size.
  • num.samples: the number of samples to simulate. Must be <= Ne. If NA, then num.samples = Ne.
  • mig.rate: the migration rate, specified as the proportion of the population migrating per generation.
  • mig.type: the type of migration matrix structure. "island" = the rate between all populations is the same. "stepping.stone" = migration only occurs between neighboring populations. The metapopulation is ring-shaped, not a linear chain.
  • dvgnc.time: the number of generations since divergence of the populations.
  • marker.type: type of marker to simulate. At this point, only "snp" is available.
  • mut.rate: mutation rate (# of mutations per generation) of the markers to simulate.
  • num.loci: number of independent loci to simulate.
  • ploidy: the ploidy of the markers to be simulated. Set to 2 for diploid.
  • rmetasim.ngen: the number of generations to run Rmetasim for. Set to 0 to skip Rmetasim.
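To make the two mig.type options concrete, here is an illustrative sketch of the migration matrices they describe. The helper makeMigMatrix() is hypothetical and not part of ebvSim. It also assumes that mig.rate is applied to each connected pair of populations, which may differ from how the package builds its matrices internally.

```r
# Hypothetical helper (not an ebvSim function): builds a num.pops x num.pops
# migration matrix where entry [i, j] is the rate between populations i and j.
makeMigMatrix <- function(num.pops, mig.rate,
                          mig.type = c("island", "stepping.stone")) {
  mig.type <- match.arg(mig.type)
  m <- matrix(0, nrow = num.pops, ncol = num.pops)
  if (num.pops < 2) return(m)

  if (mig.type == "island") {
    # Same rate between every pair of populations
    m[, ] <- mig.rate
  } else {
    # Ring: each population exchanges migrants only with its two neighbors,
    # and the first and last populations are neighbors of each other
    for (i in seq_len(num.pops)) {
      left <- if (i == 1) num.pops else i - 1
      right <- if (i == num.pops) 1 else i + 1
      m[i, left] <- mig.rate
      m[i, right] <- mig.rate
    }
  }

  diag(m) <- 0  # no migration from a population to itself
  m
}

island <- makeMigMatrix(4, 0.01, "island")
ring <- makeMigMatrix(4, 0.01, "stepping.stone")
print(island)
print(ring)
```

In the ring matrix, populations 1 and 4 are connected, which is what distinguishes it from a linear chain.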

For the first set of trials, we would like people to sign up on the Google spreadsheet to run 100 replicates of each scenario.


Run the simulations

To run a specific set of scenarios, the simulation can be run with the runEBVsim() function:

rm(list = ls())
library(ebvSim)
data(trial.1)

# create a vector of specific scenarios
i <- c(1, 5, 10, 12, 20)

runEBVsim(
  label = "EIA_trial.1_1",
  scenarios = trial.1[i, ],
  num.rep = 10,
  num.cores = 4
)

The parameters for runEBVsim() are:

  • label: text to use to label this set of scenarios. The format should be "initials_trial.name_attempt.num". attempt.num can be a number you use to separate different attempts if you're doing either multiple trials or multiple runs of the same set of scenarios on different systems.
  • scenarios: the data frame of scenarios to run.
  • num.rep: the number of replicates to simulate for each scenario.
  • num.cores: the number of cores to use. Replicates for all scenarios are load balanced amongst the number of cores selected. That is, as cores are freed, the next replicates will be allocated to those cores, so that cores are always working until there are fewer than num.cores replicates left. If this parameter is set to a value greater than 1, progress notifications will not be printed on the console.

NOTE: Some scenarios can take a lot of memory and a long time to run. Resource use scales in proportion to the number of individuals (Ne * num.pops). On many systems, multiple cores share the available memory. Thus, increasing the number of cores multiplies memory usage and can cause system crashes. If several large scenarios are in the group being run, it is suggested to use fewer cores and let the simulation run longer. These choices will be system dependent, and it is suggested that a few test runs be done with a small number of replicates to ensure that crashes do not occur.
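As a rough way to act on this note, the sketch below ranks scenarios by Ne * num.pops and picks a conservative core count. The scenario values are made up for illustration, and using half of the detected cores is only an assumed starting point, not a recommendation from the package.

```r
library(parallel)

# Made-up scenarios for illustration; in practice use rows of trial.1
scenarios <- data.frame(
  num.pops = c(2, 5, 10),
  Ne = c(100, 1000, 5000)
)

# Resource use scales with the total number of individuals
scenarios$total.inds <- scenarios$Ne * scenarios$num.pops
ranked <- scenarios[order(scenarios$total.inds, decreasing = TRUE), ]
print(ranked)

# detectCores() can return NA on some systems, so fall back to 1
avail <- detectCores()
if (is.na(avail)) avail <- 1

# Assumed starting point: use at most half of the available cores
num.cores <- max(1, floor(avail / 2))
num.cores
```

If the largest scenarios at the top of the ranking still cause memory problems, lower num.cores further before running the full set of replicates.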

If you have an error, check your folder for files that end in "_ERROR.ext". If they exist, send them to Eric Archer along with a zip of the completed scenario data. If the system crashes due to memory problems, these files will probably not be produced. In that case, reduce the number of cores being used and try to run again.

If errors or crashes occur, do not delete the scenario replicates that have already run - they can still be used. Just reduce the number of new replicates that are being run and change the attempt number in <label>.
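After a failed run, the following sketch lists any error files and counts the replicates that finished. Both helpers are hypothetical, not ebvSim functions. They assume the error files sit in the working directory and that each completed replicate is one .csv file in the "<label>_scenario.replicates" folder.

```r
# Hypothetical helper: find files whose names end in "_ERROR.<extension>"
findErrorFiles <- function(dir = ".") {
  list.files(dir, pattern = "_ERROR\\.[^.]+$", full.names = TRUE)
}

# Hypothetical helper: count the genotype .csv files already written for a label
countReplicates <- function(label, dir = ".") {
  rep.dir <- file.path(dir, paste0(label, "_scenario.replicates"))
  length(list.files(rep.dir, pattern = "\\.csv$"))
}

findErrorFiles(".")
countReplicates("EIA_trial.1_1")
```

The count shows how many replicates already exist, so you can decide how many new ones to request under the next attempt number.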


Upload simulation results

When the simulations are complete, there will be three new items in the working directory:

  • <label>_scenario.replicates: a folder containing .csv files of genotypes for each scenario replicate.
  • <label>_scenarios.csv: a .csv file of the scenario specifications.
  • <label>_params.rdata: an R workspace file containing a list called params that contains the parameters used to run the scenarios and a summary of the scenario replicates, with their start and stop times, the total run time, and the filename of genotypes created.

The runEBVsim() function also invisibly returns the same summary list contained in <label>_params.rdata.
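To inspect a finished run later, the saved workspace can be loaded into its own environment so that it does not overwrite objects in your session. This is a minimal sketch. The helper loadParams() is hypothetical, and the internal structure of params is not assumed beyond it being a list, so str() is used to explore it.

```r
# Hypothetical helper: load the 'params' list saved by a run, or return NULL
# if the file is not present in 'dir'
loadParams <- function(label, dir = ".") {
  params.file <- file.path(dir, paste0(label, "_params.rdata"))
  if (!file.exists(params.file)) return(NULL)
  run.env <- new.env()
  load(params.file, envir = run.env)
  run.env$params
}

params <- loadParams("EIA_trial.1_1")
if (!is.null(params)) str(params, max.level = 1)
```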

When the run is complete, compress the folder of results along with the "_params.rdata" file and upload them to the Google Drive folder here. Name the compressed file "<label>_results.tar.gz" (or .zip, or whatever compression algorithm you use).
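The compression step can be done from R with utils::tar(). The helper packResults() below is a hypothetical convenience wrapper, not part of ebvSim. It assumes the replicate folder and the params file are both in the same directory.

```r
# Hypothetical helper: bundle the replicate folder and params file for a label
# into "<label>_results.tar.gz" inside 'dir'
packResults <- function(label, dir = ".") {
  old.wd <- setwd(dir)
  on.exit(setwd(old.wd))

  items <- c(
    paste0(label, "_scenario.replicates"),
    paste0(label, "_params.rdata")
  )
  not.found <- items[!file.exists(items)]
  if (length(not.found) > 0) {
    stop("Missing: ", paste(not.found, collapse = ", "))
  }

  out.file <- paste0(label, "_results.tar.gz")
  utils::tar(out.file, files = items, compression = "gzip")
  invisible(file.path(getwd(), out.file))
}

# Example call, using the label from the run above:
# packResults("EIA_trial.1_1")
```

The function stops with a message if either item is missing, so an incomplete archive is not uploaded by mistake.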


Contact
