percyfal / simseqgen

Simulate sequences from genealogies and corresponding read sequences

simseqgen - simulate reference sequences and reads

simseqgen provides wrapper scripts to simulate reference sequences and reads.

Usage

simseqgen subcommand

where subcommand is one of the subcommands described below.

Subcommands

repo - list demographic models

simseqgen is packaged with several demographic models written according to the demes specification. The repo subcommand provides information about these models.

make_ref - generate reference sequences from tree sequences

The make_ref subcommand takes as input a tree sequence generated by simulation, for example with msprime or SLiM. The output consists of a VCF file with the variant sites and a FASTA file with all sequences.

If a reference individual is specified, a reference FASTA file is also generated, and the variant calls of the other individuals are recoded relative to the reference sequence. This can be useful for simulating outgroups.
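As an illustration of recoding variant calls relative to a reference, the sketch below flips the 0/1 alleles at every site where the chosen reference haplotype carries the derived allele. It is a minimal sketch and not simseqgen's own code: the function name `repolarize` and the list-of-lists genotype layout are assumptions made for this example.

```python
def repolarize(genotypes, ref_index):
    """Recode 0/1 alleles so that the reference haplotype defines allele 0.

    genotypes: list of sites, each a list of 0/1 alleles, one per haplotype
               (hypothetical layout chosen for this sketch).
    ref_index: column of the haplotype used as reference.
    Returns the recoded sites with the reference column removed.
    """
    recoded = []
    for site in genotypes:
        ref_allele = site[ref_index]
        # XOR with the reference allele: where the reference is derived (1),
        # the allele of every other haplotype is flipped.
        others = [a ^ ref_allele for i, a in enumerate(site) if i != ref_index]
        recoded.append(others)
    return recoded


sites = [
    [0, 1, 0, 1],  # reference (last column) carries the derived allele
    [1, 0, 0, 0],  # reference carries the ancestral allele
    [0, 0, 0, 1],  # derived allele found only in the reference
]
print(repolarize(sites, ref_index=3))
# [[1, 0, 1], [1, 0, 0], [1, 1, 1]]
```

The last site shows why this matters for outgroups: a mutation private to the reference becomes a fixed difference in all remaining samples.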

WIP: By default, the FASTA sequences are generated by filling the monomorphic sites with bases drawn uniformly from A, C, G, and T. Alternatively, a reference sequence can be supplied, in which case only the variant sites are modified.
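The two modes described above can be sketched as follows: either draw a random background sequence, or start from a supplied one, and then overwrite only the variant positions for each haplotype. This is a minimal sketch under assumed names; `build_haplotypes` and the `variants` dictionary layout are hypothetical and do not reflect simseqgen's internals.

```python
import random

BASES = "ACGT"


def build_haplotypes(length, variants, n_haplotypes, background=None, seed=42):
    """Return one sequence string per haplotype.

    variants: dict mapping position -> (ancestral, derived, alleles), where
              alleles holds one 0/1 entry per haplotype (hypothetical layout).
    background: optional supplied sequence; if None, monomorphic sites are
                filled with bases drawn uniformly from A, C, G, T.
    """
    rng = random.Random(seed)
    if background is None:
        background = "".join(rng.choice(BASES) for _ in range(length))
    elif len(background) != length:
        raise ValueError("background length does not match sequence length")

    haplotypes = []
    for h in range(n_haplotypes):
        seq = list(background)
        # Only the variant sites are modified; all other positions are shared.
        for pos, (ancestral, derived, alleles) in variants.items():
            seq[pos] = derived if alleles[h] else ancestral
        haplotypes.append("".join(seq))
    return haplotypes


variants = {2: ("C", "T", [0, 1]), 5: ("G", "C", [1, 0])}

# Supplied background: only positions 2 and 5 change.
supplied = build_haplotypes(8, variants, 2, background="AAAAAAAA")
print(supplied)  # ['AACAACAA', 'AATAAGAA']

# Random background: monomorphic sites are drawn uniformly from the four bases.
random_fill = build_haplotypes(8, variants, 2)
```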

sim_reads - simulate reads based on reference sequences

TODO.

Example simulation - Out of Africa with outgroups

This example simulates reference sequences based on an Out of Africa demes model with chimpanzee, gorilla, and orangutan as outgroups. First, list the packaged models to find the path to the demes file:

simseqgen repo --ls

name                          uri
ooa_with_outgroup             /path/to/simseqgen/data/ooa_with_outgroup.demes.yaml

The demes file is first used as input to msprime to simulate ancestry:

msp ancestry --demography /path/to/simseqgen/data/ooa_with_outgroup.demes.yaml CHB:6 YRI:6 CEU:7 chimpanzee:1 gorilla:1 orangutan:1 -o ooa.ts --recombination-rate 1e-8 --length 1e6 --random-seed 42

Mutations are then added to the resulting tree sequence:

msp mutations --random-seed 42 1.25e-9 ooa.ts -o ooa.mut.ts

Finally, run simseqgen to generate the VCF and FASTA files for the tree sequence:

simseqgen make_ref ooa.mut.ts --reference_chromosome CEU:6

Because one of the seven CEU samples has been chosen as the reference, the output VCF contains six CEU samples. In addition, all derived alleles specific to the reference have been flipped. In this case three output files are generated: 1) simseqgen.reference.fasta, containing the first haplotype of the reference individual; 2) simseqgen.fasta, containing the chromosomes from the other individuals; and 3) simseqgen.vcf.gz, containing the variants for the non-reference individuals.
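The sample counts in this example can be checked with a few lines of arithmetic. The sketch below parses the population:count arguments given to msp ancestry and removes the one CEU individual used as reference. The helper names `parse_samples` and `without_reference` are hypothetical and exist only for this illustration.

```python
def parse_samples(spec):
    """Parse a string such as 'CHB:6 YRI:6' into {'CHB': 6, 'YRI': 6}."""
    counts = {}
    for item in spec.split():
        name, _, n = item.partition(":")
        counts[name] = int(n)
    return counts


def without_reference(counts, population):
    """Return a copy of counts with one individual removed from population."""
    remaining = dict(counts)
    remaining[population] -= 1
    return remaining


counts = parse_samples("CHB:6 YRI:6 CEU:7 chimpanzee:1 gorilla:1 orangutan:1")
remaining = without_reference(counts, "CEU")

print(sum(counts.values()))     # 22 simulated individuals
print(remaining["CEU"])         # 6 CEU individuals left
print(sum(remaining.values()))  # 21 non-reference individuals
```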

License: MIT License

