caballero / FakeFamily

Simulation of family genome sequencing

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

FAMILY GENOME SIMULATOR

Copyright (C) 2011 Juan Caballero [Insitute for Systems Biology]

This is a basic pedigree simulation program, created for the Family Genomics Group at the Institute for Systems Biology (http://familygenomics.systemsbiology.net/).

QUICK TUTORIAL
# STEP 1. Create parental genomes
bin/personalGenome.pl -g data/hg19.fa.gz -d data/dbSNP132.var.gz -p data/CEU.var.gz -o father.var -s M
bin/personalGenome.pl -g data/hg19.fa.gz -d data/dbSNP132.var.gz -p data/CEU.var.gz -o mother.var

# STEP 2. Artificial sex
bin/reproduction.pl -f father.var -m mother.var -c girl1.var -r data/hg19.fa.gz
bin/reproduction.pl -f father.var -m mother.var -c boy2.var -s M
cp boy2.var boy3.var # identical twin

# STEP 3. Add noise/error
bin/noise.pl -i father.var -o father.var.data -s M -g data/hg19.fa.gz
bin/noise.pl -i mother.var -o mother.var.data -s F -g data/hg19.fa.gz
bin/noise.pl -i  girl1.var -o  girl1.var.data -s F -g data/hg19.fa.gz
bin/noise.pl -i   boy1.var -o   boy1.var.data -s M -g data/hg19.fa.gz
bin/noise.pl -i   boy2.var -o   boy2.var.data -s M -g data/hg19.fa.gz

# Complete documentation
# perldoc PROGRAM
# PROGRAM -h
# PROGRAM --help

DESCRIPTION

The simulation occurs in three steps (programs):

1. Creating a personal genome (personalGenome.pl): three sources are used for the variation: 1000genomes population variations data, dbSNP and the reference genome. The total number of variations (SNPs for now) is an expected number (~3 mill) with a little noise, or you can fix the number. De novo SNPs can be added (default 1%). You can choose the sex, but for now there is not modeling in recombination points in sexual chromosomes. In the first round it selects population-specific SNPs (~ 1mill because data comes from SNP-chips), the second round take data from dbSNP and in the last round de novo SNPs are created. The program integrates diverse options and configurations (no population SNPs, no de novo, homozygous SNPs, etc) to be as flexible as possible. The output is a text table with columns:
CHROM POSITION REF_ALLELE ALLELE_1 ALLELE_2 ALLELE_INFO

2. Artificial Sex (reproduction.pl): this script takes a personal genome from a "father" and a "mother", performs meiosis in each genome based with a distribution of recombination points and mix together chromosomes to produce an "offspring". You can define the sex, total SNPs, de novo SNPs and other parameters. Real genome data can be used if it's correctly formated. The output is the same:
CHROM POSITION REF_ALLELE ALLELE_1 ALLELE_2 ALLELE_INFO

3. Noise/Error addition (noise.pl): this part models missing SNPs, bad base calling, half-calling and false positives SNPs as error/noise.

There are some optimization that can be added, but each programs uses a lot of memory (~4-5 GB), because it needs to load the reference genome, but each program is relatively fast, 5 min to crete a genome and 5 min to create an offspring.

About

Simulation of family genome sequencing

License:GNU General Public License v3.0


Languages

Language:Perl 100.0%