MikeAxtell/sim_srna-seq

SYNOPSIS
    sim_sRNA-seq.pl

    Simulation of plant small RNA-seq data

    Copyright (C) 2014 Michael J. Axtell

AUTHOR
    Michael J. Axtell, Penn State University, mja18@psu.edu

LICENSE
    This program is free software: you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation, either version 3 of the License, or (at your
    option) any later version.

        This program is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
        GNU General Public License for more details.

        You should have received a copy of the GNU General Public License
        along with this program.  If not, see <http://www.gnu.org/licenses/>.

DEPENDENCIES
    perl 5 <www.perl.org> at /usr/bin/perl

    samtools <http://samtools.sourceforge.net/> in the PATH

INSTALL
    Ensure dependicies are present (see above). Add the script
    sim_sRNA-seq.pl to your PATH

USAGE
        sim_sRNA-seq.pl [options] -s soft-masked-genome.fa                                                                                        

    --- OR, LESS PREFERRED ---

        sim_sRNA-seq.pl [options] -d hard-masked-genome.fa -n not-masked-genome.fa      

OPTIONS
    Options:

    -s : path to soft-masked genome. Lower-case letters are assumed
    repeat-masked. If -s is specified, neither -d nor -n can be specified.

    -d : path to hard-masked genome. If -d is specified, -n must also be
    specified, and -s cannot be specified.

    -n : path to not-masked genome. If -n is specified, -d must also be
    specified, and -s cannot be specified.

    -v : print version number and quit.

    -h : print help message and quit.

    -r : desired number of reads, in millions. Default: 5.

    -e : per-read probability of a single nt sequencing error
    (substitution). Default: 1E-4.

METHODS
  Read numbers and types
    30% of the simulated reads will come from roughly 100 MIRNA loci, 5% of
    the simulated reads will come from roughly 20 tasiRNA/phased secondary
    siRNA loci, and the remaining 65% from roughly 10,000 heterochromatic
    siRNA loci.

    The abundance of reads from each locus is distributed on a log-linear
    scale (e.g., plotting the log10 of read number as a function of
    abundance rank yields a straight line).

  MIRNA simulation
    Valid MIRNA loci have no overlap with repeat-masked regions of the
    genome. The locus has a hypothetical size of 125nts, of which all must
    be ATGC bases (e.g. no ambiguous nts.).

    The strand of the MIRNA precursor is randomly selected, as is the arm
    from which the mature miRNA and star come from. There is no actual
    hairpin sequence necessarily present at MIRNA loci .. it is only the
    pattern of reads that is being simulated.

    The mature miRNA and mature miRNA* are defined as 'master' positions.
    The left-most 'master' position in a locus is a 21-mer starting at
    position 17 of the locus. The right-most 'master' position is a 21 mer
    at position 85. Assignment of the arm (i.e., whether the left-most or
    right-most 'master' position is the miRNA or miRNA*) is random at each
    locus.

    Once a locus has been found and 'master' positions defined, each read is
    simulated according to the following probabilities. In the following
    list, "miR" means mature miRNA, "star" means miRNA*. The numbers after
    each indicate the offset at 5' and 3' ends relative to the master
    positions. So, "miR0:1" means the mature miRNA sequence, starting at the
    master 5' end, and ending 1 nt after the master 3' end.

    60% miR0:0, 20% star0:0,, 4% miR0:-1, 1% miR0:-2, 4% miR1:1, 1% miR1:0,
    2% miR-1:-1, 0.5% miR-1:-2, 2% miR-2:-2, 0.5% miR-2:-3, 1% star0:-1,
    0.5% star0:-2, 1% star1:1, 0.5% star1:0, 2% star-1:-1.

    Sampling of simulated MIRNA-derived reads continues until the required
    number of reads for a particular locus is recovered.

  tasiRNA/phased siRNA simulation
    TAS loci have no overlap with repeat-masked areas of the genome. Each
    locus has a nominal size of 140nts. All bases must be ATGC
    (non-ambiguous).

    Each locus is simulated to be diced in 6 21 nt phases. At each phasing
    position, 21mers are the dominant size, with 20mers and 22mers being
    less frequent .. the 20 and 22nt variants vary in their 3' positions
    relative to the 'master' 21nt RNAs.

    Once a locus has been identified, and all possible 20, 21, and 22mers
    charted, each read is simulated according to the following
    probabilities:

    Strand of origin is 50% top, 50% bottom.

    Phase position is equal chance for all (e.g. 1/6 chance for any
    particular phase location).

    80% of the time, the 21mer is returned, 10% of the time the 20mer, and
    10% the 22 mer.

  Heterochromatic siRNA simulation
    Heterochromatic siRNA loci can come from anywhere in the genome
    (repeat-masked or un-masked), and their nominal locus size is 100nts.
    All nts in the 100 nt locus must be ATGC (no ambiguous positions).

    All possible start and stop positions for 24, 23, 22, and 21 nt RNAs are
    eligible from these loci. Each simulated read is identified with the
    following probabilities:

    50% chance for top of bottom strand origin

    5' position within the locus is random

    90% chance of a 24 mer, 5% chance of a 23 mer, 3% chance of a 22 mer,
    and 2% chance of a 21 mer.

  Simulation of sequencing errors
    Regardless of small RNA type, once a given read has been simulated,
    there is a chance of introucing a single-nucleotide substitution
    relative to the reference genome at a randomly selected position in the
    read. The chance of introducing this error is given by option -e
    (Default is 1 in 10,000).

  No overlapping loci
    None of the simulated loci are allowed to have any overlap with each
    other.

OUTPUT
  Files
    Two files are created in the working directory.

    simulated_sRNA-seq_reads_overview.txt : A tab-delimited text file giving
    the coordinates and names for each simulated locus.

    simulated_sRNA-seq_reads.fasta : A FASTA file of all of the simulated
    reads.

  Naming conventions
    The FASTA header of each simulated read encode the basic information
    about the read. Several different fields are separated by "_"
    characters. For instance the following read header
    ">MIRNA_2_2_chr7:123063894-123063914_+_0" means ..

    MIRNA: This came from a simulated MIRNA locus

    2: The 2nd simulated locus of this type

    2: Read number 2 from this locus

    chr7:123063894-123063914: The true origin of this read.

    +: The genomic strand of this read.

    0: The number of sequencing errors simulated into the read.
MikeAxtell / sim_srna-seq

About

Languages