NateyJay / sRNA-lib-simulator

Generates simulated plant sRNA-seq data based on real input libraries

This program is based on a script by my graduate advisor Mike Axtell: https://github.com/MikeAxtell/sim_srna-seq. Its purpose is to produce simulated plant sRNA-seq data sets as might be produced from short read sequencing on an Illumina-type system, but with a traceable origin. This is useful when assaying the quality of an aligner.

This script has several advantages over its predecessor:

  • Simulations are based on real sRNA-seq libraries. Reads therefore come from regions of the genome known to produce that size class of sRNA, which excludes regions that do not produce sRNAs and might otherwise skew the simulation.
  • Output provides more files, including FASTA files of reads classified by their multi-mapping status.

This simulator was used extensively in the testing of ShortStack 3.x (https://github.com/MikeAxtell/ShortStack), which is described in the publication: https://doi.org/10.1534/g3.116.030452.

SYNOPSIS
    sim_sRNA_library.py - v0.3

    simulates small RNA-seq libraries based on a template library

    Copyright (C) 2015  Nathan R. Johnson; Michael J. Axtell

AUTHORS
    Nathan R. Johnson, Penn State University, jax523@gmail.com
    Michael J. Axtell, Penn State University, mja18@psu.edu

LICENSE
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.


DEPENDENCIES (all in PATH)
    python 2.6.6
    samtools 1.1
    bowtie 1.0

USAGE
    sim_sRNA_library.py [options] -f sample_lib.fasta -m species_specific_miRNA_annotation.gff3 -g indexed_reference_genome
      --- OR ---
    sim_sRNA_library.py [options] -b sample_lib_alignment_(all).bam -m sim_sRNA_library_specific_miRNA_annotation.gff3 -g indexed_reference_genome
      --- OR ---
    simulator.py [options] -p pseudo_annotation.txt -g indexed_reference_genome

REQUIRED
    -f : fasta file of template sRNA library, already adapter trimmed.  This will be the basis for producing a pseudo-annotation
    -s : sam file of aligned sRNA library.  For proper use of the program, this should be produced using: bowtie --all --best --strata -v 0
    -m : annotation of known mature miRNAs for the species of interest, gff3 format.  Available from miRBase.org
    -p : location of pseudo-annotation file (.txt), generated by this program

OPTIONS
    -o : output identifier for pseudo-annotation and simulated library generation. Default: random.
    -v : print version number and quit
    -h : print help message and quit
    -r : desired total number of reads, in millions. Default: 5
    -e : per-read probability of a single nt sequencing error (substitution). Default: 1E-4
    --suppress_percent : when called, percent progress bars will not be displayed.

METHODS     

    Approach

        This method uses real sRNA-seq data as a template to generate simulated
        sequencing data of user-selectable number of reads. This has been
        developed to test the accuracy of small RNA alignment methods, as the
        known origin of a read allows a researcher to know the rate of
        misaligned reads.

    Genome

        There is no strict requirement for genomic masking in this program, but
        it is recommended to use non- or soft-masked genomes, as sRNAs may come
        from highly repetitive portions of the genome. It is imperative that
        the reference genome and miRNA annotation (.gff3) use the same
        convention for chromosome names.

        Additionally, the genome must be indexed for bowtie using bowtie-build.
        This program expects the bowtie indices to be present in the reference
        genome folder, and looks for the following file as proof:

        reference:   Osativa.fa
        ebwt proof:  Osativa.fa.1.ebwt
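
        The index check described above can be sketched as follows. This is
        a minimal illustration, not the program's own code; the function name
        bowtie_index_present is hypothetical, and it assumes that only the
        '<reference>.1.ebwt' file is tested.

```python
import os


def bowtie_index_present(reference_path):
    """Return True if '<reference_path>.1.ebwt' exists next to the reference.

    Assumption: only the first index file is checked, as the text above
    describes; the other bowtie-build output files are not inspected.
    """
    return os.path.isfile(reference_path + ".1.ebwt")
```

        For example, bowtie_index_present("Osativa.fa") looks for
        Osativa.fa.1.ebwt in the same folder.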

    Template annotation - miRNA

        Identification of candidate locations for miRNAs comes from annotation
        data in the form of a .gff3 file. These data are available for many
        species at mirbase.org.

    Template library - si, tasi, non-sRNA

        sRNA-seq data in fasta format is used as the template input. This
        library should already be adapter trimmed. The library is then aligned
        using bowtie --all --best --strata -v 0, which reports all alignments of
        every read, and the result is saved as a .bam (binary alignment) file.
        This file may be used for subsequent runs of the program to save
        alignment time.

        This template is used in the identification of candidate locations for
        heterochromatic siRNA, tasiRNA and garbage RNA (fragmented or non-sRNA).

    Pseudo-annotation construction

        Using the alignment and annotation data from the previous steps, the
        program constructs a list of candidate 'bins', 150 nt in length. All
        reads falling within a bin will be stored as a depth value for their
        given type. These types are defined simply by size, with siRNA
        encompassing 23 and 24 nt reads, tasiRNA encompassing 21 nt reads, and
        garbage RNA encompassing reads between 15 and 20 nt, as well as over 24
        nt in length.
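
        The size classes above can be written as a small lookup. This is a
        sketch only: the function names are hypothetical, bins are assumed to
        tile each chromosome from its first base (1-based coordinates), and
        the text does not say how 22 nt reads or reads shorter than 15 nt are
        treated, so they are returned as None here.

```python
BIN_SIZE = 150  # bin length in nt, as stated in the text


def classify_read_length(length):
    """Map a read length to the size class named in the text."""
    if length in (23, 24):
        return "siRNA"
    if length == 21:
        return "tasiRNA"
    if 15 <= length <= 20 or length > 24:
        return "garbage"
    # Assumption: 22 nt and <15 nt reads are not covered by the description.
    return None


def bin_index(position):
    """Return the 0-based bin number for a 1-based genomic position.

    Assumption: bins are fixed, non-overlapping 150 nt windows.
    """
    return (position - 1) // BIN_SIZE
```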

        Once depths have been completely assigned, each bin will be chosen for
        only one type of small RNA.

        To be chosen for a given type, that type must account for at least 80%
        of the read depth in that bin, so that the choice reflects expression
        level.  Any bin containing a miRNA read is automatically chosen as
        miRNA and is therefore excluded from the other types.  A minimum depth
        is also required for a bin to be chosen for a type. The minimum depth
        is between 1 and 5 reads and is selected automatically by the program,
        which picks the strictest requirement that still leaves enough loci for
        the later simulation steps.  If a minimum depth of 1 does not leave
        enough loci for later steps, the program deems the library to be of too
        low quality to be an adequate template.
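
        The selection rules above can be sketched as follows. The names
        choose_bin_type and pick_min_depth are hypothetical, and two details
        are assumptions: the minimum depth is applied to the depth of the
        winning type (not the bin total), and "enough loci" is supplied by
        the caller as a per-type count.

```python
def choose_bin_type(depths, has_mirna, min_depth, majority=0.8):
    """Pick the single sRNA type for one bin, or None if no type qualifies.

    depths: dict mapping type name to read depth in this bin.
    has_mirna: True if any annotated miRNA read falls in the bin.
    """
    if has_mirna:
        return "miRNA"  # miRNA bins are claimed before any other type
    total = sum(depths.values())
    if total == 0:
        return None
    for rna_type, depth in depths.items():
        if depth >= min_depth and float(depth) / total >= majority:
            return rna_type
    return None


def pick_min_depth(bins, needed):
    """Return the strictest minimum depth (5 down to 1) that yields enough loci.

    bins: list of (depths, has_mirna) tuples.
    needed: dict mapping type name to the number of loci required.
    Returns None when even a depth of 1 is insufficient.
    """
    for min_depth in range(5, 0, -1):
        counts = {}
        for depths, has_mirna in bins:
            chosen = choose_bin_type(depths, has_mirna, min_depth)
            if chosen is not None:
                counts[chosen] = counts.get(chosen, 0) + 1
        if all(counts.get(t, 0) >= n for t, n in needed.items()):
            return min_depth
    return None
```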

        Once all bins have been chosen for a type, their genomic locations are
        output to a .txt file known as a pseudo-annotation, which will be used
        for subsequent steps.  This file may be reused for repeated simulations,
        as it is non-probabilistic for a given library.

    Read numbers and types

        30% of the simulated reads will come from roughly 100 MIRNA loci, 5%
        of the simulated reads will come from roughly 20 tasiRNA/phased
        secondary siRNA loci, and the remaining 65% from roughly 10,000
        heterochromatic siRNA loci.

        The abundance of reads from each locus is distributed on a log-linear
        scale (e.g., plotting the log10 of read number as a function of
        abundance rank yields a straight line).
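
        A minimal sketch of the read budget and the log-linear abundance
        curve follows. The function names are hypothetical, and the slope of
        the line (the 'decades' parameter, i.e. how many orders of magnitude
        separate the most and least abundant locus) is an assumption; the
        text only states that the relationship is a straight line.

```python
def split_reads(total_reads):
    """Divide the read total into the 30% / 5% / 65% shares from the text."""
    mirna = total_reads * 30 // 100
    tasi = total_reads * 5 // 100
    het = total_reads - mirna - tasi  # remainder goes to heterochromatic siRNA
    return {"MIRNA": mirna, "tasiRNA": tasi, "siRNA": het}


def log_linear_abundances(n_loci, total_reads, decades=3.0):
    """Return expected read counts per locus, ordered by abundance rank.

    log10(count) falls linearly with rank. 'decades' is an assumed
    parameter that is not given in the text.
    """
    if n_loci == 1:
        return [float(total_reads)]
    weights = [10.0 ** (-decades * rank / (n_loci - 1)) for rank in range(n_loci)]
    scale = total_reads / sum(weights)
    return [w * scale for w in weights]
```

        With the default -r 5 (5 million reads), split_reads(5000000) gives
        1,500,000 MIRNA reads, 250,000 tasiRNA reads and 3,250,000
        heterochromatic siRNA reads.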

    MIRNA simulation

        Valid MIRNA loci are selected from annotated MIRNA loci, based on the
        'pseudo-annotation'. The locus has a hypothetical size of 125nts.

        The strand of the MIRNA precursor is randomly selected, as is the arm
        from which the mature miRNA and star come. There is not necessarily an
        actual hairpin sequence present at MIRNA loci; it is only the pattern
        of reads that is being simulated.

        The mature miRNA and mature miRNA* are defined as 'master' positions.
        The left-most 'master' position in a locus is a 21-mer starting at
        position 17 of the locus. The right-most 'master' position is a 21 mer
        at position 85. Assignment of the arm (i.e., whether the left-most or
        right-most 'master' position is the miRNA or miRNA*) is random at each
        locus.

        Once a locus has been found and 'master' positions defined, each read is
        simulated according to the following probabilities. In the following
        list, "miR" means mature miRNA, "star" means miRNA*. The numbers after
        each indicate the offset at 5' and 3' ends relative to the master
        positions. So, "miR0:1" means the mature miRNA sequence, starting at the
        master 5' end, and ending 1 nt after the master 3' end.

        60%  miR0:0, 20% star0:0, 4% miR0:-1, 1% miR0:-2, 4% miR1:1, 1%
        miR1:0, 2% miR-1:-1, 0.5% miR-1:-2, 2% miR-2:-2, 0.5% miR-2:-3, 1%
        star0:-1, 0.5% star0:-2, 1% star1:1, 0.5% star1:0, 2% star-1:-1.

        Sampling of simulated MIRNA-derived reads continues until the required
        number of reads for a particular locus is recovered.
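
        The sampling scheme above can be sketched as a weighted draw over
        the listed variants. This is an illustration under assumptions, not
        the program's code: coordinates are 1-based positions within the
        125 nt locus in the orientation of the precursor, strand handling is
        left out, and the names used are hypothetical. It uses
        random.choices, so it needs Python 3.6 or later.

```python
import random

MASTER_LENGTH = 21
LEFT_MASTER_START = 17   # left-most master 21-mer starts here
RIGHT_MASTER_START = 85  # right-most master 21-mer starts here

# (kind, 5' offset, 3' offset, percent), copied from the list in the text.
VARIANTS = [
    ("miR", 0, 0, 60.0), ("star", 0, 0, 20.0),
    ("miR", 0, -1, 4.0), ("miR", 0, -2, 1.0),
    ("miR", 1, 1, 4.0), ("miR", 1, 0, 1.0),
    ("miR", -1, -1, 2.0), ("miR", -1, -2, 0.5),
    ("miR", -2, -2, 2.0), ("miR", -2, -3, 0.5),
    ("star", 0, -1, 1.0), ("star", 0, -2, 0.5),
    ("star", 1, 1, 1.0), ("star", 1, 0, 0.5),
    ("star", -1, -1, 2.0),
]


def simulate_mirna_read(mir_on_left, rng):
    """Draw one read; return (kind, start, end) within the locus.

    mir_on_left: True if the left master position is the mature miRNA
    (the arm assignment is random per locus, decided by the caller).
    """
    weights = [v[3] for v in VARIANTS]
    kind, off5, off3, _ = rng.choices(VARIANTS, weights=weights, k=1)[0]
    on_left = mir_on_left if kind == "miR" else not mir_on_left
    master_start = LEFT_MASTER_START if on_left else RIGHT_MASTER_START
    start = master_start + off5
    end = master_start + MASTER_LENGTH - 1 + off3
    return kind, start, end
```

        For instance, with the miRNA on the left arm, a "miR0:-1" draw gives
        positions 17 to 36 (a 20-mer), and a "star0:0" draw gives positions
        85 to 105.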

    tasiRNA/phased siRNA simulation

        TAS loci are selected from bins chosen in the 'pseudo-annotation'. Each
        locus has a nominal size of 140nts.

        Each locus is simulated to be diced in six 21 nt phases. At each phasing
        position, 21mers are the dominant size, with 20mers and 22mers being
        less frequent. The 20 and 22nt variants vary in their 3' positions
        relative to the 'master' 21nt RNAs.

        Once a locus has been identified, and all possible 20, 21, and 22mers
        charted, each read is simulated according to the following
        probabilities:

        Strand of origin is 50% top, 50% bottom.

        Phase position is equal chance for all (e.g. 1/6 chance for any
        particular phase location).

        80% of the time, the 21mer is returned, 10% of the time the 20mer, and
        10% the 22 mer.
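
        The three draws above (strand, phase, size) can be sketched as
        follows. The function name is hypothetical. Coordinates are given as
        0-based offsets from the 5' end of the first phase, in the
        orientation of the read's own strand; where the phase register sits
        inside the 140 nt locus, and how the bottom strand is offset, are
        not stated in the text and are not modelled here. Requires Python
        3.6 or later for random.choices.

```python
import random

PHASE_LENGTH = 21
N_PHASES = 6


def simulate_tasi_read(rng):
    """Draw one phased siRNA read as a dict of its simulated properties."""
    strand = rng.choice(["+", "-"])                 # 50% top, 50% bottom
    phase = rng.randrange(N_PHASES)                 # 1/6 chance per phase
    length = rng.choices([21, 20, 22], weights=[80, 10, 10], k=1)[0]
    start = phase * PHASE_LENGTH                    # 5' end is fixed by phase
    end = start + length - 1                        # 3' end varies with size
    return {"strand": strand, "phase": phase, "length": length,
            "start": start, "end": end}
```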

    Heterochromatic siRNA simulation

        Heterochromatic siRNA loci are selected from bins chosen in the 'pseudo-
        annotation'. Their nominal locus size is 200-1000 nts, logarithmically
        weighted to smaller loci.

        Precursors are simulated as 35-60nt regions, logarithmically weighted to
        smaller precursors. Eligible products from a locus are 21-24nt in
        length, coming from either strand and either end of the precursor. Each
        simulated read is chosen with the following probabilities:

        50% chance for top or bottom strand origin.

        50% chance for left or right end of precursor.

        90% chance of a 24 mer, 5% chance of a 23 mer, 3% chance of a 22 mer,
        and 2% chance of a 21 mer.
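
        A sketch of the per-read draws for a heterochromatic siRNA follows.
        The function name is hypothetical and the precursor coordinates are
        taken as inputs, because the exact logarithmic weighting of locus
        and precursor sizes is not specified in the text. Requires Python
        3.6 or later for random.choices.

```python
import random


def simulate_het_read(precursor_start, precursor_end, rng):
    """Draw one read from a precursor; return (strand, start, end).

    Coordinates are inclusive genomic positions. The read is anchored to
    either the left or the right end of the precursor.
    """
    strand = rng.choice(["+", "-"])                 # top or bottom strand
    side = rng.choice(["left", "right"])            # which precursor end
    length = rng.choices([24, 23, 22, 21], weights=[90, 5, 3, 2], k=1)[0]
    if side == "left":
        start = precursor_start
        end = start + length - 1
    else:
        end = precursor_end
        start = end - length + 1
    return strand, start, end
```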

    Simulation of sequencing errors

        Regardless of small RNA type, once a given read has been simulated,
        there is a chance of introducing a single-nucleotide substitution
        relative to the reference genome at a randomly selected position in the
        read. The chance of introducing this error is given by option -e
        (Default is 1 in 10,000).
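
        The error step can be sketched as below. The function name is
        hypothetical, and the sketch assumes reads are upper-case A/C/G/T
        and that the substituted base is drawn uniformly from the three
        alternatives, which the text does not specify.

```python
import random


def add_sequencing_error(read, error_rate, rng):
    """Return (read, n_errors) after at most one substitution.

    error_rate: per-read probability of a single substitution (option -e).
    """
    if not read or rng.random() >= error_rate:
        return read, 0
    pos = rng.randrange(len(read))                  # random position in read
    alternatives = [b for b in "ACGT" if b != read[pos]]
    mutated = read[:pos] + rng.choice(alternatives) + read[pos + 1:]
    return mutated, 1
```

        The second return value corresponds to the error count stored in the
        last field of each simulated read's FASTA header.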

    No overlapping loci

        None of the simulated loci are allowed to have any overlap with each
        other.

    Filtering of sequences

        Simple filtering is applied to remove sequences that are ambiguous or
        easily identifiable as highly repetitive.  Ambiguous sequences are
        those containing any non-ATGC base, such as "N". Dinucleotide repeats
        are defined as 8 uninterrupted pairs of the same 2 bases, e.g.
        'ATATATATATATATAT', and are also filtered.  No other sequences are
        filtered, to avoid bias in selecting sequences.
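
        The two filters can be expressed with regular expressions, as in
        this sketch. The function name is hypothetical. One assumption to
        note: the pattern treats any 2-base unit repeated 8 times as a
        dinucleotide repeat, so a 16 nt homopolymer such as 'AAAAAAAAAAAAAAAA'
        is also caught; the text does not say whether that is intended.

```python
import re

# Any non-ATGC character marks the sequence as ambiguous.
AMBIGUOUS = re.compile(r"[^ACGT]")
# A 2-base unit followed by 7 more copies of itself: 8 uninterrupted pairs.
DINUC_REPEAT = re.compile(r"([ACGT]{2})\1{7}")


def is_filtered(seq):
    """Return True if the sequence should be removed by the simple filters."""
    seq = seq.upper()
    if AMBIGUOUS.search(seq):
        return True
    return DINUC_REPEAT.search(seq) is not None
```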

OUTPUT   

    Files

        The following files are created in the working directory.  Option -o
        defines the [out] identifier.

        sim_[out].bam : A binary alignment file of the input template library.

        pseudo_[out].txt : A pseudo-annotation text file, identifying bins for
        simulated reads.

        sim_[out].txt : A tab-delimited text file giving the coordinates and
        names for each simulated locus.

        sim_[out].fa : A FASTA file of all of the simulated reads.

        [input_fasta]_mmap.fa : A FASTA file containing reads which were found to multi-map.

        [input_fasta]_unique.fa : A FASTA file containing uniquely mapping reads.

        [input_fasta]_unaligned.fa : A FASTA file containing reads which failed to align to the genome.

    Naming conventions

        The FASTA header of each simulated read encodes the basic information
        about the read. Several different fields are separated by "_"
        characters. For instance, the read header
        ">MIRNA_2_2_chr7:123063894-123063914_+_0" means:

        MIRNA: This came from a simulated MIRNA locus

        2: The 2nd simulated locus of this type

        2: Read number 2 from this locus

        chr7:123063894-123063914: The true origin of this read.

        +: The genomic strand of this read.

        0: The number of sequencing errors simulated into the read.
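
        When scoring an aligner against the simulated reads, the header can
        be unpacked as in the sketch below. The function name and the keys
        of the returned dict are hypothetical. The header is split from both
        ends so that chromosome names containing "_" still parse; this
        assumes the type name itself contains no "_".

```python
def parse_sim_header(header):
    """Split a simulated read header into its named fields."""
    body = header.lstrip(">").strip()
    # First three fields from the left: type, locus number, read number.
    rna_type, locus, read_number, rest = body.split("_", 3)
    # Last two fields from the right: strand and error count.
    location, strand, errors = rest.rsplit("_", 2)
    chrom, _, span = location.rpartition(":")
    start, end = span.split("-")
    return {
        "type": rna_type,
        "locus": int(locus),
        "read": int(read_number),
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "errors": int(errors),
    }
```

        For the example header above, this yields chrom "chr7", start
        123063894, end 123063914, strand "+" and 0 errors.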
