Generates simulated plant sRNA-seq data based on real input libraries
This program is based on a script by my graduate advisor Mike Axtell: https://github.com/MikeAxtell/sim_srna-seq. Its purpose is to produce simulated plant sRNA-seq data sets as might be produced from short read sequencing on an Illumina-type system, but with a traceable origin. This is useful when assaying the quality of an aligner.
This script has several advantages over its predecessor:
- Simulations are based on real sRNA-seq libraries. This means that reads come from regions of the genome known to produce that size-class of sRNA, which rules out regions that do not produce sRNAs and might skew the simulation.
- Output provides more files, including FASTA files of reads classified by their multi-mapping status.
This simulator was used extensively in the testing of ShortStack 3.x (https://github.com/MikeAxtell/ShortStack), which is described in the publication: https://doi.org/10.1534/g3.116.030452.
SYNOPSIS
sim_sRNA_library.py - v0.3
simulates small-RNA seq libraries based on a template library
Copyright (C) 2015 Nathan R. Johnson; Michael J. Axtell
AUTHORS
Nathan R. Johnson, Penn State University, jax523@gmail.com
Michael J. Axtell, Penn State University, mja18@psu.edu
LICENSE
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
DEPENDENCIES (all in PATH)
python 2.6.6
samtools 1.1
bowtie 1.0
USAGE
sim_sRNA_library.py [options] -f sample_lib.fasta -m species_specific_miRNA_annotation.gff3 -g indexed_reference_genome
--- OR ---
sim_sRNA_library.py [options] -b sample_lib_alignment_(all).bam -m sim_sRNA_library_specific_miRNA_annotation.gff3 -g indexed_reference_genome
--- OR ---
sim_sRNA_library.py [options] -p psuedo_annotation.txt -g indexed_reference_genome
REQUIRED
-f : fasta file of template sRNA library, already adapter trimmed. This will be the basis for producing a pseudo-annotation
-s : sam file of aligned sRNA library. For proper use of the program, this should be produced using: bowtie --all --best --strata -v 0
-m : annotation of known mature miRNAs for the species of interest, gff3 format. Available from miRBase.org
-p : location of pseudo-annotation file (.txt), generated by this program
OPTIONS
-o : output identifier for pseudo-annotation and simulated library generation. Default: random.
-v : print version number and quit
-h : print help message and quit
-r : desired total number of reads, in millions. Default: 5
-e : per-read probability of a single nt sequencing error (substitution). Default: 1E-4
--suppress_percent : when called, percent progress bars will not be displayed.
METHODS
Approach
This method uses real sRNA-seq data as a template to generate simulated
sequencing data of user-selectable number of reads. This has been
developed to test the accuracy of small RNA alignment methods, as the
known origin of a read allows a researcher to know the rate of
misaligned reads.
Genome
There is no strict requirement for genomic masking in this program, but
it is recommended to use non- or soft-masked genomes, as sRNAs may come
from highly repetitive portions of the genome. It is imperative that
the reference genome and miRNA annotation (.gff3) use the same convention
for chromosome names.
Additionally, the genome must be indexed for bowtie, using the
bowtie-build module. This program expects bowtie indices to be present in
the reference genome folder, looking for the presence of the following
file as proof:
reference: Osativa.fa
ebwt proof: Osativa.fa.1.ebwt
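The index check described above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function name has_bowtie_index is hypothetical, and it assumes that the presence of the '.1.ebwt' file alone is taken as proof of a complete index.

```python
import os


def has_bowtie_index(reference_path):
    """Return True if the first bowtie index file sits next to the reference.

    Assumption: only '<reference>.1.ebwt' is checked, as in the example
    above; the other index files are not inspected.
    """
    return os.path.isfile(reference_path + ".1.ebwt")
```

For the example above, has_bowtie_index("Osativa.fa") would look for Osativa.fa.1.ebwt in the same folder.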
Template annotation - miRNA
Identification of candidate locations for miRNAs comes from annotation
data in the form of a .gff3 file. These data are available for many
species at mirbase.org.
Template library - si, tasi, non-sRNA
sRNA-seq data in fasta format is used as the template input. This
library should already be adapter trimmed. The library is then aligned
using bowtie --all --best --strata -v 0, reporting all alignments of
every read and saving it as a .bam, binary alignment file. This file may
be used for subsequent runs of the program to save alignment time.
This template is used in the identification of candidate locations for
heterochromatic siRNA, tasiRNA and garbage RNA (fragmented or non-sRNA).
Pseudo-annotation construction
Using the alignment and annotation data from the previous steps, the
program constructs a list of candidate 'bins', 150 nt in length. All
reads falling within a bin will be stored as a depth value for their
given type. These types are defined simply by size, with siRNA
encompassing 23 and 24 nt reads, tasiRNA encompassing 21 nt reads, and
garbage RNA encompassing reads between 15 and 20 nt, as well as over 24
nt in length.
Once depths have been completely assigned, each bin will be chosen for
only one type of small RNA.
To be chosen for a given type, that type must hold an 80% majority of
the reads in that bin, so that expression level is taken into account.
Any bin containing a miRNA read will automatically be chosen as miRNA and
is therefore excluded from the other types. A minimum depth is also
required for a bin to be chosen for a type. The minimum depth is 1-5
reads and is selected automatically by the program, which picks the
strictest requirement that still allows enough loci for the later
simulation steps. If a minimum depth of 1 does not allow enough loci for
later steps, the program deems the library to be of too low quality to
be an adequate template.
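The size classes and the bin-selection rule above can be illustrated with a short sketch. It is not the program's actual code, and several details are assumptions made for the example: the function names are hypothetical, "80% majority" is read as "at least 80% of all reads in the bin", the minimum depth is applied to the depth of the winning type, and read lengths not named in the text (22 nt, or shorter than 15 nt) are left unassigned.

```python
def size_class(length):
    """Map a read length to the size-based type described above."""
    if length in (23, 24):
        return "siRNA"
    if length == 21:
        return "tasiRNA"
    if 15 <= length <= 20 or length > 24:
        return "garbage"
    return None  # lengths the text does not assign to any type


def choose_bin_type(depths, has_mirna=False, min_depth=1, majority=0.8):
    """Pick the single type for one 150 nt bin, or None if no type qualifies.

    depths maps a type name to the number of reads of that type in the bin.
    """
    if has_mirna:
        return "miRNA"  # bins with a miRNA read are always miRNA
    total = sum(depths.values())
    if total == 0:
        return None
    for rna_type, depth in depths.items():
        if depth >= min_depth and depth / float(total) >= majority:
            return rna_type
    return None
```

With these assumptions, a bin holding 9 siRNA-sized reads and 1 tasiRNA-sized read is chosen as siRNA, while a bin split 5 to 5 is chosen for no type.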
Once all bins have been chosen for a type, their genomic locations are
output to a .txt file known as a pseudo-annotation, which will be used
for subsequent steps. This file may be reused for repeated simulations,
as it is non-probabilistic for a given library.
Read numbers and types
30% of the simulated reads will come from roughly 100 MIRNA loci, 5%
of the simulated reads will come from roughly 20 tasiRNA/phased
secondary siRNA loci, and the remaining 65% from roughly 10,000
heterochromatic siRNA loci.
The abundance of reads from each locus is distributed on a log-linear
scale (e.g., plotting the log10 of read number as a function of
abundance rank yields a straight line).
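A log-linear abundance distribution of this kind can be sketched as follows. This is an illustration only: the function name and the decades parameter (how many orders of magnitude separate the most and least abundant locus) are assumptions, since the text does not state the slope of the line.

```python
def log_linear_abundances(n_loci, total_reads, decades=3.0):
    """Expected read number per locus, ordered by abundance rank.

    Rank 0 is the most abundant locus. log10 of the returned values falls
    on a straight line when plotted against rank. 'decades' is a
    hypothetical parameter chosen for this sketch.
    """
    if n_loci == 1:
        return [float(total_reads)]
    weights = [10 ** (-decades * rank / float(n_loci - 1))
               for rank in range(n_loci)]
    scale = total_reads / sum(weights)
    return [w * scale for w in weights]
```

For a 5 million read library, the MIRNA share would be log_linear_abundances(100, 1500000), that is, 30% of the reads spread over roughly 100 loci.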
MIRNA simulation
Valid MIRNA loci are selected from annotated MIRNA loci, based on the
'pseudo-annotation'. Each locus has a hypothetical size of 125 nt.
The strand of the MIRNA precursor is randomly selected, as is the arm
from which the mature miRNA and star come. There is not necessarily an
actual hairpin sequence present at MIRNA loci; it is only the
pattern of reads that is being simulated.
The mature miRNA and mature miRNA* are defined as 'master' positions.
The left-most 'master' position in a locus is a 21-mer starting at
position 17 of the locus. The right-most 'master' position is a 21 mer
at position 85. Assignment of the arm (i.e., whether the left-most or
right-most 'master' position is the miRNA or miRNA*) is random at each
locus.
Once a locus has been found and 'master' positions defined, each read is
simulated according to the following probabilities. In the following
list, "miR" means mature miRNA, "star" means miRNA*. The numbers after
each indicate the offset at 5' and 3' ends relative to the master
positions. So, "miR0:1" means the mature miRNA sequence, starting at the
master 5' end, and ending 1 nt after the master 3' end.
60% miR0:0, 20% star0:0, 4% miR0:-1, 1% miR0:-2, 4% miR1:1, 1%
miR1:0, 2% miR-1:-1, 0.5% miR-1:-2, 2% miR-2:-2, 0.5% miR-2:-3, 1%
star0:-1, 0.5% star0:-2, 1% star1:1, 0.5% star1:0, 2% star-1:-1.
Sampling of simulated MIRNA-derived reads continues until the required
number of reads for a particular locus is recovered.
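The sampling scheme above can be sketched as a weighted draw over the listed variants. This is a simplified illustration rather than the program's code. Assumptions: positions are 1-based and inclusive within the locus (so the left master 21-mer spans 17-37 and the right one 85-105), offsets are applied in top-strand coordinates only, and all names (VARIANTS, LEFT_MASTER, RIGHT_MASTER, pick_variant, simulate_mirna_read) are hypothetical.

```python
import random

# (species, 5' offset, 3' offset, percent), copied from the list above.
VARIANTS = [
    ("miR", 0, 0, 60.0), ("star", 0, 0, 20.0),
    ("miR", 0, -1, 4.0), ("miR", 0, -2, 1.0),
    ("miR", 1, 1, 4.0), ("miR", 1, 0, 1.0),
    ("miR", -1, -1, 2.0), ("miR", -1, -2, 0.5),
    ("miR", -2, -2, 2.0), ("miR", -2, -3, 0.5),
    ("star", 0, -1, 1.0), ("star", 0, -2, 0.5),
    ("star", 1, 1, 1.0), ("star", 1, 0, 0.5),
    ("star", -1, -1, 2.0),
]

LEFT_MASTER = (17, 37)    # assumed 1-based, inclusive 21-mer
RIGHT_MASTER = (85, 105)  # assumed 1-based, inclusive 21-mer


def pick_variant(rng=random):
    """Draw one variant according to the percentages in VARIANTS."""
    draw = rng.uniform(0, 100)
    running = 0.0
    for variant in VARIANTS:
        running += variant[3]
        if draw < running:
            return variant
    return VARIANTS[-1]


def simulate_mirna_read(mir_is_left, rng=random):
    """Return (species, start, end) for one read, in locus coordinates."""
    name, off5, off3, _ = pick_variant(rng)
    use_left = mir_is_left if name == "miR" else not mir_is_left
    start, end = LEFT_MASTER if use_left else RIGHT_MASTER
    return name, start + off5, end + off3
```

Calling simulate_mirna_read repeatedly until a locus has its required number of reads mirrors the sampling loop described above. The percentages in VARIANTS sum to 100.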
tasiRNA/phased siRNA simulation
TAS loci are selected from bins chosen in the 'pseudo-annotation'. Each
locus has a nominal size of 140nts.
Each locus is simulated to be diced in six 21 nt phases. At each phasing
position, 21mers are the dominant size, with 20mers and 22mers being
less frequent; the 20 and 22 nt variants vary in their 3' positions
relative to the 'master' 21 nt RNAs.
Once a locus has been identified, and all possible 20, 21, and 22mers
charted, each read is simulated according to the following
probabilities:
Strand of origin is 50% top, 50% bottom.
Phase position is equal chance for all (e.g. 1/6 chance for any
particular phase location).
80% of the time, the 21mer is returned, 10% of the time the 20mer, and
10% the 22 mer.
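The three draws above (strand, phase, length) can be sketched as below. This is an illustration under stated assumptions, not the program's code: the function name and the first_phase_start parameter are hypothetical, coordinates are 1-based and inclusive within the locus, the length variation is applied at the 3' end in the read's own orientation, and any offset between top and bottom strand phasing registers is ignored.

```python
import random


def simulate_tasi_read(rng=random, first_phase_start=1):
    """Return (strand, start, end) for one simulated phased siRNA read."""
    strand = "+" if rng.random() < 0.5 else "-"  # 50% top, 50% bottom
    phase = rng.randrange(6)                     # 1/6 chance per phase
    draw = rng.random()
    if draw < 0.8:
        length = 21                              # 80% 21mer
    elif draw < 0.9:
        length = 20                              # 10% 20mer
    else:
        length = 22                              # 10% 22mer
    start = first_phase_start + 21 * phase
    return strand, start, start + length - 1
```

With first_phase_start=1, the six phases occupy positions 1-126, which fits inside the nominal 140 nt locus even when a 22mer is drawn from the last phase.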
Heterochromatic siRNA simulation
Heterochromatic siRNA loci are selected from bins chosen in the 'pseudo-
annotation'. Their nominal locus size is 200-1000 nts, logarithmically
weighted to smaller loci.
Precursors are simulated as 35-60 nt regions, logarithmically weighted to
smaller precursors. Eligible products from a locus are 21-24 nt in length,
coming from either strand and either end of the precursor. Each simulated
read is generated with the following probabilities:
50% chance each for top or bottom strand origin.
50% chance each for the left or right end of the precursor.
90% chance of a 24 mer, 5% chance of a 23 mer, 3% chance of a 22 mer,
and 2% chance of a 21 mer.
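A rough sketch of these draws follows. It is not the program's code. In particular, "logarithmically weighted" is interpreted here as a log-uniform draw, which is only one possible reading; the function names are hypothetical, and coordinates are treated as 1-based, inclusive, top-strand positions.

```python
import math
import random


def log_weighted_int(low, high, rng=random):
    """Draw an integer in [low, high], favouring small values.

    Assumption: a log-uniform draw stands in for the 'logarithmically
    weighted' choice described in the text.
    """
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    return min(high, max(low, int(round(value))))


def simulate_het_read(precursor_start, precursor_len, rng=random):
    """Return (strand, start, end) for one read from a precursor."""
    strand = "+" if rng.random() < 0.5 else "-"  # top or bottom strand
    from_left = rng.random() < 0.5               # left or right end
    draw = rng.random()
    if draw < 0.90:
        length = 24
    elif draw < 0.95:
        length = 23
    elif draw < 0.98:
        length = 22
    else:
        length = 21
    if from_left:
        start = precursor_start
    else:
        start = precursor_start + precursor_len - length
    return strand, start, start + length - 1
```

A precursor length could be drawn with log_weighted_int(35, 60) and a locus size with log_weighted_int(200, 1000).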
Simulation of sequencing errors
Regardless of small RNA type, once a given read has been simulated,
there is a chance of introducing a single-nucleotide substitution
relative to the reference genome at a randomly selected position in the
read. The chance of introducing this error is given by option -e
(Default is 1 in 10,000).
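The error model above amounts to one random draw per read. A minimal sketch, assuming reads consist of upper-case A, C, G and T and using a hypothetical function name:

```python
import random


def add_sequencing_error(read, error_rate=1e-4, rng=random):
    """Return (read, n_errors), with at most one substitution introduced.

    error_rate plays the role of option -e: the per-read chance of a
    single-nucleotide substitution at a randomly selected position.
    """
    if not read or rng.random() >= error_rate:
        return read, 0
    pos = rng.randrange(len(read))
    # Substitute with one of the three bases that differ from the original.
    alternatives = [base for base in "ACGT" if base != read[pos]]
    return read[:pos] + rng.choice(alternatives) + read[pos + 1:], 1
```

The second return value corresponds to the error count recorded as the last field of the FASTA header.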
No overlapping loci
None of the simulated loci are allowed to have any overlap with each
other.
Filtering of sequences
Simple filtering is applied to remove sequences that are ambiguous or
easily identifiable as highly repetitive. Ambiguous sequences are those
containing any "N" (non-ATGC) bases. Dinucleotide repeats, defined as 8
uninterrupted pairs of the same 2 bases (e.g. 'ATATATATATATATAT'), are
also filtered. No other sequences are filtered, to avoid bias in
selecting sequences.
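Both filters can be expressed as regular expressions. The sketch below is an illustration with hypothetical names. Note one assumption: the repeat pattern also matches a 16 nt homopolymer such as 'AAAAAAAAAAAAAAAA', since that is also 8 copies of the same pair; the text does not say whether the program treats that case as a dinucleotide repeat.

```python
import re

AMBIGUOUS = re.compile(r"[^ACGT]")              # any non-ATGC character
DINUC_REPEAT = re.compile(r"([ACGT]{2})\1{7}")  # 8 uninterrupted copies


def passes_filter(seq):
    """Return True if seq is unambiguous and has no 8x dinucleotide repeat."""
    seq = seq.upper()
    return not AMBIGUOUS.search(seq) and not DINUC_REPEAT.search(seq)
```

For example, passes_filter("ATATATATATATATAT") is False, while a run of only 7 'AT' pairs passes.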
OUTPUT
Files
The following files are created in the working directory. Option -o
defines the [out] identifier.
sim_[out].bam : A binary alignment file of the input template library.
pseudo_[out].txt : A pseudo-annotation text file, identifying bins for
simulated reads.
sim_[out].txt : A tab-delimited text file giving the coordinates and
names for each simulated locus.
sim_[out].fa : A FASTA file of all of the simulated reads.
[input_fasta]_mmap.fa : A FASTA file containing reads which were found to multi-map.
[input_fasta]_unique.fa : A FASTA file containing uniquely mapping reads.
[input_fasta]_unaligned.fa : A FASTA file containing reads which failed to align to the genome.
Naming conventions
The FASTA header of each simulated read encodes the basic information
about the read. Several different fields are separated by "_"
characters. For instance, the following read header
">MIRNA_2_2_chr7:123063894-123063914_+_0" means:
MIRNA: This came from a simulated MIRNA locus
2: The 2nd simulated locus of this type
2: Read number 2 from this locus
chr7:123063894-123063914: The true origin of this read.
+: The genomic strand of this read.
0: The number of sequencing errors simulated into the read.
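When scoring an aligner, the true origin has to be recovered from these headers. A minimal parsing sketch follows; the function name and dictionary keys are hypothetical. It assumes the type label itself contains no "_" character, and it splits from both ends so that chromosome names containing "_" remain intact.

```python
def parse_read_header(header):
    """Split a simulated read header into its six fields."""
    # Strand and error count are the last two "_"-separated fields.
    rest, strand, errors = header.lstrip(">").rsplit("_", 2)
    # Type, locus index and read number are the first three fields.
    rna_type, locus, read, coords = rest.split("_", 3)
    chrom, span = coords.rsplit(":", 1)
    start, end = span.split("-")
    return {
        "type": rna_type,
        "locus": int(locus),
        "read": int(read),
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "errors": int(errors),
    }
```

For the example header, this gives chrom "chr7", start 123063894 and end 123063914, a 21 nt read on the + strand with 0 errors.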