warning: this project is in active development and the readme may be out of date.
This pipeline adds to a wide field of tools which have been used to assess small-RNA-sequencing data. Yasma is a genome-based de novo approach which is focused on the whole sRNA population, not just a few classes like miRNAs.
There are other approaches that follow a similar strategy, namely ShortStack. Yasma tries to solve some persistent issues with these approaches, which appear to be exacerbated in challenging systems, such as sRNAs in Fungi.
There is a manuscript in preparation detailing the value of the approach presented here. This will be updated with a bioRxiv link when submitted to a journal (hopefully very soon).
- Over-merging of distinct, but closely oriented loci.
- Creeping annotations which don't represent the shape of an expressed region.
- Under-merging where numerous similar loci are annotated separately due to sequencing gaps.
- Sensitivity to the depth of a sRNA library relative to its assembly size. This seems to be particularly problematic in fungi.
Yasma relies on sRNA alignments to form genomic annotations. Alignments are performed using ShortStack3/4's alignment protocol (based on bowtie1), which is well supported by Johnson et al 2016.
Annotation based on these alignments follows a multi-step approach:
- Building a sRNA coverage profile.
- Calculating the RPM threshold which best balances annotating the most reads in the smallest genomic space.
- Building a profile of genomic-regions which are sufficiently deep based on this threshold.
- Merging of peaks which have similar sRNA profiles.
This results in contiguous loci which are more homogenous in profile. It also tends to avoid over-annotation of background sequences.
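The thresholding and region-building steps can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not Yasma's implementation: raw per-base depth stands in for RPM, and the "best" threshold is assumed to be the one that maximizes the fraction of reads kept minus the fraction of the genome annotated. The function names and the scoring rule are hypothetical, and the peak-merging step is not covered.

```python
def tradeoff_threshold(coverage):
    """Pick the depth threshold that maximizes reads kept minus genome used.

    `coverage` is a per-position list of depths. The scoring rule is an
    assumption made for illustration, not Yasma's actual calculation.
    """
    total_depth = sum(coverage)
    genome_size = len(coverage)
    if total_depth == 0 or genome_size == 0:
        return None  # nothing aligned, so no threshold can be chosen

    best_threshold, best_score = None, float("-inf")
    for threshold in sorted(set(coverage)):
        if threshold == 0:
            continue
        passing = [depth for depth in coverage if depth >= threshold]
        frac_reads = sum(passing) / total_depth    # share of signal retained
        frac_genome = len(passing) / genome_size   # share of genome annotated
        score = frac_reads - frac_genome
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


def regions_above(coverage, threshold):
    """Return half-open (start, end) intervals where depth >= threshold."""
    regions, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= threshold and start is None:
            start = pos
        elif depth < threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions
```

With the coverage `[0, 1, 0, 0, 20, 25, 22, 0, 1, 0, 0, 18, 19, 0, 0, 1]`, this sketch selects a threshold of 18 and reports the regions `(4, 7)` and `(11, 13)`, leaving the isolated single-read positions unannotated.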
Yasma is written in Python 3.x. It is not yet in any package manager, but it is fairly easy to install directly from GitHub. This should work on Linux/Unix systems, though I'm sure bugs will crop up (please open an issue!).
## cloning the repo with git
git clone https://github.com/NateyJay/YASMA.git
## moving it somewhere permanent.
mv ./YASMA /usr/local/
## adding this to $PATH - you will want to add the following line to your ~/.bashrc (linux) or ~/.bash_profile (mac). You can open it using: nano ~/.bash_profile
## export PATH="/usr/local/YASMA:$PATH"
## sourcing the new ~/.bash_profile file
source ~/.bash_profile
This is a useful tool for managing github repos you use. Great for those that like command-line and github, but find the git and gh cli tools cumbersome.
With this, you can download the repo directly from NateyJay/YASMA. You will also need to add the folder to your path, but probably don't want to move it.
If you don't want to use git for some reason, you can download the latest release with curl. This may have a different directory name so you need to adjust accordingly.
curl -L -O https://github.com/NateyJay/YASMA/archive/refs/tags/v0.1.0-beta.zip
unzip v0.1.0-beta
## move and add to path as above.
Yasma makes use of many tools through wrappers, as well as several non-standard python modules. Most of these should be easy enough to install.
click is required, as it manages the command-line interface for the tool.
python3 -m pip install click
python3 -m pip install click-option-group
pyBigWig allows python-native functions with bigwig files.
python3 -m pip install pyBigWig
# or
conda install pybigwig -c conda-forge -c bioconda
Most of these are required for basic functions, and each module will inform you if you are missing something. Yasma expects each of these to be executable from the PATH.
- samtools
- bowtie (1)
- ShortStack (3 or 4)
- rnafold from the ViennaRNA package.
- cutadapt
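If you want to check your environment before running anything, a small script along these lines can report which tools are not visible on the PATH. This is a sketch, not part of Yasma: the executable names are copied from the list above and are assumptions about how they are installed on your system (for example, ViennaRNA usually installs its folding program as RNAfold, and the ShortStack wrappers expect version-specific names).

```python
import shutil

# Names taken from the list above; adjust them to match your installation.
REQUIRED_TOOLS = ["samtools", "bowtie", "ShortStack", "rnafold", "cutadapt"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```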
Yasma is organized into several modules, made with the CLI-module click. These modules are organized into several major sections which are generally ordered by processing step:
Commands:
Preliminary:
inputs A tool to log inputs, which will be...
Processing:
adapter Tool to check untrimmed-libraries for 3'...
trim Wrapper for trimming using cutadapt.
align Wrapper for alignment using ShortStack/bowtie.
Annotation:
tradeoff Annotator using large coverage window and...
Calculation:
context Compares annotations to identify cluster...
count Gets counts for all readgroups, loci, strand,...
hairpin Evaluates annotated loci for hairpin or miRNA...
jbrowse Tool to build coverage and config files for...
Utilities:
subsample Utility to subsample libraries to a specific...
cram-to-bam Changes crams to bam alignments.
normalize-alignment-name Fixes old alignment file names.
size-profile Convenience function for calculating aligned...
readgroups Convenience function to list readgroups in an...
Ann. wrappers:
shortstack3 Wrapper for annotation using ShortStack3.
shortstack4 Wrapper for annotation using ShortStack4.
To help with ease of use, Yasma orients all of its analyses around a directory. Files produced and referenced by yasma are all stored in the config.json file, using relative paths. Analyses that produce outputs will automatically update this file, meaning you need not manually transmit information from one module to the next (for example: finding an adapter sequence, then trimming the libraries with it).
config.json is human-readable and can be modified manually fairly easily, though this is not normally advisable.
All modules will automatically produce config.json if it is not found, and use lazy evaluation looking for included values. This makes it easy to jump in at a later step if you have done prior analyses separately.
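The general pattern of a directory-level config file that is created on demand and consulted only when a value is needed can be sketched as follows. This is an illustration of the idea, not Yasma's code: the function names and the "adapter" key are hypothetical, and the real config.json layout may differ.

```python
import json
from pathlib import Path


def load_config(output_directory):
    """Load config.json from the directory, creating an empty one if absent."""
    config_path = Path(output_directory) / "config.json"
    if not config_path.exists():
        config_path.parent.mkdir(parents=True, exist_ok=True)
        config_path.write_text(json.dumps({}, indent=2))
    return json.loads(config_path.read_text())


def resolve_input(config, key, cli_value=None):
    """Prefer a value given on the command line, else fall back to the config."""
    value = cli_value if cli_value is not None else config.get(key)
    if value is None:
        raise ValueError(f"'{key}' was not given and is not in config.json")
    return value


def save_value(output_directory, key, value):
    """Record a value so later steps can find it without being told again."""
    config = load_config(output_directory)
    config[key] = value
    config_path = Path(output_directory) / "config.json"
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

In this sketch, one step could call `save_value(directory, "adapter", sequence)` and a later step could call `resolve_input(load_config(directory), "adapter")` without the user passing the sequence again.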
Modules can be run simply with yasma.py [module] -o output_directory_path [...]. The only required option for all modules is -o, --output_directory, and yasma will automatically tell you if you are missing any other inputs.
Inputs is not a required step, but it can be a major time-saver. Basically, it produces a file config.json which can store all input files for your analysis. If you specify these files here, you need not call them in subsequent steps.
This also lets you know if there are incongruities in your data. For example, it compares chromosome names found in your reference genome to a gene annotation, showing if they don't match. This can frequently be a real time-saver as it catches common errors.
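The chromosome-name check described above amounts to comparing two sets of sequence names. A minimal sketch, assuming a FASTA reference and a tab-delimited GFF3 annotation, and using hypothetical function names rather than Yasma's own:

```python
def fasta_names(fasta_lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">") and len(line.strip()) > 1:
            names.add(line[1:].split()[0])
    return names


def gff_names(gff_lines):
    """Collect sequence names from column 1 of GFF3 feature lines."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        names.add(line.split("\t")[0])
    return names


def compare_names(fasta_lines, gff_lines):
    """Report names that appear in only one of the two files."""
    genome = fasta_names(fasta_lines)
    annotation = gff_names(gff_lines)
    return {
        "only_in_genome": sorted(genome - annotation),
        "only_in_annotation": sorted(annotation - genome),
    }
```

Both functions accept any iterable of lines, so open file handles can be passed directly. A mismatch such as "chr1" in the genome versus "Chr1" in the annotation shows up in both lists.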
Using these modules, yasma can look for adapter sequences, trim libraries (using cutadapt), and align them to a genome (using shortstack3/4 x bowtie1).
All of these could be run manually, but alignment with shortstack is essential as the annotation looks for readgroup information in shortstack's bam format.
The main annotation module is called tradeoff, due to its threshold finding with a read vs genome tradeoff. This analysis should work on any shortstack bam/cram alignment.
There are several options, but an essential one specified here is -r, --annotation_readgroups. This allows you to base your annotation on a smaller group of libraries from your whole alignment. This is really useful when working with a large analysis with multiple replicate-groups, where you might only want to annotate sRNAs in one of them (e.g. wt replicates among many mutants).
Many outputs are produced from this step, with some described here:
- loci.gff3 and loci.txt - the core annotation output, identifying loci and their dimensions (in gff and tabular formats).
- coverage.bw and kernel.bw - track files associated with the sRNA alignment coverage and padded_coverage by max() (kernel).
- regions.gff3 and revised_regions.gff3 - annotation files of distinct sRNA regions based off the padded_coverage, and the revision of those regions including nearby similar sRNAs.
- thresholds.txt - a table showing the percent of the genome retrieved and reads annotated for each threshold in sRNA abundance.
- reads.txt - a breakdown of the aligned reads making up the top 30% of a locus's total expression. This is useful to quickly get constituent sequences in complex loci.
These are secondary calculations that will be done on a tradeoff annotation.
yasma.py count
This produces a file of counts for every locus. This is broken out into a separate module because yasma does this very thoroughly, producing a simple count matrix (useful for DESeq) and also a large, long-format count table which separates counts by locus, strand, size, and readgroup. This can save some headaches in later analyses.
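The relationship between the two outputs can be shown with a small sketch: summing long-format rows over strand and size gives a locus by readgroup matrix. The field names used here ("locus", "readgroup", "count") are hypothetical and may not match the actual column headers that yasma writes.

```python
from collections import defaultdict


def pivot_counts(long_rows):
    """Sum long-format rows into a locus x readgroup matrix (nested dict).

    Each row is a dict with at least 'locus', 'readgroup', and 'count' keys.
    These key names are assumptions for illustration.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for row in long_rows:
        matrix[row["locus"]][row["readgroup"]] += row["count"]
    # Convert to plain dicts so the result is easy to print or serialize.
    return {locus: dict(groups) for locus, groups in matrix.items()}
```

Rows for the same locus and readgroup that differ only in strand or size are added together, which is the shape a differential-expression tool typically expects.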
yasma.py context
Compares loci locations with a provided NCBI-formatted .gff3 file. Gives overlaps and nearby genes for each sRNA locus.
yasma.py jbrowse
This module was made to take some of the headache out of making nicely-formatted jbrowse-ready coverage and annotation maps. This produces .bw files for all specified sizes and strands of sRNAs, which can then be plotted in the same track with provided configuration code. If provided with a -j, --jbrowse_directory, this will automatically look for a config file, update it, and copy all relevant files to a directory based on the genome name.
yasma.py hairpin
This tool is still in development, but it is meant to evaluate all loci for the possibility that they are derived from an RNA hairpin, rather than RDR-dsRNA. This is not finalized, but it generally looks for stranded regions and folds them, analyzing their profile based on a battery of rules from multiple publications.
We love shortstack here. Considering it is essential for the alignment of our data, we also include wrappers for ShortStack annotation built into the directory organization of this tool. These are useful for easily comparing annotations. They require that ShortStack3 or ShortStack4 are executable from the command line (these are not their normal names: "ShortStack").
utilities includes several other functions, most of which have been used primarily for testing. These are probably not relevant to wider use as of now.