latch-verified / bulk-rnaseq

Bulk RNA-seq analysis.

Latch Verified

Bulk RNA-seq

Produce transcript/count matrices from sequencing reads.

Hosted Interface · SDK Documentation · Slack Community

Workflow Anatomy

Disclaimer

This workflow assumes that your sequencing reads were derived from short-read cDNA sequencing (as opposed to long-read cDNA or direct RNA sequencing). If in doubt, you can likely make the same assumption, as it is by far the most common form of "RNA-sequencing".

Brief Summary of RNA-seq

This workflow ingests short-read sequencing files (in FastQ format) that came from the following sequence of steps [1]:

  • RNA extraction from sample
  • cDNA synthesis from extracted RNA
  • adaptor ligation / library prep
  • (likely) PCR amplification of library
  • sequencing of library

You will likely end up with one or more FastQ files from this process that hold the sequencing reads in raw text form. This will be the starting point of our workflow.

(If you have a .bcl file, this holds the raw output of a sequencing machine. There are external tools that can convert these files to FastQ format, which you will need to do before you can proceed.)
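To make the "raw text form" concrete, here is a minimal sketch of reading FastQ records in Python. It assumes an uncompressed file with the common four-lines-per-read layout (header, sequence, separator, quality); the `FastqRecord` type and `read_fastq` function are hypothetical names used for illustration and are not part of this workflow.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # read identifier (header line without the leading "@")
    sequence: str  # base calls
    quality: str   # ASCII-encoded per-base quality string


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an uncompressed, four-line-per-read FastQ stream."""
    while True:
        header = handle.readline()
        if header == "":
            return  # end of file
        if header.strip() == "":
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FastQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Two made-up reads, held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII#I\n")
records = list(read_fastq(example))
```

For a real file, the same function could be called on `open(path)`; gzip-compressed files would need `gzip.open(path, "rt")` instead.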

Quality Control

As a pre-processing step, it's important to check the quality of your sequencing files. FastQC is the industry staple for generating a report of useful summary statistics [2] and is available if you double-click on a file on the LatchBio platform.

The following are the most useful of these statistics:

  • Per base sequence quality gives the distribution of base-call quality scores at each position along the length of the read
  • Sequence duplication levels reveals the proportion of duplicated reads, which can indicate degraded RNA samples or aggressive PCR cycling [1]
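The two statistics above can be approximated in a few lines of Python. This is a simplified sketch, not FastQC's implementation: it assumes Phred+33 quality encoding, reports only the mean score per position rather than the full distribution, and counts only exact duplicate sequences. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def per_base_mean_quality(qualities: Sequence[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (assumes Phred+33 encoding)."""
    max_len = max((len(q) for q in qualities), default=0)
    totals = [0] * max_len
    counts = [0] * max_len
    for qual in qualities:
        for pos, char in enumerate(qual):
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def duplication_fraction(sequences: Sequence[str]) -> float:
    """Fraction of reads that are exact copies of another read already counted."""
    if not sequences:
        return 0.0
    unique = len(Counter(sequences))
    return 1.0 - unique / len(sequences)


# Made-up data: "I" encodes Phred 40 and "#" encodes Phred 2 under Phred+33.
mean_quality = per_base_mean_quality(["II#I", "IIII"])
dup_fraction = duplication_fraction(["ACGT", "ACGT", "GGCA", "TTTT"])
```

A drop in mean quality toward the 3' end of the read, or a high duplication fraction, are the kinds of signals the FastQC report surfaces graphically.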

For a full breakdown of the values and their interpretation, we refer the reader to this tutorial.

Trimming

Short-read sequencing introduces adapters, short sequences attached to the 5' and 3' ends of cDNA fragments, that are present as artifacts in our FastQ files and must be removed.

We have yet to identify a comprehensive review that benchmarks the various trimming tools on both accuracy and speed. Until we are able to run such a benchmark ourselves, we have selected TrimGalore, which is trusted by researchers we work with at UCSF and Stanford.
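As an illustration of what adapter removal does to a single read, the sketch below trims a 3' adapter using exact string matching only. It is not how TrimGalore works internally (TrimGalore wraps Cutadapt, which tolerates mismatches and also performs quality trimming); the function name, the `min_overlap` parameter, and the example reads are assumptions made for this example.

```python
from typing import Tuple


def trim_3prime_adapter(
    sequence: str, quality: str, adapter: str, min_overlap: int = 3
) -> Tuple[str, str]:
    """Remove an exact-match 3' adapter (and everything after it) from a read."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    cut = sequence.find(adapter)
    if cut == -1:
        # Case 2: only the start of the adapter fits at the end of the read.
        longest = min(len(adapter), len(sequence)) - 1
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, quality  # no adapter found; read is unchanged
    # Trim the quality string in step so both stay the same length.
    return sequence[:cut], quality[:cut]


# Example adapter: the start of the standard Illumina adapter sequence.
ADAPTER = "AGATCGGAAGAGC"
full = trim_3prime_adapter("ACGTACGTAGATCGGAAGAGCTT", "I" * 23, ADAPTER)
partial = trim_3prime_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
clean = trim_3prime_adapter("ACGTACGT", "I" * 8, ADAPTER)
```

The `min_overlap` threshold is a trade-off: a small value removes more partial adapters but also clips a few bases from reads that merely happen to end with the first letters of the adapter.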

Alignment

Alignment is the process of assigning a sequencing read a location on a reference genome or transcriptome. It is the most computationally expensive step of the workflow, requiring a comparison against the entire reference sequence for each of millions of reads.

Transcript alignment was initially performed much like genomic alignment, using tools like Bowtie2 to rigorously recover reference coordinates for each read. In the years that followed, this approach gave way to a lighter "pseudo-alignment" that assigns each read to a transcript rather than to an exact location, saving time and resources. However, while these methods are faster, they have proven to be less accurate. [3]
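The idea of assigning a read to transcripts without computing exact coordinates can be shown with a toy k-mer lookup. This is only a sketch of the general concept, written for this README: it is not the algorithm used by any particular tool, the transcript sequences are made up, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


def build_kmer_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer to the set of transcripts that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Intersect the transcript sets of all indexed k-mers found in the read."""
    compatible: Optional[Set[str]] = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # toy simplification: ignore k-mers missing from the index
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two made-up transcripts that share a prefix but differ at the 3' end.
toy_transcripts = {"tx_A": "ACGTACGTTGCA", "tx_B": "ACGTACGAAGGT"}
toy_index = build_kmer_index(toy_transcripts, k=5)
unique_hit = compatible_transcripts("CGTTGCA", toy_index, k=5)
shared_hit = compatible_transcripts("ACGTACG", toy_index, k=5)
no_hit = compatible_transcripts("TTTTTTT", toy_index, k=5)
```

Note that the read from the shared prefix is compatible with both transcripts and no position is ever reported; resolving such ambiguity is left to the downstream abundance estimation.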

In 2020, the Selective Alignment algorithm was introduced; it performs a similarly lightweight read assignment while outperforming traditional alignment methods in accuracy. [3] We use salmon to implement selective alignment.
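For readers who want to try salmon outside the hosted interface, the sketch below assembles a typical paired-end `salmon quant` command. It is an assumption about a reasonable invocation, not necessarily the exact command this workflow runs: all paths are hypothetical, the index is assumed to have been built beforehand with `salmon index`, and `--validateMappings` is the flag that requests selective alignment (recent salmon releases enable it by default).

```python
from typing import List


def salmon_quant_command(
    index_dir: str, read1: str, read2: str, output_dir: str, threads: int = 4
) -> List[str]:
    """Build an argument list for a paired-end `salmon quant` run."""
    return [
        "salmon", "quant",
        "-i", index_dir,       # directory produced earlier by `salmon index`
        "-l", "A",             # let salmon infer the library type
        "-1", read1,           # first mates
        "-2", read2,           # second mates
        "--validateMappings",  # request selective alignment
        "-p", str(threads),    # number of threads
        "-o", output_dir,      # where the quantification output is written
    ]


# Hypothetical file names, for illustration only.
command = salmon_quant_command(
    "salmon_index", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample_quant"
)
```

With salmon installed, the list can be executed with `subprocess.run(command, check=True)`; passing a list rather than a shell string avoids quoting problems with unusual file names.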

Gene Count Quantification

Selective Alignment produces estimates of transcript abundances. Recall that there can be multiple transcripts for any single gene. It is desirable to have estimated gene counts for two reasons:

  1. Gene counts are a more stable measure of transcription.*
  2. Gene counts are more interpretable.

* Stability is loosely defined as consistent correlation with ground truth counts as the available (transcript) annotations begin to drop out. [4]

We use tximport to convert the transcript-level estimates into gene-level counts.
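The core of this conversion is a sum over the transcripts that belong to each gene. The Python sketch below shows only that summation; tximport itself is an R package and does more than this (for example, it can account for transcript length). The function name, the transcript-to-gene mapping, and the counts are made up for illustration.

```python
from collections import defaultdict
from typing import Dict


def summarize_to_gene(
    transcript_counts: Dict[str, float], tx2gene: Dict[str, str]
) -> Dict[str, float]:
    """Sum estimated transcript counts into per-gene counts."""
    gene_counts: Dict[str, float] = defaultdict(float)
    for transcript, count in transcript_counts.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            continue  # transcript missing from the annotation; skipped here
        gene_counts[gene] += count
    return dict(gene_counts)


# Made-up estimated counts for four transcripts, one of them unannotated.
toy_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 3.0}
toy_tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_level = summarize_to_gene(toy_counts, toy_tx2gene)
```

Because reads that are ambiguous between isoforms of the same gene still count toward that gene, the per-gene sums depend less on how the reads were split among transcripts, which is one intuition for the stability noted above.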

Footnotes

  1. Stark, R., Grzelak, M. and Hadfield, J. RNA sequencing: the teenage years. Nature Reviews Genetics (2019). https://doi.org/10.1038/s41576-019-0150-2

  2. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

  3. Srivastava, A., Malik, L., Sarkar, H. et al. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol 21, 239 (2020). https://doi.org/10.1186/s13059-020-02151-8

  4. Soneson C, Love MI and Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences [version 1; peer review: 2 approved]. F1000Research 2015, 4:1521

About

License: MIT License