Continuous Analysis RNA-Seq Differential Expression Analysis Example

This is a sample repository showing a Continuous Analysis Workflow for RNA-Seq analysis. A description of continuous analysis is available as a pre-print.

In this example we follow the workflow described by David Balli and use data generated from Boj et al. (open access). Balli used a similar workflow to the one described by Andrew Mckenzie.

To preform this analysis we use several tools:

Sample Results

The Continuous Analysis process generates several useful artifacts, two of which are (the full experiment is described below):

Change logs/synchronization between code and figures.

The effect of adding an additional organoid derived from pancreatic adenocarcinoma on principal components analysis using Kallisto’s estimated counts.

A volcano plot plotting the p-value vs. the log fold change. Adding an additional organoid derived from pancreatic adenocarcinoma leads to an additional gene being marked as significantly differentially expressed after Benjamini & Hochberg correction.

Complete "audit" logs of the code run: Logs

Description of analysis

We followed a reduced analysis workflow demonstrated by Balli using the SRA files for 8 samples: 2 normal, 3 mP, 3 mT). These samples represent extract to approximately 480 million reads and 150gb of data (FASTQ format). We perform this experiment first with 7 samples (2 normal, 3mP, 2mT) and then add the 8th sample to view the differences.

We perform two preprocessing steps prior to beginning continuous analysis (details/reasoning in continuous analysis configuration section).

Download the samples from the [Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra?term=SRP049959 "SRR1654626", "SRR1654628", "SRR1654633", "SRR1654636", "SRR16546367", “SRR1654639”, "SRR1654637", "SRR1654641", "SRR1654643") Split the .sra into fast q files using the SRA toolkit Download the mouse reference genome assembly

Continuous Analysis Run (script):

Generate a kallisto index file from the reference file and quantify abundances of transcripts from each RNA-Seq sample (run on 28 cores).
The next portion of the analysis is performed from ‘r_script.r’ and follows the workflow described by Balli: Generate the transcripts per million (TPM) matrix.
Create a matrix to specify the group each sample belongs to.
Filter out lowly expressed genes.
Generate a principle component plot
Fit the limma linear model for differential gene expression analysis.
Plot differential expression in the form of a volcano plot.

Feedback

Please feel free to email me - (brettbe) at med.upenn.edu with any feedback or raise a github issue with any comments or questions.

Acknowledgements

We would like to thank David Balli for his post providing the analysis design and significant source code used in this example.

This work is supported by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4552 to C.S.G. as well as the Commonwealth Universal Research Enhancement (CURE) Program grant from the Pennsylvania Department of Health.

rob-p / continuous_analysis_rnaseq

Continuous Analysis RNA-Seq Differential Expression Analysis Example

Sample Results

Description of analysis

Feedback

Acknowledgements

About

Languages