RNA-seq: a step-by-step analysis pipeline.

A step-by-step analysis pipeline for RNA-seq data from the Cebola Lab.

Correspondence: hannah.maude12@imperial.ac.uk

The resources and references used to build this tutorial are found at the bottom, in the resources section.

Run using command line tools (bash):

Pre-alignment quality control (QC)
Align to the reference human genome
Post-alignment QC
Visualisation
Quantify transcripts
Visualise tracks against the reference genome

Run in R:

Differential gene expression (DGE) analysis

Quality metrics: throughout this GitHub repository, icons show where QC measures are obtained. An Excel spreadsheet, into which these metrics can be entered and saved, can be downloaded from this repository.

Programs required: it is recommended that the user has anaconda installed, through which all required programs can be installed. Assuming that anaconda is available, all the required programs can be installed using the following:

#Install the required programs using anaconda
conda create -n RNA-seq

conda install -n RNA-seq -c bioconda fastqc
conda install -n RNA-seq -c bioconda fastp
conda install -n RNA-seq -c bioconda multiqc
conda install -n RNA-seq -c bioconda star
conda install -n RNA-seq -c bioconda samtools
conda install -n RNA-seq -c bioconda deeptools
conda install -n RNA-seq -c bioconda salmon

#For differential expression using DESeq2
conda create -n DEseq2 r-essentials r-base

conda install -n DEseq2 -c bioconda bioconductor-deseq2
conda install -n DEseq2 -c bioconda bioconductor-tximport
conda install -n DEseq2 -c r r-ggplot2

Introduction

This pipeline is compatible with RNA-seq reads generated by Illumina.

Pre-alignment QC

Generate QC report

The raw sequence data should first be assessed for quality. FastQC reports can be generated for all samples to assess sequence quality, GC content, duplication rates, length distribution, K-mer content and adapter contamination. For paired-end reads, run fastqc on both files, with the results output to the current directory:

fastqc <sample>_R1.fastq.gz -d . -o .

fastqc <sample>_R2.fastq.gz -d . -o .

These fastQC reports can be combined into one summary report using multiQC.

To extract the total number of reads from the fastQC report, run the following code (replacing <sample> with your file name).

totalreads=$(unzip -c <sample>_fastqc.zip <sample>_fastqc/fastqc_data.txt | grep 'Total Sequences' | cut -f 2)

echo $totalreads
#This number will be used again later so is saved as a variable 'totalreads'

QC value: input the total number of reads into the QC spreadsheet.
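When there are many samples, the extraction above can be wrapped in a small helper. The following is a minimal sketch, not part of the original pipeline: the function name fastqc_total_reads is hypothetical, and the commented usage assumes FastQC's default archive layout (<name>_fastqc.zip containing <name>_fastqc/fastqc_data.txt).

```shell
# Hypothetical helper: read the contents of fastqc_data.txt on stdin and
# print the value of the tab-separated 'Total Sequences' field.
fastqc_total_reads() {
  awk -F '\t' '$1 == "Total Sequences" { print $2; exit }'
}

# Example usage (assumes FastQC's default archive naming):
# for zip in *_fastqc.zip; do
#   name=$(basename "$zip" .zip)
#   reads=$(unzip -p "$zip" "$name/fastqc_data.txt" | fastqc_total_reads)
#   printf '%s\t%s\n' "$name" "$reads"
# done
```

The loop prints one line per report, which can be pasted into the QC spreadsheet.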

Trimming

Trimming is a useful step of pre-alignment QC, which removes low-quality reads and contaminating adapter sequences (which occur when the read length is longer than the DNA insert).

If there is evidence of adapter contamination shown in the fastQC report (see below), specific adapter sequences can be trimmed. Here, the program fastp is used to trim the data. For paired-end data:

#Change the -l argument to change the minimum read length allowed.
fastp -i <sample>_R1.fastq.gz -I <sample>_R2.fastq.gz -o <sample>_R1.trimmed.fastq.gz -O <sample>_R2.trimmed.fastq.gz --detect_adapter_for_pe -l 25 -j <sample>.fastp.json -h <sample>.fastp.html

For single-end reads: (note the adapter detection is not always as effective for single-end reads, so it is advisable to provide the adapter sequence, here the 'Illumina TruSeq Adapter Read 1'):

fastp -i <sample>.fastq.gz -o <sample>-trimmed.fastq.gz -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -l 25 -j <sample>.fastp.json -h <sample>.fastp.html

An HTML report is generated, including the following information:

Here, fastQC should be repeated to generate reports for the trimmed data, and a second multiqc report generated:

fastqc <sample>_R1.trimmed.fastq.gz -d . -o .

fastqc <sample>_R2.trimmed.fastq.gz -d . -o .

QC value: the number of trimmed reads can be filled in using the fastp report, or by extracting the number of reads from the trimmed fastQC files, as above, and used to fill in the QC spreadsheet.
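It can also be useful to record what fraction of reads survived trimming. The sketch below is an illustration only: percent_retained is a hypothetical helper, and the variable trimmedreads in the usage comment is assumed to have been extracted from the trimmed fastQC report in the same way as totalreads.

```shell
# Hypothetical helper: percentage of reads retained after trimming.
# $1 = number of reads before trimming, $2 = number of reads after trimming
percent_retained() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before <= 0) { print "NA"; exit }      # avoid dividing by zero
    printf "%.2f\n", 100 * after / before
  }'
}

# Example usage:
# percent_retained "$totalreads" "$trimmedreads"
```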

Align to the reference genome

The trimmed RNA-seq data in fastq format will be aligned to the reference genome, along with a reference transcriptome, to output two alignment files: the genome alignment and the transcriptome alignment.

The reads are aligned using the splice-aware aligner STAR. The manual is available here. The reference genome used is the GRCh38 'no-alt' assembly from NCBI, recommended by Heng Li. The genome can be downloaded using wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz. This version of the GRCh38 reference genome excludes alternative contigs, which may cause fragments to map in multiple locations. The downloaded genome should be uncompressed (for example with gunzip) and then indexed with STAR.

Index the reference genome

Set --sjdbOverhang to your maximum read length minus 1 (for example, 99 for 100 bp reads). The indexing also requires a file containing gene annotation, in GTF format. For example, ENCODE provides a GTF file with GRCh38 annotations, containing GENCODE gene coordinates, along with UCSC tRNAs and a PhiX spike-in. Here, we use gencode.v36.annotation.gtf as the most recent gene annotation file. The user should aim to use the most up-to-date reference files, while ensuring that the chromosome naming matches that of the reference genome. For example, UCSC uses the 'chr1, chr2, chr3' naming convention, while ENSEMBL uses '1, 2, 3' etc. The files suggested here are compatible.
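If different reference files are substituted, the naming conventions can be checked before indexing. The following is a rough sketch using only awk; the function name gtf_contigs_missing_from_fasta is hypothetical, and both files are assumed to be uncompressed.

```shell
# Hypothetical helper: print each sequence name used in the GTF that has no
# matching header in the FASTA. No output means the names are compatible.
# $1 = genome FASTA (uncompressed), $2 = annotation GTF (uncompressed)
gtf_contigs_missing_from_fasta() {
  awk 'NR == FNR {                       # first file: the FASTA
         if (/^>/) { sub(/^>/, "", $1); seen[$1] = 1 }
         next
       }
       /^#/ { next }                     # skip GTF comment lines
       !($1 in seen) && !reported[$1]++ { print $1 }' "$1" "$2"
}

# Example usage:
# gtf_contigs_missing_from_fasta GCA_000001405.15_GRCh38_no_alt_analysis_set.fna gencode.v36.annotation.gtf
```

If the command prints names such as '1' or 'MT', the annotation and genome use different conventions and one of them should be replaced.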

GENOMEDIR=/path/to/indexed/genome
READLENGTH=100 #replace with your maximum read length

STAR --runThreadN 4 --runMode genomeGenerate --genomeDir $GENOMEDIR --genomeFastaFiles $GENOMEDIR/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --sjdbGTFfile gencode.v36.annotation.gtf --sjdbOverhang $((READLENGTH - 1))
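If the maximum read length is not known, it can be estimated from the FASTQ file itself. This is a minimal sketch: max_read_length is a hypothetical helper, the file name in the usage comment is a placeholder, and scanning only the first 100,000 reads assumes they are representative of the whole file.

```shell
# Hypothetical helper: read FASTQ records on stdin and print the length of
# the longest sequence line (every 4th line, starting from line 2).
max_read_length() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
       END { print max + 0 }'
}

# Example usage (placeholder file name):
# readlength=$(zcat sample_R1.trimmed.fastq.gz | head -n 400000 | max_read_length)
# overhang=$((readlength - 1))   # value to pass to --sjdbOverhang
```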

Carry out the alignment

STAR can then be run to align the trimmed fastq data to the genome. If the fastq files are in the compressed .gz format, the --readFilesCommand zcat argument is added. The output file should be unsorted, as required for the downstream quantification step using Salmon. The following options are shown according to the ENCODE recommendations.

For paired-end data:

STAR --runThreadN 4 --genomeDir $GENOMEDIR --readFilesIn <sample>_R1.trimmed.fastq.gz <sample>_R2.trimmed.fastq.gz \
--outFileNamePrefix <sample>. --readFilesCommand zcat --outSAMtype BAM Unsorted --quantTranscriptomeBan Singleend --outFilterType BySJout \
--alignSJoverhangMin 8 --outFilterMultimapNmax 20 \
--alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 \
--alignIntronMax 1000000 --alignMatesGapMax 1000000 \
--quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD

For single-end data:

STAR --runThreadN 4 --genomeDir $GENOMEDIR --readFilesIn <sample>-trimmed.fastq.gz \
--outFileNamePrefix <sample>. --readFilesCommand zcat --outSAMtype BAM Unsorted --quantTranscriptomeBan Singleend --outFilterType BySJout \
--alignSJoverhangMin 8 --outFilterMultimapNmax 20 \
--alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 \
--alignIntronMax 1000000 --alignMatesGapMax 1000000 \
--quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD

Hint: each command above must be entered as a single command. The trailing backslashes continue it across lines; alternatively, remove them and place everything on one line.

For compatibility with the Salmon quantification, the --quantMode TranscriptomeSAM option will result in the output of two alignment files, one to the reference genome (Aligned.*.sam/bam) and one to the transcriptome (Aligned.toTranscriptome.out.bam).

Merge files [optional]

At this stage, if samples have been sequenced across multiple lanes, the sample files can be combined using samtools merge. Various QC tools can be used to assess reproducibility and assess lane effects, such as deeptools plotCorrelation. The salmon quantification does not require files to be merged, since multiple bam files can be listed in the command. However, to visualise the RNA-seq data from the combined technical replicates, bam files can be merged at this stage. For example, if your sample was split across lanes 1, 2 and 3 (L001, L002, L003):

samtools merge <sample>-merged.bam <sample>_L001.bam <sample>_L002.bam <sample>_L003.bam

Post-alignment QC

The STAR alignment will have output several files with the following file names:

Aligned.out.bam
Aligned.toTranscriptome.out.bam
Log.final.out
Log.out
Log.progress.out
SJ.out.tab

Two files will be used in downstream analysis, the Aligned.out.bam for generating genome browser bigWig tracks and the Aligned.toTranscriptome.out.bam for quantification and differential gene expression analysis. First, the Aligned.out.bam will be assessed for quality and processed to generate bigWig tracks.

Generate QC reports using qualimap and samtools

Qualimap will be run on the Aligned.out.bam file (or <sample>-merged.bam if you have merged data).

Qualimap provides several measures of quality, including how many reads have aligned to exons vs non-coding intergenic regions. To do this, qualimap requires an annotation file containing the locations of coding regions. The transcript annotation file, gencode.v36.annotation.gtf, can be downloaded from GENCODE. (Note: the most recent annotation file should be used.)

#Sort the output bam file. The suffix of the .bam input file may be .gzAligned.out.bam, or -merged.bam. Edit this code to include the appropriate file name.
samtools sort <sample>.bam > <sample>-sorted.bam
samtools index <sample>-sorted.bam

samtools flagstat <sample>-sorted.bam > <sample>-sorted.flagstat

#Run qualimap to generate QC reports
qualimap bamqc -bam <sample>-sorted.bam -gff gencode.v36.annotation.gtf -outdir <sample>-bamqc-qualimap-report --java-mem-size=16G

qualimap rnaseq -bam <sample>-sorted.bam -gtf gencode.v36.annotation.gtf -outdir <sample>-rnaseq-qualimap-report --java-mem-size=16G

QC value: the percentage of reads aligned to exons can be extracted as follows:

cat <sample>-rnaseq-qualimap-report/rnaseq_qc_results.txt | grep exonic | cut -d '(' -f 2 | cut -d ')' -f1

QC value: the number of aligned reads and aligned reads which were properly paired can be extracted as follows:

#The total number of reads mapped
cat <sample>-sorted.flagstat | grep mapped | head -n1 | cut -d ' ' -f1

#The total number of properly paired reads
cat <sample>-sorted.flagstat | grep 'properly paired' | head -n1 | cut -d ' ' -f1
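The same values can be collected in one pass, together with the percentage of reads mapped. The sketch below is an illustration only: flagstat_summary is a hypothetical helper, and it assumes the standard samtools flagstat text layout, in which each line starts with '<count> + <count>' followed by a description.

```shell
# Hypothetical helper: read samtools flagstat text on stdin and print
# total, mapped and properly paired counts plus the percentage mapped,
# separated by tabs.
flagstat_summary() {
  awk '$4 == "in" && $5 == "total"        { total  = $1 }
       $4 == "mapped"                     { mapped = $1 }
       $4 == "properly" && $5 == "paired" { paired = $1 }
       END {
         pct = (total > 0) ? 100 * mapped / total : 0
         printf "%s\t%s\t%s\t%.2f\n", total + 0, mapped + 0, paired + 0, pct
       }'
}

# Example usage (placeholder file name):
# flagstat_summary < sample-sorted.flagstat
```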

A combined qualimap report

Qualimap multi-bamqc can then run QC on combined samples and replicates. This includes principal component analysis (PCA) to confirm whether technical and/or biological replicates cluster together. A text file (samples.txt) should be created with three columns, the first with the sample ID, the second with the full path to the bamqc results and the third with the group names.

Note, some versions of qualimap require the raw_data_qualimapReport directory to be renamed to raw_data.

qualimap multi-bamqc -d samples.txt
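The three-column file described above can be written by hand, or generated with a short helper such as the sketch below. The function name make_multibamqc_table and the sample IDs in the usage comment are hypothetical; the results directories are assumed to follow the <sample>-bamqc-qualimap-report naming used earlier and to sit in the current directory.

```shell
# Hypothetical helper: print one tab-separated line per sample with the
# sample ID, the full path to its bamqc results and the group name.
# $1 = group name; remaining arguments = sample IDs in that group
make_multibamqc_table() (
  group=$1
  shift
  for s in "$@"; do
    printf '%s\t%s\t%s\n' "$s" "$PWD/${s}-bamqc-qualimap-report" "$group"
  done
)

# Example usage with hypothetical sample IDs:
# { make_multibamqc_table control ctrl_1 ctrl_2
#   make_multibamqc_table treated trt_1 trt_2; } > samples.txt
```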

The QC reports can be combined using multiqc, an excellent tool for combining QC reports of multiple samples into one. Example outputs of qualimap/multiqc include the alignment positions:

Remove duplicates?

It is generally recommended not to remove duplicates when working with RNA-seq data, unless using UMIs (unique molecular identifiers) (Klepikova et al. 2017). This is because many reads are likely to be natural duplicates of each other, for example those originating from genes with a shared sequence in a common domain. Typically, removing duplicates does more harm than good. It is more or less impossible to remove duplicates from single-end data, and research has also suggested it may cause false negatives when applied to paired-end data. See more in this useful blog post. Generally, duplicates are not a problem so long as the library complexity is high.

Visualisation

Compute GC bias

GC-bias describes the bias in sequencing depth depending on the GC-content of the DNA sequence. Bias in DNA fragments, due to the GC-content and start-and-end sequences, may be increased due to preferential PCR amplification (Benjamini and Speed, 2012). A high rate of PCR duplications, for example when library complexity is low, may cause a significant GC-bias due to the preferential amplification of specific DNA fragments. This can significantly impact transcript abundance estimates. Bias in RNA-seq is explained in a handy blog and video by Mike Love.

It is crucial to correct GC-bias when comparing groups of samples which may have variable GC content dependence, for example when samples were processed in different libraries. Salmon, used later to generate read counts for quantification, has its own in-built method to correct for GC-bias.

When generating bedGraph or BigWig files for visualisation, the user may opt to correct GC-bias so that coverage is corrected and appears more uniform. The deeptools suite includes tools to calculate GC bias and correct for it.

The reference genome file should be converted to .2bit format using faToTwoBit. The effective genome size can be calculated using faCount available here. Set the -l argument to your fragment length.
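As an alternative to faCount, one common definition of the effective genome size, the number of non-N bases, can be counted with standard tools. This is a rough sketch rather than the method used to obtain the value in the commands here; count_non_n_bases is a hypothetical helper and the FASTA file is assumed to be uncompressed.

```shell
# Hypothetical helper: count the bases in a FASTA file that are not N.
# $1 = uncompressed FASTA file
count_non_n_bases() {
  grep -v '^>' "$1" |      # drop the header lines
    tr -d 'Nn\n\r' |       # drop N bases and line breaks
    wc -c |                # count what remains
    tr -d ' '              # strip padding added by some wc versions
}

# Example usage:
# count_non_n_bases GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
```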

The input bam file requires an index, which can be generated using samtools index.

computeGCBias -b <sample>-sorted.bam --effectiveGenomeSize 3099922541 -g GCA_000001405.15_GRCh38_no_alt_analysis_set.2bit -l 100 --GCbiasFrequenciesFile <sample>.freq.txt --biasPlot <sample>.biasPlot.pdf

The bias plot format can be changed to png, eps, plotly or svg. If there is significant evidence of a GC bias, this can be corrected using correctGCBias. An example of GC bias can be seen in the plot output from computeGCBias below:

Correct the GC-bias using correctGCBias. This tool effectively removes reads from regions with greater-than-expected coverage (GC-rich regions) and adds reads to regions with less-than-expected coverage (AT-rich regions). The methods are described by Benjamini and Speed [2012]. The following code can be used:

#Further options can be appended to the end of this command if required
correctGCBias -b <sample>-sorted.bam --effectiveGenomeSize 3099922541 -g GCA_000001405.15_GRCh38_no_alt_analysis_set.2bit --GCbiasFrequenciesFile <sample>.freq.txt -o <sample>.gc_corrected.bam

NOTE: When calculating the GC-bias for ChIP-seq, ATAC-seq, DNase-seq (and CUT&Tag/CUT&Run) it is recommended to filter out problematic regions. These include those with low mappability and high numbers of repeats. The compiled list of ENCODE blacklist regions should be excluded. However, the ENCODE blacklist regions have little overlap with coding regions and this step is not necessary for RNA-seq data (Amemiya et al, 2019).

Generate bigwig files

The bam file aligned to the genome should be converted to bigWig format, which can be uploaded to genome browsers and viewed as a track. First, the bam file aligned to the reference genome may be assessed and corrected for GC bias, to achieve a more even coverage.

Coverage is normalised during conversion using BPM (bins per million mapped reads), which deepTools describes as analogous to TPM.

bamCoverage -b <sample>.gc_corrected.bam -o <sample>.bw --normalizeUsing BPM --samFlagExclude 512

There are multiple methods available for normalisation. Recent analysis by Abrams et al. (2019) advocated TPM as the most effective method.

Check correlation of technical and biological replicates

The correlation between bam files of biological and/or technical replicates can be calculated as a QC step to ensure that the expected replicates positively correlate. Deeptools multiBamSummary and plotCorrelation are useful tools for further investigation.
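A possible starting point for this check is sketched below. The helper correlation_commands is hypothetical and only prints the two deepTools commands so they can be inspected first; the choices of Spearman correlation and a heatmap are assumptions, not recommendations from this tutorial, and file names containing spaces are not handled.

```shell
# Hypothetical helper: print (but do not run) the deepTools commands that
# summarise read coverage across BAM files and plot their correlation.
# $1 = output prefix; remaining arguments = sorted, indexed BAM files
correlation_commands() (
  prefix=$1
  shift
  printf 'multiBamSummary bins --bamfiles %s -o %s.npz\n' "$*" "$prefix"
  printf 'plotCorrelation -in %s.npz --corMethod spearman --whatToPlot heatmap -o %s.correlation.pdf\n' "$prefix" "$prefix"
)

# Example usage with placeholder file names; pipe to sh to run the commands:
# correlation_commands replicates rep1-sorted.bam rep2-sorted.bam | sh
```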

Quantification

The bam file previously aligned to the transcriptome by STAR will next be input into Salmon in alignment-mode, in order to generate a matrix of gene counts. The Salmon documentation is available here.

Generate transcriptome

Salmon requires a transcriptome to be generated from the genome fasta and annotation gtf files used earlier with STAR. This can be generated using gffread (source package available for download here).

gffread -w GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa -g GCA_000001405.15_GRCh38_no_alt_analysis_set.fna gencode.v36.annotation.gtf

Run Salmon

Salmon is used here with the variational Bayesian expectation maximisation (VBEM) algorithm for quantification. Quantification is described in the 2020 paper by Deschamps-Francoeur et al., which describes the handling of multi-mapped reads in RNA-seq data. Duplicated sequences such as pseudogenes can cause reads to align to multiple positions in the genome. Where transcripts have exons which are similar to other genomic sequences, the VBEM approach attributes reads to the most likely transcript. Technical replicates can also be combined by providing the Salmon -a argument with a list of bam files, with the file names separated by a space. This may not work on all queue systems; a common error is a segmentation fault (core dump). Here, Salmon is run without any normalisation, on each technical replicate; samples are combined and normalised in the next steps.

For paired-end data:

salmon quant -t GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa --libType A -a <sample>.Aligned.toTranscriptome.out.bam -o <sample>.salmon_quant --gcBias --seqBias

For single-end data:

If using single end data, add the --fldMean and --fldSD parameters to include the mean and standard deviation of the fragment lengths. If listing multiple files to be combined, the library type will need to be specified, as Salmon cannot determine it automatically (see the Salmon documentation for more information).

salmon quant -t GRCh38_no_alt_analysis_set_gencode.v36.transcripts.fa --libType ?? -a <sample>.Aligned.toTranscriptome.out.bam -o <sample>.salmon_quant  --fldMean ?? --fldSD ?? --gcBias --seqBias

Differential expression

All following code should be run in R.

The differential expression analysis contains the following steps:

Import count data
Import data to DEseq2
Differential gene expression
QC plots

Following these steps, functional analysis will be carried out to investigate differential expression of biological pathways. In this analysis, GC-normalised counts from Salmon will be input into DESeq2, which will run the standard DESeq2 normalisation. Optionally, normalisation can be carried out using cqn to correct for sample-specific biases (described at the end of this page). If cqn is the method of choice, Salmon should be run without the --gcBias flag.

To install the required packages:

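if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

#BiocManager::install("cqn") #optional for cqn normalisation
BiocManager::install("DESeq2")
BiocManager::install("tximport")
BiocManager::install("biomaRt")
BiocManager::install("apeglm") #used below for log-fold-change shrinkage
BiocManager::install("limma") #used below to remove batch effects before PCA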

Import count data

The outputs from Salmon are TPM values (the 'abundance', transcripts per million) and estimated counts mapped to transcripts. The counts will be combined to gene-level estimates in R. The output files from Salmon, quant.sf, will be imported into R using tximport (described in detail here by Love, Soneson & Robinson). This will require a list of sample IDs as well as a file containing transcript to gene ID mappings, in order to convert the transcriptome alignment to gene-level counts.

Create a matrix containing the sample IDs. The matrix should have at least three columns: the first with the sample IDs, the second with the path to the salmon quant.sf files, and the third with the group (e.g. treatment or sample). This can be generated in excel, for example, and saved as a tab-delimited txt file called samples.txt.
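As an alternative to building this file in a spreadsheet, the first two columns can be generated from the Salmon output folders on the command line. This is a minimal sketch: list_salmon_outputs is a hypothetical helper, the folders are assumed to follow the <sample>.salmon_quant naming used above, and the third column is filled with the literal placeholder GROUP, which must be edited by hand.

```shell
# Hypothetical helper: print sample ID, path to quant.sf and a placeholder
# group name (GROUP) for every <sample>.salmon_quant folder in a directory.
# $1 = directory containing the Salmon output folders
list_salmon_outputs() (
  for q in "$1"/*.salmon_quant/quant.sf; do
    [ -e "$q" ] || continue                      # no matching folders
    id=$(basename "$(dirname "$q")" .salmon_quant)
    printf '%s\t%s\t%s\n' "$id" "$q" "GROUP"
  done
)

# Example usage:
# list_salmon_outputs . > samples.txt   # then replace GROUP with the real groups
```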

#Read in the files with the sample information
samples = read.table('samples.txt')

Read in the transcript to gene ID file provided in this repository (generated from gencode v36).

#Read in the gene/transcript IDs 
tx2gene = read.table('tx2gene_gencodev36-unique.txt', sep = '\t')

Read in the count data using tximport. This will combine the transcript-level counts to gene-level.

library(tximport)

#Column 2 of samples, samples[,2], contains the paths to the quant.sf files
counts.imported = tximport(files = as.character(samples[,2]), type = 'salmon', tx2gene = tx2gene)

To use cqn normalisation, see the optional description at the end. Otherwise, the default DESeq2 normalisation will be used.

Import data to DEseq2

An excellent tutorial on how DEseq2 works, including how differential expression is calculated and how dispersion is estimated, is provided in this hbctraining lesson and in the DEseq2 vignette.

The counts information will be input into DEseq2. A data-frame called colData should be generated. The rownames will be the unique sample IDs, while the columns should contain the conditions being tested for differential expression, in addition to any effects to be controlled for. In the example below, the column called condition contains the treatment, while the column batch contains the donor ID. Other covariates, such as age, could also be added.

The design, as shown below, should read ~ batch + condition, where batch is an effect to be controlled for and condition is the condition to be tested, such as treated vs untreated or disease vs healthy. batch and condition (or your own variables with your preferred names), should be columns in colData.

#Import to DEseq2
library(DESeq2)
counts.DEseq = DESeqDataSetFromTximport(counts.imported, colData = colData, design = ~batch + condition)

dds <- DESeq(counts.DEseq)
resultsNames(dds) #lists the coefficients

plotDispEsts(dds)

#Add the normalisation offset from cqn
#normalizationFactors(dds) <- cqnNormFactors

Differential gene expression

There are several models available to calculate differential gene expression. Here, the apeglm shrinkage method will be applied to shrink high log-fold changes with little statistical evidence and account for lowly expressed genes with significant deviation. This hbctraining tutorial describes the DEseq2 model fitting and hypothesis testing.

library(apeglm)

#List the names of the coefficients and choose your comparison
resultsNames(dds)

#Substitute the '????' with a comparison, selected from the resultsNames(dds) shown above
LFC <- lfcShrink(dds, coef = "????", type = "apeglm")

The LFC data frame contains the log2 fold-change, as well as the p-value and adjusted p-value:

Following quality control analysis, we will explore the data to check the numbers of differentially expressed genes (DEGs), the top DEGs and pathways of differential expression.

QC plots

Before moving on to functional analysis, such as gene set enrichment analysis, quality control should be carried out on the differential expression analyses. The types of plots which will be generated below are:

Principal component analysis - sample clustering
Biological replicate correlation
MA plot
p-value distribution
Volcano plot

Principal component analysis

A common component of analysing RNA-seq data is to carry out QC by testing if expected samples cluster together. One popular tool is principal component analysis (PCA) (the following steps are adapted from a hbctraining tutorial on clustering). Useful resources include this blog post by Zakaria Jaadi and a video on PCA by StatQuest.

If you have few samples:

rld <- rlog(dds, blind = TRUE)
rld_mat <- assay(rld)
pca <- prcomp(t(rld_mat))

If you have more samples (e.g. >20), the vst transformation will be faster:

vst.r <- vst(dds, blind = TRUE)
vst_mat <- assay(vst.r)
pca <- prcomp(t(vst_mat))

The results can be plotted using ggplot2. Several examples are provided below:

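library(ggplot2)

#Define a plain plot theme, reused in the plots below
theme <- theme(panel.background = element_blank(), panel.border = element_rect(fill = NA),
             plot.title = element_text(hjust = 0.5))

z = plotPCA(vst.r, "condition")
nudge <- position_nudge(y = 2,x=6)
z + geom_text(aes(label = name), position = nudge) + theme

#plotPCA from DEseq2 uses the top 500 most variable genes by default:
data = plotPCA(rld, intgroup = c("condition", "batch"), returnData = TRUE)
p <- ggplot(data, aes(x = PC1, y = PC2, color = condition ))
p <- p + geom_point() + theme
print(p)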

#Alternatively, PCA can be carried out using all genes:
df_out <- as.data.frame(pca$x)
df_out$group <- samples[,3]

#Include the next two lines to add the percentage of variance explained by each PC to the axis labels
#percentage <- round(pca$sdev^2 / sum(pca$sdev^2) * 100, 2)
#percentage <- paste(colnames(df_out), paste0(" (", as.character(percentage), "%", ")"), sep="")

p <- ggplot(df_out, aes(x = PC1, y = PC2, color = group))
p <- p + geom_point() + theme #+ xlab(percentage[1]) + ylab(percentage[2])

print(p)

To generate the PCA plot with any batch effects removed:

#Batch effect (donor) removed
assay(vst.r) <- limma::removeBatchEffect(assay(vst.r), vst.r$batch)

z=plotPCA(vst.r, "condition")
nudge <- position_nudge(y = 1,x=4)
z + geom_text(size=2.5, aes(label = name), position = nudge) + theme 

#An example with no labels
z=plotPCA(vst.r, "condition")
z + theme

MA plot

An MA plot is a scatter plot of the log fold-change between two samples against the average gene expression (mean of normalised counts). An MA plot can be generated using the following command from DEseq2:

#Add a title to reflect your comparison 
plotMA(LFC, main = '???', cex = 0.5)

Distribution of p-values and FDRs

The distribution of p-values following a differential expression analysis can be an indication of whether there is an enrichment of differentially expressed genes and whether the statistical test is correct, i.e. has the correct assumptions.

#The distribution of p-values
hist(LFC$pvalue, breaks = 50, col = 'grey', main = '???', xlab = 'p-value')

#The false-discovery rate distribution
hist(LFC$padj, breaks = 50, col = 'grey', main = '???', xlab = 'Adjusted p-value')

The p-value distribution:

The false discovery rate (FDR) distribution:

Volcano plots

A volcano plot is a scatterplot which plots the p-value of differential expression against the fold-change. The volcano plot can be designed to highlight datapoints of significant genes, with a p-value and fold-change cut off.

Volcano plots are generated as described by Ignacio González

#Allow for more space around the borders of the plot
par(mar = c(5, 4, 4, 4))

#Set your log-fold-change and p-value thresholds
lfc = 2
pval = 0.05

tab = data.frame(logFC = LFC$log2FoldChange, negLogPval = -log10(LFC$padj)) #make a data frame with the log2 fold-changes and adjusted p-values

plot(tab, pch = 16, cex = 0.4, xlab = expression(log[2]~fold~change),
     ylab = expression(-log[10]~adjusted~pvalue), main = '???') #replace main = with your title

#Genes with an absolute log2 fold-change greater than 2 and adjusted p-value<0.05 (which() drops genes with an NA adjusted p-value):
signGenes = which(abs(tab$logFC) > lfc & tab$negLogPval > -log10(pval))

#Colour these red
points(tab[signGenes, ], pch = 16, cex = 0.5, col = "red")

#Show the cut-off lines
abline(h = -log10(pval), col = "green3", lty = 2)
abline(v = c(-lfc, lfc), col = "blue", lty = 2)

mtext(paste("FDR =", pval), side = 4, at = -log10(pval), cex = 0.6, line = 0.5, las = 1)
mtext(c(paste("-", lfc, "fold"), paste("+", lfc, "fold")), side = 3, at = c(-lfc, lfc),
      cex = 0.6, line = 0.5)

The resulting plot will look like this:

Data exploration

How many genes are differentially expressed? What are the top DEGs? How do I plot the expression for candidate genes?

How many genes are differentially expressed?

#Attach the results so that the columns (padj, log2FoldChange) can be referenced directly
attach(as.data.frame(LFC))

#A summary of the DEGs with an adjusted p-value<0.05
summary(LFC, alpha=0.05)

#The total number of DEGs with an adjusted p-value<0.05 AND absolute log2 fold-change > 2
sum(!is.na(padj) & padj < 0.05 & abs(log2FoldChange) >2)

#Decreased expression:
sum(!is.na(padj) & padj < 0.05 & log2FoldChange <0) #any fold-change
sum(!is.na(padj) & padj < 0.05 & log2FoldChange <(-2)) #log2 fold-change below -2

#Increased expression:
sum(!is.na(padj) & padj < 0.05 & log2FoldChange >0) #any fold-change
sum(!is.na(padj) & padj < 0.05 & log2FoldChange >2) #log2 fold-change above 2

What are the top genes?

#At this stage it may be useful to create a copy of the results with the gene version removed from the gene name, to make it easier for you to search for the gene name etc. 
#The rownames currently appear as 'ENSG00000175197.12, ENSG00000128272.15' etc.
#To change them to 'ENSG00000175197, ENSG00000128272'
LFC.gene = as.data.frame(LFC)

#Some gene names are repeated if they are in the PAR region of the Y chromosome. Since dataframes cannot have duplicate row names, we will leave these gene names as they are and rename the rest.
whichgenes = which(!grepl('PAR', rownames(LFC.gene)))
rownames(LFC.gene)[whichgenes] = unlist(lapply(strsplit(rownames(LFC.gene)[whichgenes], '\\.'), '[[',1))
 
#Subset the significant genes
LFC.sig = LFC.gene[!is.na(LFC.gene$padj) & LFC.gene$padj < 0.05,]

#We can add a column with the HGNC gene names
library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

converted <- getBM(attributes=c('hgnc_symbol','ensembl_gene_id'), filters = 'ensembl_gene_id',
                 values = rownames(LFC.sig), mart = ensembl)

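#Add gene names to the LFC.sig data-frame, matching on the Ensembl gene ID (unmatched genes are set to NA)
LFC.sig$hgnc = converted$hgnc_symbol[match(rownames(LFC.sig), converted$ensembl_gene_id)]

#View the top 10 genes with the most significant (adjusted) p-values
head(LFC.sig[order(LFC.sig$padj),], n = 10)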

#The largest fold-changes with a significant p-value
LFC.sig[order(abs(LFC.sig$log2FoldChange), decreasing = TRUE),][1:10,] #add the [1:10,] to see the top 10 rows

Can I plot the expression for the top genes?

#Select your chosen gene 
tmp = plotCounts(dds, gene = grep('ENSG00000000003', names(dds), value = TRUE), intgroup = "condition", pch = 18, main = '??? expression', returnData = TRUE)

theme <- theme(panel.background = element_blank(), panel.border = element_rect(fill = NA),
             plot.title = element_text(hjust = 0.5))

p <- ggplot(tmp, aes(x = condition, y = count)) + geom_boxplot() + 
  geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize = 0.6) + ggtitle('??? expression') + theme

print(p)

Functional analysis

Functional analysis can further investigate the differential expression of each gene. Pathway analysis is a popular approach with which to investigate the differential expression of pathways, including genes with similar biological functions. This can be achieved using gene set analysis (GSA). There are many flavours of GSA. They can be categorised as shown below by Das et al. (2020).

Here, we will use one gene annotation approach and one gene set enrichment analysis (GSEA) approach.

GoSeq - gene annotation

GoSeq, developed by Young et al. (2010), tests for the enrichment of Gene Ontology terms.

BiocManager::install("goseq")
library(goseq)

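#Extract the differential expression data, with false discovery rate correction
#Note: topTags is an edgeR function, so ql.groups12 must be an edgeR test result and library(edgeR) must be loaded.
#For the DESeq2 results above, as.data.frame(LFC) can be used instead, with the padj and log2FoldChange columns in place of FDR and logFC.
groups12.table <- as.data.frame(topTags(ql.groups12, n = Inf))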

#Remove the version numbers from the ENSEMBL gene IDs
rownames(groups12.table) <- unlist(lapply(strsplit(rownames(groups12.table), '\\.'), `[[`, 1))

The genes can be separated into those which show significantly increased expression and those which show significantly decreased expression. Here, the FDR threshold is set to 0.05. A minimum fold-change can also be defined.

#Decreased expression
ql53.DEGs.down <- groups12.table$FDR < 0.05 & groups12.table$logFC<0
names(ql53.DEGs.down) <- rownames(groups12.table)
pwf.dn <- nullp(ql53.DEGs.down, "hg19", "ensGene")
go.results.dn <- goseq(pwf.dn, "hg19", "ensGene")

#Increased expression
ql53.DEGs.up <- groups12.table$FDR < 0.05 & groups12.table$logFC>0
names(ql53.DEGs.up) <- rownames(groups12.table)
pwf.up <- nullp(ql53.DEGs.up, "hg19","ensGene")
go.results.up <- goseq(pwf.up, "hg19","ensGene")

The go.results.up dataframe looks like this:

Significant results can be saved...

write.table(go.results.up[go.results.up$over_represented_pvalue<0.05,1:2], 'p53-GO-up0.05.txt', quote=FALSE, sep='\t', row.names=FALSE, col.names=FALSE)

...and uploaded to the REVIGO tool which collapses and summarises redundant GO terms. Copy the contents of the p53-GO-up0.05.txt into the REVIGO box:

After running, select the 'Scatterplot & Table' tab and scroll down to 'export results to text table (csv)':

Save the downloaded file as REVIGO-UP.csv. The package ggplot2 can be used here to visualise the log10 p-values for GO term enrichment. To view the top 20 terms:

#Read in the REVIGO output
revigoUP = read.table('REVIGO-UP.csv', sep = ',', header = TRUE)

#Sort by p-value and extract the top 20
revigoUP = revigoUP[order(revigoUP$log10.p.value),]
revigoUP = head(revigoUP, n = 20)

#Convert the GO terms to factors for compatibility with ggplot2
revigoUP$description <- factor(revigoUP$description, levels = revigoUP$description)

#Plot the barplot
library(ggplot2)
p <- ggplot(data = revigoUP, aes(x = log10.p.value, y = description, fill = description)) +
  geom_bar(stat = "identity")

p + scale_fill_manual(values = rep("steelblue2", nrow(revigoUP))) + theme_minimal() + theme(legend.position = "none") +
        ylab('')

Gene Set Enrichment Analysis

Gene set enrichment analysis (GSEA) will be used to test for the altered expression of pre-defined sets of genes.

BiocManager::install("piano")
library(piano)

Resources

Many resources were used in building this RNA-seq tutorial.

Highly recommended RNA-seq tutorial series:

Introduction to differential gene expression analysis

cqn Normalisation

The count data needs to be normalised for several confounding factors. The number of reads (or fragments for paired-end data) mapped to a gene is influenced by (1) its GC content, (2) its length and (3) the total library size for the sample. There are multiple methods used for normalisation. Here, conditional quantile normalisation (cqn) is used as recommended by Mandelboum et al. (2019) to correct for sample-specific biases. Cqn is described by Hansen et al. (2012).

cqn requires an input of gene length, GC content and the library size per sample (which it will estimate as the total sum of the counts if not provided by the user). For more guidance on how to normalise using cqn and import into DESeq2, the user is directed to the cqn vignette by Hansen & Wu and the tximport vignette by Love, Soneson & Robinson.

#Read in the gene lengths and gc-content data frame (provided in this repository)
genes.length.gc = read.table('gencode-v36-gene-length-gc.txt', sep = '\t')

At this stage, technical replicates can be combined if they have not been already. This is typically achieved by summing the counts.
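As a minimal sketch of summing technical replicates, the base R function rowsum() can add up the columns belonging to the same sample. The count matrix, column names and grouping vector below are invented for illustration; in practice the grouping would come from your own sample sheet.

```r
# Toy count matrix: genes in rows, sequencing runs in columns
toy.counts <- matrix(
  c( 10, 12, 30,
      0,  1,  5,
    100, 90, 40),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("s1_rep1", "s1_rep2", "s2_rep1"))
)

# Which sample each column (technical replicate) belongs to
sample.of.column <- c("s1", "s1", "s2")

# rowsum() sums rows by group, so transpose, sum by sample, transpose back
combined <- t(rowsum(t(toy.counts), group = sample.of.column))

print(combined)
```

The result has one column per sample, with the two s1 runs summed and the single s2 run left unchanged.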

To carry out the normalisation:

library(cqn)
#cqn normalisation
counts = counts.imported$counts

#Exclude genes with no length information, for compatibility with cqn.
#A logical index is used so that no genes are dropped when none have missing lengths.
counts = counts[!is.na(genes.length.gc[rownames(counts),]$length),]

#Extract the lengths and GC contents for genes in the same order as the counts data-frame
geneslengths = genes.length.gc[rownames(counts),]$length
genesgc = genes.length.gc[rownames(counts),]$gc

#Run the cqn normalisation 
cqn.results <- cqn(counts, genesgc, geneslengths, lengthMethod = c("smooth"))

#Extract the offset, which will be input directly into DESeq2 to normalise the counts. 
cqnoffset <- cqn.results$glm.offset
cqnNormFactors <- exp(cqnoffset)

#The 'counts.imported' object from tximport also contains matrices for 'length' and 'abundance'.
#These should also be subset to remove any genes excluded by the 'NA' length filter
counts.imported$abundance = counts.imported$abundance[rownames(counts),]
counts.imported$counts = counts.imported$counts[rownames(counts),]
counts.imported$length = counts.imported$length[rownames(counts),]

The normalised gene expression values can be saved as a cqn output. These values will not be used for the downstream differential expression analysis; rather, they are useful for visualisation. Differential expression will be calculated within DESeq2 using a negative binomial model, to which the cqn offset will be added.

#The normalised gene expression values (on the log2 RPKM scale) can be saved as:
RPKM.cqn <- cqn.results$y + cqn.results$offset

Biological replicate correlation

The correlation between the expression of genes in two biological replicates should ideally be very high. The normalised expression values, saved above as RPKM.cqn, will be used.

#To test the correlation between the first two samples in columns 1 and 2
plot(RPKM.cqn[,1], RPKM.cqn[,2], pch = 18, cex = 0.5, xlab = colnames(RPKM.cqn)[1], ylab = colnames(RPKM.cqn)[2])

#The Pearson correlation coefficient can be calculated as:
cor(RPKM.cqn[,1], RPKM.cqn[,2])

#Add it to your plot, replacing x and y with the coordinates for your legend
text(x, y, labels = paste0('r=', round(cor(RPKM.cqn[,1], RPKM.cqn[,2]),2)))

#To add a regression line (y ~ x, matching the axes of the plot above)
abline(lm(RPKM.cqn[,2] ~ RPKM.cqn[,1]), col = 'red')
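To compare every pair of samples at once rather than two columns at a time, cor() can be applied to the whole matrix. The sketch below uses simulated values in place of RPKM.cqn; the matrix toy.expr and its column names are hypothetical.

```r
# Simulate three replicates that share the same underlying expression profile
set.seed(1)
base.expr <- rnorm(200, mean = 5, sd = 2)
toy.expr <- cbind(
  rep1 = base.expr + rnorm(200, sd = 0.2),
  rep2 = base.expr + rnorm(200, sd = 0.2),
  rep3 = base.expr + rnorm(200, sd = 0.2)
)

# Pairwise Pearson correlation between all columns (samples)
cor.matrix <- cor(toy.expr, method = "pearson")

print(round(cor.matrix, 2))
```

With real data, replacing toy.expr with RPKM.cqn would give a samples-by-samples matrix in which low off-diagonal values flag replicates worth inspecting.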
