olarerin/metaPlotR

Summary

MetaPlotR is a Perl/R pipeline for creating metagene plots. A metagene is a density plot or histogram of sites of interest (e.g. protein binding sites or RNA modifications) along a simplified transcript model containing a 5’UTR, coding sequence and 3’UTR.

Requirements

Unix/Linux based operating system (tested with Debian 7.8 and OS X 10.10.5)
Perl (tested with version 5.22.2)
R (tested with version 3.2.2); "scales" and "ggplot2" packages
Bedtools (tested with version 2.22.1)

Prepare primary data

Create query bed file

A six-column BED file (i.e. BED6) is required (see here for specifications -- Note: columns 4 and 5 can be arbitrary). This tutorial uses a BED file of N6-methyladenosine (m6A) sites generated from Linder et al. Nat. Methods, 2015 (m6a.sorted.bed) and one of pseudouridine sites (pseudoU) genrated from Schraga et al. Cell, 2014 (pseudoU.sorted.bed). The sample BED files are located in the Github repository along with the MetaPlotR scripts. These files were sorted using the Unix sort command:

sort -k1,1 -k2,2n m6a.bed > m6a.sorted.bed

sort -k1,1 -k2,2n pseudoU.bed > pseudoU.sorted.bed

NOTE: MetaPlotR expects a bed file with 0-based single nucleotide coordinates.

Download genome and annotation file

Download genome of interest from the UCSC genome browser download page (http://hgdownload.soe.ucsc.edu/downloads.html). Here we use the hg19 human genome located here (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz).

Next, download the extended gene prediction tables from the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables). The figure below shows the necessary drop-down options to download the gencode gene annotations for the hg19 human genome.

Pre-process data

make_annot_bed.pl creates a master annotation file (bed format) of every nucleotide in the transcriptome. The script is supplied with the locations of the genome directory (chroms/) and the gene prediction table (hg19_gencode.genePred):

perl make_annot_bed.pl --genomeDir chroms/ --genePred hg19_gencode.genePred > hg19_annot.bed

Sort the master annotation file using the unix sort command:

sort -k1,1 -k2,2n hg19_annot.bed > hg19_annot.sorted.bed

size_of_cds_utrs.pl creates a file cataloging the transcriptomic coordinates of the start and end sites of the transcript regions (i.e. 5’UTR, CDS and 3’UTR). It takes the sorted master annotation file as input (hg19_annot.sorted.bed) and outputs a region annotation file. The region annotation file is necessary for determining the distance of queried sites from the transcriptomic features (i.e. transcriptional start site, start codon, stop codon and transcript end).

perl size_of_cds_utrs.pl --annot hg19_annot.sorted.bed > region_sizes.txt

annotate_bed_file.pl annotates the user supplied BED file (demonstrated with m6a.sorted.bed) containing single nucleotide genomic coordinates of sites of interest. It serves as a wrapper for Bedtools Intersect and essentially labels every line in the user supplied BED file with the matching line (i.e. same coordinates) in the master annotation file (hg19_annot.sorted.bed). The outputted file is called the annotated query file.

perl annotate_bed_file.pl --bed m6a.sorted.bed --bed2 hg19_annot.sorted.bed > annot_m6a.sorted.bed

Alternatively, Bedtools intersect can be evoked directly using the command:

intersectBed -a m6a.sorted.bed -b hg19_annot.sorted.bed -sorted -wo -s > annot_m6a.sorted.bed

rel_and_abs_dist_calc.pl identifies the region of the transcript in which the user supplied sites fall and converts the transcriptomic coordinates to metagene coordinates. Namely, sites that occur in the 5’UTR have a value from 0 to 1, where 0 and 1 represent the 5’ and 3’ ends of the 5’UTR, respectively. Similarly, sites in the CDS have a value from 1 to 2 and the 3’UTR 2 to 3. The script takes as input the annotated query file annot_miclip.cims.bed and the region annotation file utr_cds_ends.txt. The outputted distance measure file contains all the values needed to plot the metagenes.

perl rel_and_abs_dist_calc.pl --bed annot_m6a.sorted.bed --regions region_sizes.txt > m6a.dist.measures.txt

Understanding the distance measure file

All proceeding code are in R (https://www.r-project.org/). We recommend working with R using RStudio (https://www.rstudio.com/). The input for this section is the metagene coordinates file outputted from rel_and_abs_dist_calc.pl (available on Github)

Read in distance measure file for m6A

m6a.dist <- read.delim ("m6a.dist.measures.txt", header = T)

View the number of rows and columns in the dataset

dim(m6a.dist)

## [1] 20903    14

View the first few lines

head(m6a.dist)

##    chr  coord gene_name          refseqID rel_location utr5_st utr5_end
## 1 chr1 878151    SAMD11 ENST00000342066.3     1.624145    1359     1277
## 2 chr1 879955     NOC2L ENST00000327044.6     2.242857    2418     2369
## 3 chr1 934375      HES4 ENST00000304952.6     2.673684     867      730
## 4 chr1 934375      HES4 ENST00000428771.2     2.659794    1006      808
## 5 chr1 934375      HES4 ENST00000484667.2     2.719101     641      634
## 6 chr1 934423      HES4 ENST00000304952.6     2.168421     819      682
##   cds_st cds_end utr3_st utr3_end utr5_size cds_size utr3_size
## 1   1276    -769    -770    -1191        82     2045       421
## 2   2368     119     118     -371        49     2249       489
## 3    729      64      63      -31       137      665        94
## 4    807      64      63      -33       198      743        96
## 5    633      64      63      -25         7      569        88
## 6    681      16      15      -79       137      665        94

This input file contains 20903 rows and 14 columns. Each row represents a single site (in this example an m6A site). The column headers for the first four columns are self explanatory. The fifth column "rel_location" (for relative location) contains the calculated metagene coordinates. In its simplest form (i.e. non-normalized), the metagene coordinates from 0 to 1 represent the 5'UTR with 0 being closer to the beginning of the 5'UTR and 1 closer to the end. Similarly, 1 to 2 represents the CDS and 2 to 3 the 3'UTR. A histogram/density plot of the "rel_location" value gives the standard metagene.

In addition to the standard metagene which is based on the relative location of sites in transcripts, this next six columns (utr5_st, utr5_end, cds_st, cds_end, utr3_st, utr3_end) contain information for plotting the absolute distance of sites from several points of interest. For example, in this dataset the third row has a value of +63 under column header "utr3_st". That means the site is 63 nucleotides upstream of the 3'UTR start site.

The last three columns contain the lengths of the 5'UTRs, coding sequences and 3'UTRs.

Selecting gene isoforms for metagene analysis

The dataset is redundant -- a given site is represented by multiple transcript isoforms. The choice of which isoforms to choose should be informed by the underlying biology. For example, if a gene expression dataset is available, one option may be to pick the highest expressed isoform. Another option is to pick the longest isoform, which is likely to capture more sites. Below is sample code for picking the longest isoforms

# Determine longest length transcript for each gene
trx_len <- m6a.dist$utr5_size + m6a.dist$cds_size + m6a.dist$utr3_size # Determine transcript length
temp <- data.frame(m6a.dist$gene_name, m6a.dist$refseqID, trx_len)
colnames(temp) <- c("gene_name", "gid", "trx_len") 
temp.df <- temp[order(temp$gene_name, temp$gid, -temp$trx_len),]
temp.df <- temp[!duplicated(temp$gene_name),]

# limit m6a data to one transcript per gene (longest)
m6a.dist <- m6a.dist[m6a.dist$refseqID %in% temp.df$gid,]

# View size of our dataset (rows, columns)
dim(m6a.dist)

## [1] 20903    14

Visualizing the metagene

A simple histogram

library (ggplot2)

## Warning: package 'ggplot2' was built under R version 3.1.3

qplot(m6a.dist$rel_location, geom="histogram") + geom_vline(xintercept = 1:2, col = "grey") + theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In this plot, the range 0 to 1 represents the 5'UTR, 1 to 2 the CDS, and 2 to 3 the 3'UTR (as delineated by the red vertical lines). From this figure, one may conclude that the events (in this case m6A sites) occur througout the gene body with a peak around the stop codon and a precipitous transition from the 5'UTR to the CDS. However, one caveat is that the three regions of interest are drawn with equal widths. On average, this is not the case. We can view the average lenghts in this dataset:

summary(data.frame(m6a.dist$utr5_size, m6a.dist$cds_size, m6a.dist$utr3_size))

##  m6a.dist.utr5_size m6a.dist.cds_size m6a.dist.utr3_size
##  Min.   :    0.0    Min.   :   38     Min.   :    0     
##  1st Qu.:   94.0    1st Qu.:  794     1st Qu.:  308     
##  Median :  195.0    Median : 1394     Median :  718     
##  Mean   :  275.7    Mean   : 1865     Mean   : 1173     
##  3rd Qu.:  347.0    3rd Qu.: 2318     3rd Qu.: 1541     
##  Max.   :14959.0    Max.   :26393     Max.   :17331

The median lengths are 195, 1394, and 718 for the 5'UTR, CDS and 3'UTR, respectively.

To account for these length differences in the metagene, we can re-scale the widths of the 5'UTR and 3'UTR relative to the CDS (which is set constant to a width of 1 unit). So first we calculate a simple scale factor (SF):

utr5.SF <- median(m6a.dist$utr5_size, na.rm = T)/median(m6a.dist$cds_size, na.rm = T)
utr3.SF <- median(m6a.dist$utr3_size, na.rm = T)/median(m6a.dist$cds_size, na.rm = T)

The SF for the 5'UTR is 0.14 and for the 3'UTR is 0.52. Next, the following code rescales these regions accordingly:

# assign the regions to new dataframes
utr5.m6a.dist <- m6a.dist[m6a.dist$rel_location < 1, ]
cds.m6a.dist <- m6a.dist [m6a.dist$rel_location < 2 & m6a.dist$rel_location >= 1, ]
utr3.m6a.dist <- m6a.dist[m6a.dist$rel_location >= 2, ]

# rescale 5'UTR and 3'UTR
library("scales")

## Warning: package 'scales' was built under R version 3.1.3

utr5.m6a.dist$rel_location <- rescale(utr5.m6a.dist$rel_location, to = c(1-utr5.SF, 1), from = c(0,1))
utr3.m6a.dist$rel_location <- rescale(utr3.m6a.dist$rel_location, to = c(2, 2+utr3.SF), from = c(2,3))

Finally, plot the metagene with the rescaled UTRs

# Combine and plot
## Histogram
m6a.metagene.coord <- c(utr5.m6a.dist$rel_location, cds.m6a.dist$rel_location, utr3.m6a.dist$rel_location)
p <- qplot(m6a.metagene.coord, geom="histogram") + geom_vline(xintercept = 1:2, col = "grey") + theme_bw()

Alternate representations of the metagene

A frequency polygon

qplot(m6a.metagene.coord, geom="freqpoly") + geom_vline(xintercept = 1:2, col = "red") + theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A smooth density plot

qplot(m6a.metagene.coord, geom="density") + geom_vline(xintercept = 1:2, col = "red") + theme_bw()

Comparing multiple metagene plots

It is often desireable to compare two or more metagene plots. These plots can represent different experimental conditions (e.g. control vs treatment) or even different groups within the same dataset (e.g. highly vs lowly expressed genes). Below we demonstrate a comparsion of the metagene for m6A to that of pseudouridine.

Read in pseudouridine metagene coordinates file

Note: The pseudoU metagene coordinates data was derived from the input file "pseudo.dist.measures.txt" as above for the m6A dataset. It is provided here as a file to load directly for convenience. Otherwise, you can generate it as well by repeating the steps of this tutorial for the pseudouridine dataset.

# Read in metagene coordinates 
pseudoU.metagene.coord <- t(read.delim ("pseudoU.meta.coord.txt", header = T))

Draw combined plots of m6A and pseudouridine metagene

metagene.cord <- c(m6a.metagene.coord, pseudoU.metagene.coord)
mod <- c(rep("m6A", length(m6a.metagene.coord)), 
         rep("pseudoU", length(pseudoU.metagene.coord))) 
df <- data.frame(metagene.cord, mod)

ggplot(df) + geom_density(aes(x = metagene.cord, colour = mod)) + xlim(0, 3) + 
  theme_bw() + geom_vline(xintercept = 1:2, col = "grey")

As you can see, the pseudouridine metagene trace does not contain the prominent stop codon proximal peak seen in the m6A metagene, but is instead somewhat more uniform in the CDS and 3'UTR.

Mapping the absolute distance of sites from fixed features

An alternative to the metagene plot is a feature distance plot which shows the absolute distance of sites from a given transcriptomic feature (e.g. stop codon, transcription start site, etc). As discussed earlier, columns 6-11 of the dataset contains the absolute distance data.

head(m6a.dist[,6:11])

##   utr5_st utr5_end cds_st cds_end utr3_st utr3_end
## 1    1359     1277   1276    -769    -770    -1191
## 2    2418     2369   2368     119     118     -371
## 3     867      730    729      64      63      -31
## 4    1006      808    807      64      63      -33
## 5     641      634    633      64      63      -25
## 6     819      682    681      16      15      -79

For example, we can view the distribution of sites within 500 nucleotides of the stop codon:

qplot(m6a.dist$utr3_st, geom="histogram") + xlim (-500,500) + geom_vline(xintercept = 1:2, col = "grey") + theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 8151 rows containing non-finite values (stat_bin).

Final remarks

R is a powerful language and there are many customizations that can be made to all the plots shown above. This tutorial was meant to serve as a starting point for creating metagenes and exploring the underlying data using MetaPlotR.

olarerin / metaPlotR