rRMSAnalyzer: package to analyze RiboMethSeq data
RiboMethSeq is an RNAseq-based approach to analyze 2’O-ribose methylation (2’Ome). rRMSAnalyzer is an R package that provides a set of easy-to-use functions to compute C-scores from RiboMethSeq read end counts as input, adjust batch effect with ComBat-Seq, visualize the data and provide a table with the annotated human rRNA sites and their C-scores.
Note We have also developed a dedicated Nextflow pipeline to process the data from sequencing output (fastq files) to useful raw data for rRMSAnalyzer (read end counts).
- Installation
- Usage
- Help, bug reports and suggestions
- Acknowledgements
- Funding
- Data
- C-score calculation
- Quality control
- Sample manipulation
- RNA annotation
- Data visualization with PCA
- Export data
Installation
The latest version of rRMSAnalyzer package can be installed from Github with :
library(devtools)
devtools::install_github("RibosomeCRCL/rRMSAnalyzer")
Usage
library(rRMSAnalyzer)
ribo <- load_ribodata(
count_path = "/path/to/your/csvfiles/directory/",
metadata = "path/to/metadata.csv",
metadata_key = "filename",
metadata_id = "samplename")
# Compute the c-score using different parameters,
# including calculation of the local coverage using the mean instead of the median
ribo <- compute_cscore(ribo, method = "mean")
# If necessary, adjust any technical biases using ComBat-Seq.
# Here, as an example, we use the "library" column in metadata.
ribo <- adjust_bias(ribo,"library")
# Plot a Principal Component Analysis (PCA) whose colors depend on the "condition" column in metadata
plot_PCA(ribo,"condition")
Help, bug reports and suggestions
To report a bug or any suggestion to improve the package, please let us known by opening a new issue on: github issue link coming soon!
Acknowledgements
We would like to thank all our collaborators for their advices and suggestions.
Funding
This project has been funded by the French Cancer Institute (INCa, PLBIO 2019-138 MARACAS), the SIRIC Program (INCa-DGOS-Inserm_12563 LyRICAN) and Synergie Lyon Cancer Foundation.
Data
RiboClass
RiboClass is the main class of the package that allows to store both the data matrices (counts and c-scores) and the associated metadata. It is automatically created when calling load_ribodata (see Loading data).
It is a list containing three main elements, individually described below:
-
Data: a list of dataframe, containing for each sample the 5’ and/or 3’ read-end counts provided by the user, and the calculated c-score.
-
Metadata: a dataframe, containing all the information related to the samples that can be provided by the user.
-
RNA_names: a dataframe, reporting the names of the RNA used in Data.
Some main function calls’ parameters (such as the normalization method used for c-score computation) are also kept in the RiboClass object as a reminder.
Loading data
Here is an example, where all load_ribodata parameters are shown :
path <- system.file("extdata", package="rRMSAnalyzer")
ribo <- load_ribodata(
#data & metadata files path
count_path = file.path(path,"miniglioma/"),
metadata = file.path(path,"metadata.csv"),
# data & metadata files separator
count_sep = "\t",
metadata_sep = ",",
# count data parameters :
count_header = FALSE,
count_value = 3,
count_rnaid = 1,
count_pos = 2,
# Metadata parameters :
metadata_key = "filename",
metadata_id = "samplename",
# c-score parameters :
flanking = 6,
method = "median",
ncores = 1)
#> [SUCCESS] Your data have been imported and the following RiboClass has been created :
#> a RiboClass with 14 samples and 4 RNA(s) :
#> Name : NR_023363.1_5S, length : 121
#> Name : NR_046235.3_5.8S, length : 157
#> Name : NR_046235.3_18S, length : 1869
#> Name : NR_046235.3_28S, length : 5070
Data
To use this package, the user should provide one csv file with the 5’, 3’ or 5’/3’ read end counts resulting from RiboMethSeq data per sample. The folder structure containing the csv files is not important, as long as either the directory and its sub-directories contain the necessary csv files.
-
The name of the RNA on which the read end counting has been performed.
-
The number of the position on the RNA.
-
The value of the read end counts at the position.
Here is an example :
RNA | Position on RNA | read end count |
---|---|---|
18S | 123 | 3746 |
18S | 124 | 345 |
18S | 125 | 324 |
18S | 126 | 789 |
18S | 127 | 1234 |
Note 1: it is not necessary to provide an header in the count files, because column index can be used in the function load_ribodata, using count_value, count_rnaid and count_pos parameters.
Note 2: if no metadata is specified, rRMSAnalyzer will try to fetch any csv files in the folder specified in count_path and its subfolders.
Metadata
Two columns are mandatory for the metadata dataframe:
-
filename: name of the csv file on disk containing the read end counts described above. Do not modify it unless the filename has changed on disk.
-
samplename: rename the samples, which will be analyzed and shown on the plots. This column could be modified, as long as the sample names are unique.
Following these two mandatory columns, provide as many columns as needed for the analysis.
Here is an example of metadata for three samples:
filename (mandatory) |
samplename (mandatory) |
biological condition (optionnal) |
---|---|---|
sample1.csv | sample 1 | condition 1 |
sample2.csv | sample 2 | condition 1 |
sample3.csv | sample 3 | condition 2 |
Note: if no metadata are provided in load_ribodata, an empty metadata will be created with pre-completed “filename” and “samplename” columns. The “samplename” column will be identical to “filename”, but can be modified by the user.
Here is an example of auto-generated metadata:
filename | samplename |
---|---|
sample1.csv | sample1.csv |
sample2.csv | sample2.csv |
sample3.csv | sample3.csv |
RNA names
RNA names are stored in an auto-generated dataframe when loading data. It contains two columns :
-
original_name: original name of each RNA (e.g NR_023363.1).
-
current_name: current name of each RNA, reflecting any user’s change with rename_rna function (see Rename RNA).
This dataframe is used to keep track of the original name, which often contains NCBI’s accession ID.
Here is an example:
original_name | current_name |
---|---|
NR_023363.1_5S | 5S |
NR_046235.3_5.8S | 5.8S |
NR_046235.3_18S | 18S |
NR_046235.3_28S | 28S |
C-score calculation
What is C-score
The C-score corresponds to the 2’Ome level at a RNA position. The C-score represents a drop in the end read coverage at a given position compared to the environmental coverage, as described by @birkedal2014. The C-score can be of 0 (i.e., no RNA molecule is 2’Ome at the position of interest), of 1 (i.e., all the RNA molecules are 2’Ome at the position of interest) and of ]0:1[ (i.e., a mix of un-methylated and methylated RNA molecules).
Different C-scores can be determined to obtain robust estimation of 2’Ome level depending on the parameters used to compute the local coverage. In particular, the computation method and the size of the local coverage can be changed.
By default, the computation method of the local coverage corresponds to the median and the size of the local coverage corresponds to a flanking region of 6 (i.e., 6 nucleotides downstream the nucleotide n and 6 upstream the nucleotide n). This package offers the possibility to change these two parameters either when loading the data or during the analysis process.
C-score computation when loading data
When the function load_ribodata is used, a C-score is automatically computed for all genomic positions of the RNA. The C-score is computed using either the default parameters of the load_ribodata function or the user’s parameters as followed:
load_ribodata(count.path = "/path/to/csv/",
metadata = "/path/to/metadata.csv",
# everything below is linked to c-score computation
flanking = 6, # flanking region size
method = "median", # use mean or median on flanking region's values
ncores = 8 # number of CPU cores to use for computation
)
C-score computation during the analysis process
During the analysis, the parameters to compute the C-score can be modified using the compute_cscore function, which will automatically update the C-score in the RiboClass.
In the following example, both the flanking region’s size of the local coverage and the computation method have been modified:
ribo <- compute_cscore(ribo,
flanking = 4,
method = "mean")
Note: this function will override the previous c-score of the RiboClass.
Quality control
Identification of batch effect
Technical bias (i.e., batch effect) can be identified by plotting C-scores at all the genomic positions of the RNA for each sample on a PCA (see also [Visualization with PCA] for more usages).
Here is an example:
plot_PCA(ribo = ribo,
color_col = "run")
In this example, the technical replicates RNA1 and RNA2 included in library 1 and 2 respectively, are distant from each other on the PC1 axis. Moreover, the samples should not be grouped by library or batch. The following section will resolve this batch effect.
Batch effect adjustment
Batch effect of RiboMethSeq data can be adjusted using the ComBat-seq method (Paraqindes et al, in preparation; @zhang2020) . The rRMSAnalyzer package includes a wrapper (adjust_bias) to perform ComBat-seq adjustment that will return a new RiboClass with adjusted read end count values and C-scores automatically recomputed with the same setup parameters.
ribo_adjusted <- adjust_bias(ribo, batch = "run")
#> Found 2 batches
#> Using null model in ComBat-seq.
#> Adjusting for 0 covariate(s) or covariate level(s)
#> Estimating dispersions
#> Fitting the GLM model
#> Shrinkage off - using GLM estimates for parameters
#> Adjusting the data
#> Recomputing c-score with the following parameters :
#> - C-score method : median
#> - Flanking window : 6
Batch effect adjustment can be verified using the plot_PCA function using the new RiboClass:
plot_PCA(ribo_adjusted,"run")
After batch effect adjustment using ComBat-seq method, the two technical replicates RNA1 and RNA2 show reduced dispersion, and the samples are separated on the PCA axes independently of the library groups.
Sample manipulation
Keep or remove samples
A sample subset can be easily analyzed by indicating the samples to keep or to remove. The user can thus create a new RiboClass object containing the data and metadata of the samples of interest. In both cases, only the remaining samples’ metadata are kept in the RiboClass object, so no manual updating is required.
Here is an example to generate a new RiboClass by keeping two samples of interest (“S1” and “S2”):
ribo_2samples <- keep_ribo_samples(ribo_adjusted,c("S1","S2"))
print(ribo_2samples)
#> a RiboClass with 2 samples and 4 RNA(s) :
#> Name : NR_023363.1_5S, length : 121
#> Name : NR_046235.3_5.8S, length : 157
#> Name : NR_046235.3_18S, length : 1869
#> Name : NR_046235.3_28S, length : 5070
Here is an example to generate a new RiboClass by removing two samples (“S1” and “S2”):
ribo_removed_samples <- remove_ribo_samples(ribo,c("S1","S1"))
print(ribo_removed_samples)
#> a RiboClass with 13 samples and 4 RNA(s) :
#> Name : NR_023363.1_5S, length : 121
#> Name : NR_046235.3_5.8S, length : 157
#> Name : NR_046235.3_18S, length : 1869
#> Name : NR_046235.3_28S, length : 5070
In both cases, only the remaining samples’ metadata are kept in the RiboClass object. You do not need to update it by hand.
RNA annotation
RNA manipulation
Remove RNA
A RNA subset can be easily analyzed by indicating the RNA to remove. The user can thus create a new RiboClass object containing the data of the RNAs of interest, without affecting the samples’ metadata.
Here is an example where the RNA 5S is removed:
ribo_adjusted <- remove_rna(ribo, rna_to_remove = "NR_023363.1_5S")
print(ribo_adjusted)
#> a RiboClass with 14 samples and 3 RNA(s) :
#> Name : NR_046235.3_5.8S, length : 157
#> Name : NR_046235.3_18S, length : 1869
#> Name : NR_046235.3_28S, length : 5070
Rename RNA
Annotation of rRNA 2’Ome sites using the lists provided by this package requires the usage of specific RNA names.
Here is an example to check whether the RNA names provided by the user in the RiboClass match the ones used by this package :
data("human_methylated")
cat("human_methylated's rna names: ", unique(human_methylated$rRNA),"\n")
#> human_methylated's rna names: 5.8S 18S 28S
cat("ribo's rna names: ", as.character(ribo_adjusted$rna_names$current_name))
#> ribo's rna names: NR_046235.3_5.8S NR_046235.3_18S NR_046235.3_28S
In this example, the names are different and need to be updated before annotation.
The function rename_rna updates automatically the rRNA names that are given by rRNA size order:
ribo_adjusted <- rename_rna(ribo_adjusted,
new_names = c("5.8S", "18S", "28S"))
# from the shortest RNA in our RiboClass to the longest.
Annotation of RNA 2’Ome sites
This package computes a C-score for each genomic position of the RNAs. The C-scores of the RNA 2’Ome sites of interest are then extracted using a list of annotated sites and the annotate function.
Annotation included: Human 2’Ome rRNA sites
By default, this package includes two dataframes including the positions and the annotations of the human rRNA 2’Ome sites:
-
human_methylated: a dataframe, containing the 112 known 2’Ome sites for the human rRNAs.
-
human_suspected: a dataframe, containing the 17 sites that are putative 2’Ome sites for the human rRNAs, as described in the litterature.
Customize 2’Ome sites annotations
Instead of using the list of human rRNA 2’Ome sites included with this package, the user can provide its own list and annotation using the function annotate_site.
A dataframe with three mandatory columns should be provided :
-
RNA name: the name of the RNA, matching the RNA name of the RiboClass.
-
Position on RNA: the number of the position on the RNA.
-
Nomenclature: the name given to the site of interest.
You can see an example below :
Position | rRNA | Nomenclature |
---|---|---|
15 | 5.8S | Um14 |
76 | 5.8S | Gm75 |
28 | 18S | Am27 |
Annotate RNA sites
The 2’Ome sites of interest can be annotated using the annotate function with either the provided annotations or customized ones.
Here is an example with the provided human methylated annotations:
ribo_adjusted <- annotate_site(ribo_adjusted,
annot = human_methylated,
anno_rna = "rRNA",
anno_pos = "Position",
anno_value = "Nomenclature")
Note: if an issue occurs during the annotation process, the user should check the Rename RNA section.
This vignette has also some explanations on how to create your own sites annotation dataset with Customize 2’Ome sites annotations.
Data visualization with PCA
The function plot_PCA, which returns a ggplot, has been implemented with several parameters for more flexibility:
- only_annotated: plot samples based on the annotated RNA 2’Ome sites only to determine whether samples clustered depending on their rRNA 2’Ome profile.
Here is an example comparing samples reflecting different biological conditions based on the rRNA 2’Ome profile of the provided human_methylated list:
plot_PCA(ribo_adjusted,
color_col = "condition",
only_annotated = TRUE)
- axes: by default, PC1 and PC2 axes are plotted. However, the user can choose the PCA axes of interest using the “axes” parameter.
plot_PCA(ribo_adjusted,
color_col = "condition",
axes = c(2,3), #PC2 and PC3 will be plotted
only_annotated = TRUE)
- pca_object_only: return the full dudi.pca object by setting pca_object_only to True:
pca <- plot_PCA(ribo_adjusted,
color_col = "condition",
only_annotated = TRUE,
pca_object_only = TRUE)
Export data
Data can be exported as two different objects.
Export as a dataframe
The user can export data as a dataframe using the function extract_data.
By default, it will export c-score for all the genomic RNA positions.
ribo_df <- extract_data(ribo_adjusted,
col = "cscore")
The user can export data related to the subset of annotated RNA 2’Ome sites by setting the only_annotated parameter to True.
ribo_df <- extract_data(ribo_adjusted,
col = "cscore",
only_annotated = TRUE)
site | S1 | S2 | S3 |
---|---|---|---|
18S_Am27 | 0.9783333 | 0.9745514 | 0.9810412 |
18S_Am99 | 0.9680204 | 0.9646470 | 0.9724150 |
18S_Um116 | 0.9276274 | 0.9470968 | 0.9407083 |
18S_Um121 | 0.9630216 | 0.9684459 | 0.9691576 |
18S_Am159 | 0.9629730 | 0.9620986 | 0.9686766 |
18S_Am166 | 0.9809492 | 0.9744627 | 0.9744998 |
18S_Um172 | 0.9510189 | 0.9400922 | 0.9517981 |
18S_Cm174 | 0.9119119 | 0.8668456 | 0.8806886 |
18S_Um354 | 0.9709273 | 0.9722054 | 0.9756174 |
18S_Um428 | 0.9268668 | 0.9276018 | 0.9504224 |
excerpt from the output dataframe. S1, S2 and S3 are samples.
Export as a ggplot-ready dataframe
The user can export data as a ggplot-ready dataframe using the function format_to_plot.
By default, it will export c-score for all the genomic RNA positions. The user can export additional information present in the metadata by indicating the name of the column of interest. The user can export information related to the subset of annotated RNA 2’Ome sites by setting the only_annotated parameter to True.
Here is an example of ggplot-ready dataframe including the C-scores of all the genomic RNA positions as well as the condition related to the particular sample of interest:
ggplot_table <- format_to_plot(ribo_adjusted,"condition")
sample | site | cscore | condition | |
---|---|---|---|---|
501 | RNA1 | 18S_0501 | 0.0000000 | RNA ref |
502 | RNA1 | 18S_0502 | 0.2601804 | RNA ref |
503 | RNA1 | 18S_0503 | 0.0000000 | RNA ref |
504 | RNA1 | 18S_0504 | 0.3179712 | RNA ref |
505 | RNA1 | 18S_0505 | 0.5132894 | RNA ref |
506 | RNA1 | 18S_0506 | 0.6354548 | RNA ref |
507 | RNA1 | 18S_0507 | 0.0000000 | RNA ref |
508 | RNA1 | 18S_0508 | 0.0000000 | RNA ref |
509 | RNA1 | 18S_0509 | 0.0000000 | RNA ref |
510 | RNA1 | 18S_0510 | 0.9380812 | RNA ref |
Excerpt from the output ggplot-ready dataframe
Export as a dataframe by condition
The user can export a dataframe compiling the mean C-scores of each position by specific conditions provided in the metadata dataframe using the mean_samples_by_conditon function.
By default, it will export for all the genomic RNA positions the name of the position, the mean and sd of the C-scores. The user can also export mean and sd of read end count. The user can export information related to the subset of annotated RNA 2’Ome sites by setting the only_annotated parameter to True.
Here is an example of dataframe including the mean C-scores per conditions for all the genomic RNA positions:
mean_tb <- mean_samples_by_conditon(ribo_adjusted,
value = "cscore",
metadata_condition = "condition",
only_annotated = TRUE)
site | condition | mean | sd |
---|---|---|---|
18S_Am1031 | RNA ref | 0.9749922 | 0.0047461 |
18S_Am1031 | cond1 | 0.9703699 | 0.0032725 |
18S_Am1031 | cond2 | 0.9723714 | 0.0057723 |
18S_Am1383 | RNA ref | 0.9796876 | 0.0035939 |
18S_Am1383 | cond1 | 0.9765212 | 0.0027083 |
18S_Am1383 | cond2 | 0.9779684 | 0.0028759 |
18S_Am159 | RNA ref | 0.9684126 | 0.0022025 |
18S_Am159 | cond1 | 0.9609920 | 0.0054882 |
18S_Am159 | cond2 | 0.9623362 | 0.0065239 |
18S_Am166 | RNA ref | 0.9826490 | 0.0033169 |
Excerpt from the output dataframe by condition