An R package for finding non-adenosine poly(A) residues in Oxford Nanopore direct RNA sequencing reads
- It works on Oxford Nanopore direct RNA sequencing reads basecalled by Guppy software
- It requires tail delimitation data produced by Nanopolish software
- It allows both for the detection of non-adenosine residues within the poly(A) tails and visual inspection of read signals
Currently, ninetails can distinguish characteristic signatures of four types of nucleotides: adenosines (A), cytosines (C), guanosines (G), and uridines (U).
Ninetails relies on Nanopolish segmentation and therefore may underestimate terminal modifications (last and penultimate nucleotides of the tail).
The software is still under development, so all suggestions to improving it are welcome. Please note that the code contained herein may change frequently, so use it with caution.
Please be aware that signal transformations performed during analysis can place a heavy load on memory. This is especially true if your data covers the entire sequencing run. For the moment, ninetails does not offer the possibility of processing large data sets in chunks behind the scenes (under development). Therefore, to minimise the risk of unexpected crashes, it is highly recommended to split the output of the Nanopolish
polyA function into smaller files to make it easier to process the data in subsets and then merge the final results.
Ninetails requires the following input data to operate:
- multifast5 files basecalled by
Guppy
- for the signal data extraction - sequencing_summary.txt file - for file ID extraction
- an output of
Nanopolish
polya function (tsv file) - to obtain the tail segmentation data
Therefore, please make sure that the third-party software necessary for the steps preceding the use of ninetails is installed (Nanopolish
, Guppy
) and/or that you have all the required input files.
Currently, ninetails does not support single fast5 files as this format is deprecated by ONT. Before running the program on single fast5 files, you should convert them to multifast5 with another tool, for instance with ont-fast5-api
.
The neural network in ninetails uses the tensorflow backend, so it is necessary to install it before running the program.
Instructions for installing tensorflow
& keras
can be found here:
https://tensorflow.rstudio.com/install/
Ninetails requires also rhdf5
package for accessing & browsing files in fast5 format. It can be installed from Bioconductor (version available on CRAN is incompatible with newer R versions). The complete guide is available here: https://bioconductor.org/packages/release/bioc/html/rhdf5.html
install.packages("BiocManager")
BiocManager::install("rhdf5")
Currently, ninetails is not available on CRAN/Bioconductor, so you need to install it using devtools
.
If you do not have devtools
installed already, you can do this with the following command in R/RStudio:
install.packages("devtools")
Once you have devtools
installed, you can install ninetails using the command below in R/RStudio:
devtools::install_github('LRB-IIMCB/ninetails')
library(ninetails)
check_tails()
is the main function which allows to classify sequencing reads based on presence/absence of non-adenosine residues within their poly(A) tails (and additional conditions, such as minimal read length and qc_tag assigned by Nanopolish polya function).
Below is an example of how to use check_tails()
function:
results <- ninetails::check_tails(
nanopolish = system.file('extdata',
'test_data',
'nanopolish_output.tsv',
package = 'ninetails'),
sequencing_summary = system.file('extdata',
'test_data',
'sequencing_summary.txt',
package = 'ninetails'),
workspace = system.file('extdata',
'test_data',
'basecalled_fast5',
package = 'ninetails'),
num_cores = 2,
basecall_group = 'Basecall_1D_000',
pass_only=TRUE,
save_dir = '~/Downloads')
This function returns a list consisting of two tables: read_classes and nonadenosine_residues. In addition, the function saves results to text files in the user-specified directory.
Moreover, the function also creates a log file in the directory specified by the user.
The ninetails pipeline may be also launched without the wrapper - as sometimes it might be useful, especially if the input files are large and/or you would like to plot some produced matrices.
The first function in processing pipeline is create_tail_feature_list()
. It extracts the read data from the provided outputs and merges them based on read identifiers (readnames). This function works as follows:
tfl <- ninetails::create_tail_feature_list(
nanopolish = system.file('extdata',
'test_data',
'nanopolish_output.tsv',
package = 'ninetails'),
sequencing_summary = system.file('extdata',
'test_data',
'sequencing_summary.txt',
package = 'ninetails'),
workspace = system.file('extdata',
'test_data',
'basecalled_fast5',
package = 'ninetails'),
num_cores = 2,
basecall_group = 'Basecall_1D_000',
pass_only=TRUE)
The second function, create_tail_chunk_list()
, segments the reads and produces a list of segments in which a change of state (move = 1) along with significant local signal anomaly (so-called "pseudomove") has been recorded, possibly indicating the presence of a non-adenosine residue.
tcl <- ninetails::create_tail_chunk_list(tail_feature_list = tfl,
num_cores = 2)
The list of fragments should be then passed to the function create_gaf_list()
, which transforms the signals into gramian angular fields (GAFs). The function outputs a list of arrays (100,100,2). First channel of each array consists of gramian angular summation field (GASF), while the second channel consists of gramian angular difference field (GADF).
gl <- ninetails::create_gaf_list(tail_chunk_list = tcl,
num_cores = 2)
The penultimate function, predict_gaf_classes()
, launches the neural network to classify the input data. This function uses the tensorflow backend.
pl <- ninetails::predict_gaf_classes(gl)
The last function, create_outputs()
, allows to obtain the final output: a list composed of read_classes (reads are labelled accordingly as "modified", "unmodified" and "unclassified" based on applied criteria) and nonadenosine_residues (detailed positional info regarding detected nonadenosine residues) data frames. Note that in this form the function does not automatically save data to files.
out <- ninetails::create_outputs(
tail_feature_list = tfl,
tail_chunk_list = tcl,
nanopolish = system.file('extdata',
'test_data',
'nanopolish_output.tsv',
package = 'ninetails'),
predicted_list = pl,
num_cores = 2,
pass_only=TRUE)
column name | content |
---|---|
readname | an identifier of a given read (36 characters) |
polya_length | tail length estimation provided by nanopolish polya function |
qc_tag | quality tag assigned by nanopolish polya function |
class | the crude result of classification |
comments | a detailed description of met/unmet classification criteria |
The class
column contains information whether the given read was recognized as modified (containing non-adenosine residue) or not. Whereas the comment
column contains details underlying the classification outcome. These columns may contain a following content:
class | comments |
---|---|
modified | move transition present, nonA residue detected |
unmodified | move transition absent, nonA residue undetected |
unmodified | move transition present, nonA residue undetected |
unclassified | nanopolish qc failed |
unclassified | not included in the analysis (pass only = T) |
unclassified | insufficient read length |
column name | content |
---|---|
readname | an identifier of a given read (36 characters) |
prediction | the result of classification (basic model: C, G, U assignment) |
est_nonA_pos | the approximate nucleotide position where nonadenosine is to be expected; reported from 5' end |
polya_length | the tail length estimated according to Nanopolish polya function |
qc_tag | quality tag assigned by nanopolish polya function |
Ninetails has built-in functions plot_squiggle()
and plot_tail_range()
for plotting whole reads and the poly(A) tail region, respectively.
With their help you can visualise:
- raw signal (
rescale=FALSE
) - signal scaled to picoamperes [pA] per second [s] (
rescale=TRUE
)
In addition, the graphs can depict only signal data (moves=FALSE
) or they can also contain informations regarding the change of state between the subsequent k-mers (moves) (moves=TRUE
).
The following example demonstrates how to use the plot_squiggle()
function:
plot <- ninetails::plot_squiggle(
readname = "0226b5df-f9e5-4774-bbee-7719676f2ceb",
nanopolish = system.file('extdata',
'test_data',
'nanopolish_output.tsv',
package = 'ninetails'),
sequencing_summary = system.file('extdata',
'test_data',
'sequencing_summary.txt',
package = 'ninetails'),
workspace = system.file('extdata',
'test_data',
'basecalled_fast5',
package = 'ninetails'),
basecall_group = 'Basecall_1D_000',
moves = FALSE,
rescale = TRUE)
print(plot)
The output of the above command is the following graph:
If the (moves=TRUE
) argument is passed, then the graph also contains information regarding moves, which looks as follows:
The plot_tail_range()
function accepts the same arguments as the abovementioned function. An example usage looks as follows:
plot <- ninetails::plot_tail_range(
readname = "0226b5df-f9e5-4774-bbee-7719676f2ceb",
nanopolish = system.file('extdata',
'test_data',
'nanopolish_output.tsv',
package = 'ninetails'),
sequencing_summary = system.file('extdata',
'test_data',
'sequencing_summary.txt',
package = 'ninetails'),
workspace = system.file('extdata',
'test_data',
'basecalled_fast5',
package = 'ninetails'),
basecall_group = 'Basecall_1D_000',
moves = TRUE,
rescale = TRUE)
print(plot)
Or below, if the (moves=TRUE
) argument is passed:
Ninetails allows you to visualise a particular fragment among the list of fragments generated by the create_tail_chunk_list()
function. This is what the function plot_tail_chunk()
is for. This function only allows to preview the raw signal, currently there is no built-in scaling to picoamperes [pA].
An example of how the plot_tail_chunk()
function works:
example <- ninetails::plot_tail_chunk(
chunk_name = "5c2386e6-32e9-4e15-a5c7-2831f4750b2b_1",
tail_chunk_list = tcl)
print(example)
And an example output:
The package allows to create a visual representation of gramian angular fields (GAFs) using ggplot2
.
The plot_gaf()
function draws a single matrix of interest. It requires the name of a particular segment and a list of matrices produced by the create_gaf_list()
function as an input.
Below is an example of the usage of the plot_gaf()
function. Please note that in order for this example to work properly, one must first execute the 3 first commands from the Classification of reads using standalone functions section.
example_gaf <- ninetails::plot_gaf(
gaf_name = "5c2386e6-32e9-4e15-a5c7-2831f4750b2b_1",
gaf_list = gl,
save_file = TRUE)
print(example_gaf)
And here is an example output:
Ninetails also allows the user to plot the entire list of matrices produced by the create_gaf_list()
function at once. The files will be saved in the current working directory. An example of usage is given below:
ninetails::plot_multiple_gaf(gaf_list = gl,
num_cores = 10)
And this results in multiple plots, like this:
However, it is advisable to use this function with caution. The data contained in a gaf_list
object tends to be large. Drawing multiple graphs at once may overload the system and cause it to crash.
Please cite ninetails as: Gumińska N., Matylla-Kulińska K., Krawczyk P., Orzeł W., Maj M., Mroczek S., Dziembowski A. Direct detection of non-adenosine nucleotides within poly(A) tails – a new tool for the analysis of post-transcriptional mRNA tailing, 27th Annual Meeting of the RNA Society, Boulder, Colorado, May 31 to June 5, 2022.
Preprint is in the preparation.
- model finetuning
- optimized positioning (position calibration)
- port to Python
If you encounter a bug, please post it on github. To help diagnose the problem, send a minimal reproducible example (required inputs covering around 5-10 reads + corresponding nanopolish output & sequencing summary), so I will be able to reproduce the error and fix it for you.
Any issues regarding the ninetails should be addressed to Natalia Gumińska (nguminska (at) iimcb.gov.pl).