Pinfish is a collection of tools helping to make sense of long transcriptomics data (long cDNA reads, direct RNA reads). The toolchain is composed of the following tools:
- spliced_bam2gff - a tool for converting sorted BAM files containing spliced alignments (generated by minimap2 or GMAP) into GFF2 format. Each read is represented as a distinct transcript. This tool comes in handy when visualizing spliced reads at particular loci and for providing input to the rest of the toolchain.
- cluster_gff - this tool takes a sorted GFF2 file as input, clusters together reads having similar exon/intron structure, and creates a rough consensus for each cluster by taking the median of the exon boundaries from all transcripts in the cluster.
- polish_clusters - this tool takes the cluster definitions generated by cluster_gff and, for each cluster, creates an error-corrected read by mapping all reads onto the read with the median length (using minimap2) and polishing it using racon. The polished reads can be mapped to the genome using minimap2 or GMAP.
- collapse_partials - this tool takes GFFs generated by either cluster_gff or polish_clusters and filters out transcripts which are likely to be based on RNA degradation products from the 5' end. The tool clusters the input transcripts into "loci" by their 3' ends and discards transcripts which have a compatible transcript with more exons in the same locus.
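The "median of exon boundaries" consensus used by cluster_gff can be illustrated with a minimal Python sketch. This is not the tool's actual code: the function name consensus_exons and the list-of-tuples data layout are hypothetical, and using the low median for even-sized clusters is an assumption, since the description does not say how ties are handled.

```python
from statistics import median_low


def consensus_exons(transcripts):
    """Build a rough consensus from transcripts with the same exon count.

    Each transcript is a list of (start, end) exon tuples in genomic order.
    Hypothetical helper: the real cluster_gff implementation may differ.
    """
    if not transcripts:
        raise ValueError("empty cluster")
    n_exons = len(transcripts[0])
    if any(len(t) != n_exons for t in transcripts):
        raise ValueError("transcripts in a cluster must have equal exon counts")
    consensus = []
    for i in range(n_exons):
        # Assumption: the low median is used so boundaries stay integers.
        start = median_low([t[i][0] for t in transcripts])
        end = median_low([t[i][1] for t in transcripts])
        consensus.append((start, end))
    return consensus


cluster = [
    [(100, 200), (300, 400)],
    [(102, 201), (299, 405)],
    [(98, 199), (301, 398)],
]
print(consensus_exons(cluster))  # [(100, 200), (300, 400)]
```

Taking the median rather than the mean keeps a single read with a badly misplaced boundary from shifting the consensus.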
Pinfish is largely inspired by the Mandalorion pipeline. It is meant to provide a quick way for generating annotations from long reads only and it is not meant to provide the same functionality as pipelines using a broader strategy for annotation (such as LoReAn).
The pinfish tools can be run via a Snakemake pipeline which handles the alignment tasks using minimap2.
Static Linux binaries for the x86_64 platform are included in the respective subdirectories of the source tree. To install them, simply copy them somewhere in your PATH.
The polish_clusters tool depends on the following software: minimap2 and racon.
Compiling the tools from source requires a working Go compiler installation and the following packages installed via go get:
After installing the dependencies, simply issue make in the respective subdirectory.
Usage of spliced_bam2gff:
-M Input is from minimap2.
-V Print out version.
-g Use strand tag as feature orientation then read strand if not available.
-h Print out help message.
-s Use read strand (from BAM flag) as feature orientation.
-t int
Number of cores to use. (default 4)
By default the tool looks for the XS tag to determine transcript orientation, unless the -M flag is specified, in which case the input is assumed to come from minimap2 and the ts tag is used instead (with different rules to determine the final orientation).
If no orientation tag is found, the orientation is set to "." (unoriented), unless the -g flag is provided, in which case the read orientation from the BAM flag is used.
If the -s flag is specified, all the rules above are ignored and the orientation is set to the read strand from the BAM flag (appropriate for stranded protocols).
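The precedence of these rules can be summarised in a short Python sketch. It is an illustration only, not the tool's code: the function pick_orientation and its keyword arguments are hypothetical names. The text above only says that ts follows "different rules"; the sketch assumes that ts is expressed relative to the read strand (so it is flipped for reverse-strand reads), which matches how minimap2 describes the tag but is not confirmed by this README.

```python
def pick_orientation(tags, read_is_reverse, minimap2=False,
                     fallback_to_read=False, force_read_strand=False):
    """Return '+', '-' or '.' for one alignment.

    tags: dict of BAM auxiliary tags, e.g. {"XS": "+"} or {"ts": "-"}.
    minimap2 / fallback_to_read / force_read_strand stand in for the
    -M / -g / -s flags (hypothetical parameter names).
    """
    read_strand = "-" if read_is_reverse else "+"
    if force_read_strand:  # -s: ignore all tags
        return read_strand
    if minimap2:  # -M: use the ts tag
        ts = tags.get("ts")
        if ts in ("+", "-"):
            # Assumption: ts is relative to the read strand, so '+' keeps
            # the read strand and '-' flips it.
            if ts == "+":
                return read_strand
            return "+" if read_is_reverse else "-"
    else:  # default: use the XS tag as given
        xs = tags.get("XS")
        if xs in ("+", "-"):
            return xs
    if fallback_to_read:  # -g: no usable tag, fall back to the read strand
        return read_strand
    return "."  # unoriented


print(pick_orientation({"XS": "-"}, read_is_reverse=False))
print(pick_orientation({}, read_is_reverse=True))
print(pick_orientation({}, read_is_reverse=True, fallback_to_read=True))
```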
Example run with minimap2 input:
spliced_bam2gff -M minimap_sorted.bam > raw_transcripts.gff
Example run with minimap2 input, stranded mode:
spliced_bam2gff -s minimap_sorted.bam > raw_transcripts.gff
Example run with GMAP input:
spliced_bam2gff gmap_sorted.bam > raw_transcripts.gff
Usage of ./cluster_gff:
-V Print out version.
-a string
Write clusters in tabular format in this file.
-c int
Minimum cluster size. (default 10)
-d int
Exon boundary tolerance. (default 10)
-e int
Terminal exons boundary tolerance. (default 30)
-h Print out help message.
-p float
Minimum isoform percentage. (default 1)
-prof string
Write out CPU profiling information.
-t int
Number of cores to use. (default 4)
The -e parameter is the maximum distance tolerated at the start of the first exon and the end of the last exon, while -d is the tolerance for all other exon boundaries.
Transcript clusters having size less than the -c parameter are discarded. This parameter has the largest effect on the sensitivity and specificity of transcript reconstruction. Larger values usually lead to higher specificity at the expense of lowering sensitivity.
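How the -d, -e and -c parameters interact can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the cluster_gff implementation: same_structure and filter_clusters are hypothetical helpers, exons are (start, end) tuples in genomic order, and two transcripts are only compared when they have the same number of exons.

```python
def same_structure(a, b, internal_tol=10, terminal_tol=30):
    """Check whether two transcripts have matching exon boundaries.

    internal_tol mirrors -d and terminal_tol mirrors -e (defaults taken
    from the usage message above). Hypothetical helper for illustration.
    """
    if len(a) != len(b):
        return False
    last = len(a) - 1
    for i, ((a_start, a_end), (b_start, b_end)) in enumerate(zip(a, b)):
        # Only the start of the first exon and the end of the last exon
        # get the looser terminal tolerance.
        start_tol = terminal_tol if i == 0 else internal_tol
        end_tol = terminal_tol if i == last else internal_tol
        if abs(a_start - b_start) > start_tol or abs(a_end - b_end) > end_tol:
            return False
    return True


def filter_clusters(clusters, min_size=10):
    """Drop clusters with fewer than min_size members (mirrors -c)."""
    return [c for c in clusters if len(c) >= min_size]


ref = [(100, 200), (300, 400)]
print(same_structure(ref, [(125, 205), (295, 428)]))  # True
print(same_structure(ref, [(100, 215), (300, 400)]))  # False
```

In the second comparison the internal boundary differs by 15 bases, which exceeds the default -d of 10, so the transcripts would not be considered to share a structure in this sketch.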
Example run with default minimum cluster size and tolerance values:
cluster_gff -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff
Example run with custom parameters:
cluster_gff -c 5 -e 50 -d 5 -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff
Usage of ./polish_clusters:
-V Print out version.
-a string
Read cluster memberships in tabular format.
-c int
Minimum cluster size. (default 1)
-d string
Location of temporary directory.
-h Print out help message.
-m Do not load all reads in memory (slower).
-o string
Output fasta file.
-t int
Number of cores to use. (default 4)
-x string
Arguments passed to minimap2.
-y string
Arguments passed to racon.
Example run:
polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 40 sorted.bam
The resulting consensus transcripts can be mapped to the genome using minimap2.
Usage of ./collapse_partials:
-M Discard monoexonic transcripts.
-U Discard transcripts which are not oriented.
-V Print out version.
-d int
Internal exon boundary tolerance. (default 5)
-e int
Three prime exons boundary tolerance. (default 30)
-f int
Five prime exons boundary tolerance. (default 5000)
-h Print out help message.
-prof string
Write out CPU profiling information.
-t int
Number of cores to use. (default 4)
The -d parameter is the exon boundary difference tolerated at internal splice sites, while -e and -f are the tolerance values at the 3' and 5' ends, respectively. Transcripts which are not oriented are all assigned to distinct "loci" and left untouched by default (but see the -U flag).
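The idea of discarding a transcript that is compatible with a longer one can be sketched in Python. This is a rough illustration and not the collapse_partials algorithm: is_partial_of and collapse are hypothetical names, only plus-strand transcripts are handled (exons in genomic order, so the 3' end is the end of the last exon), the grouping into "loci" is skipped, and the way the 5' tolerance is applied to the first exon of the shorter transcript is an assumption.

```python
def is_partial_of(partial, full, internal_tol=5, three_tol=30, five_tol=5000):
    """True if `partial` looks like a 5'-truncated version of `full`.

    Plus strand only. The tolerances mirror -d, -e and -f from the usage
    message above; how the real tool applies them is an assumption here.
    """
    k = len(partial)
    if k == 0 or k >= len(full):
        return False  # the longer transcript must have more exons
    tail = full[-k:]  # exons of `full` nearest its 3' end
    for i, ((p_start, p_end), (f_start, f_end)) in enumerate(zip(partial, tail)):
        start_tol = five_tol if i == 0 else internal_tol
        end_tol = three_tol if i == k - 1 else internal_tol
        if abs(p_start - f_start) > start_tol or abs(p_end - f_end) > end_tol:
            return False
    return True


def collapse(transcripts, **tols):
    """Keep transcripts that are not partial copies of any other one."""
    return [t for t in transcripts
            if not any(is_partial_of(t, other, **tols) for other in transcripts)]


full = [(100, 200), (300, 400), (500, 600)]
partial = [(350, 400), (500, 610)]   # truncated at the 5' end
other = [(300, 450), (500, 600)]     # different internal splice site
print(collapse([full, partial, other]) == [full, other])  # True
```

The large default for -f reflects that a degraded read can start almost anywhere inside its first surviving exon, while the splice sites and the 3' end are expected to agree closely.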
Example run:
collapse_partials -d 10 -e 35 -f 1000 input.gff > collapsed_output.gff
For running tests the following dependencies have to be installed:
Both are easy to install using bioconda.
Look into the Makefiles for targets testing the tools on simulated and real data.
(c) 2018 Oxford Nanopore Technologies Ltd.
This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.
- The GFF2 files can be visualised using IGV.
- The GFF2 files can be converted to GFF3 or GTF using the gffread utility.
See the post announcing the tool on the Oxford Nanopore Technologies community site.