Tools to annotate genomes using long read transcriptomics data

Repository from GitHub: https://github.com/tleonardi/pinfish


pinfish

Pinfish is a collection of tools that help make sense of long-read transcriptomics data (long cDNA reads, direct RNA reads). The toolchain is composed of the following tools:

  • spliced_bam2gff - a tool for converting sorted BAM files containing spliced alignments (generated by minimap2 or GMAP) into GFF2 format. Each read is represented as a distinct transcript. This tool comes in handy for visualizing spliced reads at particular loci and for providing input to the rest of the toolchain.
  • cluster_gff - this tool takes a sorted GFF2 file as input and clusters together reads that have a similar exon/intron structure. It then creates a rough consensus for each cluster by taking the median of the exon boundaries from all transcripts in the cluster.
  • polish_clusters - this tool takes the cluster definitions generated by cluster_gff and, for each cluster, creates an error-corrected read by mapping all reads in the cluster onto the read with the median length (using minimap2) and polishing it using racon. The polished reads can be mapped to the genome using minimap2 or GMAP.
  • collapse_partials - this tool takes GFFs generated by either cluster_gff or polish_clusters and filters out transcripts that are likely to be based on RNA degradation products from the 5' end. The tool clusters the input transcripts into "loci" by their 3' ends and discards any transcript for which the same locus contains a compatible transcript with more exons.
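The median-based consensus described for cluster_gff can be illustrated with a minimal Go sketch. It is not the tool's actual implementation: the Exon type and the median and consensus functions are hypothetical names, and the sketch assumes that every transcript in a cluster has the same number of exons.

```go
package main

import (
	"fmt"
	"sort"
)

// Exon is a hypothetical type holding one exon's boundaries.
type Exon struct{ Start, End int }

// median returns the median of xs; for an even count it returns the
// integer mean of the two middle values.
func median(xs []int) int {
	s := append([]int(nil), xs...) // copy so the caller's slice is not reordered
	sort.Ints(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// consensus builds one transcript whose exon boundaries are the medians of
// the corresponding boundaries across the cluster.
// Assumption: all transcripts in the cluster have the same exon count.
func consensus(cluster [][]Exon) []Exon {
	if len(cluster) == 0 {
		return nil
	}
	out := make([]Exon, len(cluster[0]))
	for i := range out {
		starts := make([]int, 0, len(cluster))
		ends := make([]int, 0, len(cluster))
		for _, tr := range cluster {
			starts = append(starts, tr[i].Start)
			ends = append(ends, tr[i].End)
		}
		out[i] = Exon{median(starts), median(ends)}
	}
	return out
}

func main() {
	// Three reads with slightly different boundaries for the same two exons.
	cluster := [][]Exon{
		{{100, 200}, {300, 400}},
		{{102, 200}, {300, 405}},
		{{98, 201}, {299, 398}},
	}
	fmt.Println(consensus(cluster))
}
```

Taking the median rather than the mean keeps a single badly aligned read from shifting the consensus boundary.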

Pinfish is largely inspired by the Mandalorion pipeline. It is meant to provide a quick way to generate annotations from long reads alone; it is not meant to provide the same functionality as pipelines that use a broader annotation strategy (such as LoReAn).

The pinfish tools can be run via a Snakemake pipeline which handles the alignment tasks using minimap2.

Getting Started

Installation

Static Linux binaries for the x86_64 platform are included in the respective subdirectories of the source tree. To install them, copy them to a directory in your PATH.

The polish_clusters tool depends on the following software:

  • minimap2
  • racon

Dependencies and compiling from source

Compiling the tools from source requires a working Go compiler installation and the following packages installed via go get:

After installing the dependencies, run make in the respective subdirectory.

Usage

spliced_bam2gff

Usage of spliced_bam2gff:
  -M    Input is from minimap2.
  -V    Print out version.
  -g    Use strand tag as feature orientation then read strand if not available.
  -h    Print out help message.
  -s    Use read strand (from BAM flag) as feature orientation.
  -t int
        Number of cores to use. (default 4)

By default, the tool looks for the XS tag to determine transcript orientation. If the -M flag is specified, the input is assumed to come from minimap2 and the ts tag is used instead (with different rules for determining the final orientation).

If no orientation tag is found, then the orientation is set to ., unless the -g flag is provided, in which case the read orientation from the BAM flag is used.

If the -s flag is specified, all the rules above are ignored and the orientation is set to the read strand from the BAM flag (appropriate for stranded protocols).
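The precedence of these orientation rules can be summarised in a short Go sketch. It is an illustration only, not the tool's code: pickOrientation and its parameters are hypothetical names, and the tag-specific rules for XS versus ts are not modelled (tagStrand stands for whatever orientation the tag yields, or an empty string if no tag is present).

```go
package main

import "fmt"

// pickOrientation applies the flag precedence described above.
//   tagStrand:  orientation derived from the XS or ts tag ("+" or "-"),
//               or "" when no orientation tag is present
//   readStrand: strand of the read taken from the BAM flag
//   sFlag:      -s was given (always use the read strand)
//   gFlag:      -g was given (fall back to the read strand if no tag)
func pickOrientation(tagStrand, readStrand string, sFlag, gFlag bool) string {
	if sFlag {
		return readStrand // -s overrides every other rule
	}
	if tagStrand != "" {
		return tagStrand // an orientation tag was found
	}
	if gFlag {
		return readStrand // no tag, but -g allows the read strand
	}
	return "." // no tag and no fallback: unoriented
}

func main() {
	fmt.Println(pickOrientation("+", "-", false, false)) // tag present
	fmt.Println(pickOrientation("", "-", false, false))  // no tag, no flags
	fmt.Println(pickOrientation("", "-", false, true))   // no tag, -g
	fmt.Println(pickOrientation("+", "-", true, false))  // -s
}
```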

Example run with minimap2 input:

spliced_bam2gff -M minimap_sorted.bam > raw_transcripts.gff

Example run with minimap2 input, stranded mode:

spliced_bam2gff -s minimap_sorted.bam > raw_transcripts.gff

Example run with GMAP input:

spliced_bam2gff gmap_sorted.bam > raw_transcripts.gff

cluster_gff

Usage of ./cluster_gff:
  -V    Print out version.
  -a string
        Write clusters in tabular format in this file.
  -c int
        Minimum cluster size. (default 10)
  -d int
        Exon boundary tolerance. (default 10)
  -e int
        Terminal exons boundary tolerance. (default 30)
  -h    Print out help message.
  -p float
        Minimum isoform percentage. (default 1)
  -prof string
        Write out CPU profiling information.
  -t int
        Number of cores to use. (default 4)

The -e parameter is the maximum distance tolerated at the start of the first exon and the end of the last exon, while -d is the tolerance for all other exon boundaries.
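How the two tolerances apply to different boundaries can be sketched in Go as follows. This is a simplified illustration under stated assumptions, not the clustering algorithm itself: the Exon type and the similarStructure function are hypothetical, and the sketch assumes that two transcripts can only match if they have the same number of exons.

```go
package main

import "fmt"

// Exon is a hypothetical type holding one exon's boundaries.
type Exon struct{ Start, End int }

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// similarStructure reports whether a and b agree at every exon boundary.
// The start of the first exon and the end of the last exon may differ by
// up to e (the -e value); all other boundaries by up to d (the -d value).
// Assumption: transcripts with different exon counts never match.
func similarStructure(a, b []Exon, d, e int) bool {
	if len(a) == 0 || len(a) != len(b) {
		return false
	}
	last := len(a) - 1
	for i := range a {
		startTol, endTol := d, d
		if i == 0 {
			startTol = e // transcript start
		}
		if i == last {
			endTol = e // transcript end
		}
		if abs(a[i].Start-b[i].Start) > startTol || abs(a[i].End-b[i].End) > endTol {
			return false
		}
	}
	return true
}

func main() {
	a := []Exon{{100, 200}, {300, 400}}
	b := []Exon{{120, 205}, {295, 425}} // ragged ends, tight splice sites
	c := []Exon{{100, 215}, {300, 400}} // first splice site off by 15
	fmt.Println(similarStructure(a, b, 10, 30))
	fmt.Println(similarStructure(a, c, 10, 30))
}
```

With the default values (-d 10, -e 30), the first pair matches because only the transcript ends differ by more than 10, while the second pair fails on an internal splice site.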

Transcript clusters smaller than the -c parameter are discarded. This parameter has the largest effect on the sensitivity and specificity of transcript reconstruction. Larger values usually lead to higher specificity at the expense of lower sensitivity.
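The effect of the minimum cluster size can be seen in a small Go sketch. The cluster names and sizes are invented for illustration, and keepClusters is a hypothetical helper, not part of pinfish.

```go
package main

import (
	"fmt"
	"sort"
)

// keepClusters returns the names of clusters with at least minSize members.
// Names are sorted so that the output is stable.
func keepClusters(sizes map[string]int, minSize int) []string {
	var kept []string
	for name, n := range sizes {
		if n >= minSize {
			kept = append(kept, name)
		}
	}
	sort.Strings(kept)
	return kept
}

func main() {
	// Invented cluster sizes (number of reads per cluster).
	sizes := map[string]int{"cluster_a": 25, "cluster_b": 10, "cluster_c": 4}
	for _, c := range []int{5, 10, 20} {
		fmt.Printf("-c %d keeps %v\n", c, keepClusters(sizes, c))
	}
}
```

Raising the threshold removes low-support clusters first, which is why larger values trade sensitivity for specificity.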

Example run with default minimum cluster size and tolerance values:

cluster_gff -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff

Example run with custom parameters:

cluster_gff -c 5 -e 50 -d 5 -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff

polish_clusters

Usage of ./polish_clusters:
  -V    Print out version.
  -a string
        Read cluster memberships in tabular format.
  -c int
        Minimum cluster size. (default 1)
  -d string
        Location of temporary directory.
  -h    Print out help message.
  -m    Do not load all reads in memory (slower).
  -o string
        Output fasta file.
  -t int
        Number of cores to use. (default 4)
  -x string
        Arguments passed to minimap2.
  -y string
        Arguments passed to racon.

Example run:

polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 40 sorted.bam

The resulting consensus transcripts can be mapped to the genome using minimap2.

collapse_partials

Usage of ./collapse_partials:
  -M    Discard monoexonic transcripts.
  -U    Discard transcripts which are not oriented.
  -V    Print out version.
  -d int
        Internal exon boundary tolerance. (default 5)
  -e int
        Three prime exons boundary tolerance. (default 30)
  -f int
        Five prime exons boundary tolerance. (default 5000)
  -h    Print out help message.
  -prof string
        Write out CPU profiling information.
  -t int
        Number of cores to use. (default 4)

The -d parameter is the exon boundary difference tolerated at internal splice sites, while -e and -f are the tolerance values at the 3' and 5' ends, respectively. Transcripts which are not oriented are all assigned to distinct "loci" and left untouched by default (but see the -U flag).
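The following Go sketch shows one way the three tolerances could combine when deciding whether a transcript is a 5'-truncated version of a longer one. It is a rough illustration, not the tool's algorithm: the Exon type and isPartialOf are hypothetical, only plus-strand transcripts are handled (so the 3' end is the end of the last exon), and it assumes that the partial transcript's 5' end is compared with the start of the exon it overlaps in the longer transcript.

```go
package main

import "fmt"

// Exon is a hypothetical type holding one exon's boundaries.
type Exon struct{ Start, End int }

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// isPartialOf reports whether partial looks like a 5'-truncated copy of
// full on the plus strand. d, e and f play the roles of -d, -e and -f.
// Assumption: the exons of partial are matched against the last
// len(partial) exons of full.
func isPartialOf(partial, full []Exon, d, e, f int) bool {
	if len(partial) == 0 || len(full) <= len(partial) {
		return false // full must have more exons
	}
	offset := len(full) - len(partial)
	last := len(partial) - 1
	for i, p := range partial {
		q := full[i+offset]
		startTol, endTol := d, d
		if i == 0 {
			startTol = f // 5' end of the partial transcript
		}
		if i == last {
			endTol = e // shared 3' end
		}
		if abs(p.Start-q.Start) > startTol || abs(p.End-q.End) > endTol {
			return false
		}
	}
	return true
}

func main() {
	full := []Exon{{100, 200}, {300, 400}, {500, 600}}
	partial := []Exon{{350, 400}, {500, 605}} // starts inside the second exon
	fmt.Println(isPartialOf(partial, full, 5, 30, 5000))
	fmt.Println(isPartialOf(partial, full, 5, 30, 10))
}
```

With the default -f of 5000 the shorter transcript is treated as a partial and would be discarded; a much stricter 5' tolerance keeps it as a separate transcript.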

Example run:

collapse_partials -d 10 -e 35 -f 1000 input.gff > collapsed_output.gff

Running tests

To run the tests, the following dependencies have to be installed:

Both are easy to install using bioconda. See the Makefiles for targets that test the tools on simulated and real data.

Help

Licence and Copyright

(c) 2018 Oxford Nanopore Technologies Ltd.

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

FAQs and tips

  • The GFF2 files can be visualised using IGV.
  • The GFF2 files can be converted to GFF3 or GTF using the gffread utility.

References and Supporting Information

See the post announcing the tool in the Oxford Nanopore Technologies community.
