MaestSi / ONTrack

A MinION-based pipeline for tracking species biodiversity

ONTrack is no longer maintained. Please use the ONTrack2 pipeline instead.

ONTrack

ONTrack is a rapid and accurate MinION-based barcoding pipeline for tracking species biodiversity on site. Starting from MinION sequence reads, the ONTrack pipeline provides accurate consensus sequences in ~15 minutes per sample on a standard laptop. A preprocessing pipeline is also provided, which makes the whole bioinformatic analysis, from raw fast5 files to consensus sequences, straightforward.

Getting started

Prerequisites

  • Miniconda3. Tested with conda 4.6.11. Running which conda should return the path to the executable. If you don't have Miniconda3 installed, you can download and install it with:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod 755 Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh

Then, after completing ONTrack installation, set the MINICONDA_DIR variable in config_MinION_mobile_lab.R to the full path to miniconda3 directory.

  • Guppy, the software for basecalling and demultiplexing provided by ONT. Tested with Guppy v5.0. If you don't have Guppy installed, choose an appropriate version and install it. For example, you could download and unpack the archive with:
wget /path/to/ont-guppy-cpu_version_of_interest.tar.gz
tar -xf ont-guppy-cpu_version_of_interest.tar.gz

A directory ont-guppy-cpu should have been created in your current directory. Then, after completing ONTrack installation, set the BASECALLER_DIR variable in config_MinION_mobile_lab.R to the full path to ont-guppy-cpu/bin directory.

  • NCBI nt database (optional, in case you want to perform a local Blast analysis of your consensus sequences).

For downloading the database (~210 GB):

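mkdir NCBI_nt_db
cd NCBI_nt_db
# Record the download date, since the nt database is updated regularly
date +%Y-%m-%d > download_date.txt
# Quote the URL so that the wildcard is expanded by wget, not by the shell
wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*'
# Extract each archive, then delete it together with its checksum file
for f in *.tar.gz; do
  tar -xzvf "$f"
  rm -f "$f" "$f.md5"
done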

Then, after completing the ONTrack installation, set the NTDB variable in config_MinION_mobile_lab.R to the full path to NCBI_nt_db/nt
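Each downloaded archive comes with a matching .md5 file, which the extraction loop above deletes. Before extracting, it may be worth checking the downloads against those checksums. The following is a minimal sketch, assuming GNU md5sum is available and that the .md5 files use the usual "checksum, then file name" layout; verify_archives is a hypothetical helper name, not part of ONTrack.

```shell
# Hypothetical helper: check every archive in directory $1 against its .md5 file.
# Returns non-zero if no checksum file is found or if any archive fails the check.
verify_archives() {
  va_dir=$1
  va_status=0
  for va_md5 in "$va_dir"/*.tar.gz.md5; do
    # If the glob matched nothing, the literal pattern is left and does not exist
    [ -e "$va_md5" ] || { echo "no .md5 files found in $va_dir" >&2; return 1; }
    # md5sum -c resolves file names relative to the current directory,
    # so run it from inside the download directory (in a subshell)
    if ! (cd "$va_dir" && md5sum -c --quiet "$(basename "$va_md5")"); then
      va_status=1
    fi
  done
  return $va_status
}
```

For example, verify_archives NCBI_nt_db && echo "all archives verified" could be run from the parent directory before the extraction loop.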

Installation

git clone https://github.com/MaestSi/ONTrack.git
cd ONTrack
chmod 755 *
./install.sh

Otherwise, you can download a docker image with:

docker pull maestsi/ontrack:latest

The install.sh script creates a conda environment named ONTrack_env, in which blast, emboss, vsearch, seqtk, mafft, minimap2, samtools, nanopolish, bedtools, pycoQC and R with the Biostrings package are installed. Afterwards, open the config_MinION_mobile_lab.R file with a text editor and set the variables PIPELINE_DIR and MINICONDA_DIR to the values suggested by the installation step.
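Wrong paths in config_MinION_mobile_lab.R are an easy mistake to make, so a quick check before the first run can help. The sketch below assumes that the conda executable lives in the bin subdirectory of MINICONDA_DIR and that the Guppy bin directory contains the guppy_basecaller executable. check_dir_has is a hypothetical helper, and the two paths are placeholders to replace with your own values.

```shell
# Hypothetical helper: succeed if directory $1 contains an executable named $2.
check_dir_has() {
  if [ -x "$1/$2" ]; then
    echo "OK: $1/$2"
  else
    echo "MISSING: $1/$2" >&2
    return 1
  fi
}

# Placeholder paths: replace with the values set in config_MinION_mobile_lab.R
MINICONDA_DIR="$HOME/miniconda3"
BASECALLER_DIR="$HOME/ont-guppy-cpu/bin"

check_dir_has "$MINICONDA_DIR/bin" conda || echo "check MINICONDA_DIR"
check_dir_has "$BASECALLER_DIR" guppy_basecaller || echo "check BASECALLER_DIR"
```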

Overview

[Figure: overview of the ONTrack pipeline]

Usage

The ONTrack pipeline can be applied either starting from raw fast5 files, or from already basecalled and demultiplexed sequences. In both cases, the first step of the pipeline requires you to open the config_MinION_mobile_lab.R file with a text editor and to modify it according to the features of your sequencing experiment and your preferences. If you have already basecalled and demultiplexed your sequences, you can run the pipeline using the ONTrack.R script. Otherwise, you can run the pipeline using the Launch_MinION_mobile_lab.sh script.
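The choice between the two entry points can be sketched as follows. This assumes that raw data are recognised by the .fast5 extension and that basecalled, demultiplexed samples follow the BC<numbers>.fastq naming used for ONTrack.R inputs. suggest_entry_point is a hypothetical helper that only prints a suggested command; it is not part of the pipeline.

```shell
# Hypothetical helper: print the command that fits the content of directory $1.
suggest_entry_point() {
  sep_dir=$1
  if find "$sep_dir" -type f -name '*.fast5' | grep -q .; then
    # Raw signal files found: full pipeline from fast5 to consensus
    echo "Launch_MinION_mobile_lab.sh $sep_dir"
  elif find "$sep_dir" -maxdepth 1 -type f -name 'BC[0-9]*.fastq' | grep -q .; then
    # Already basecalled and demultiplexed reads found
    echo "Rscript ONTrack.R $sep_dir"
  else
    echo "no fast5 or BC<numbers>.fastq files found in $sep_dir" >&2
    return 1
  fi
}
```

In both cases, config_MinION_mobile_lab.R still has to be edited first, and ONTrack.R additionally needs the ONTrack_env environment to be active.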

ONTrack.R

Usage: Rscript ONTrack.R <home_dir> <fast5_dir> <sequencing_summary.txt>

Note: Activate the virtual environment with source activate ONTrack_env before running. The script is run by MinION_mobile_lab.R, but can also be run as a main script if you have already basecalled and demultiplexed your sequences. If fewer than 200 reads are available after contaminants removal, a warning message is printed, but a consensus sequence is still produced.

Inputs:

  • <home_dir>: directory containing fastq and fasta files for each sample named BC<numbers>.fast*
  • <fast5_dir>: directory containing raw fast5 files for nanopolish polishing, optional
  • <sequencing_summary.txt>: sequencing summary file generated during basecalling, used to speed up polishing, optional

Outputs (saved in <home_dir>):

  • <"sample_name".contigs.fasta>: polished consensus sequence in fasta format
  • <"sample_name".blastn.txt>: blast analysis of consensus sequence against NCBI nt database (if do_blast_flag variable is set to 1 in config_MinION_mobile_lab.R)
  • <"sample_name">: directory including intermediate files
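Before launching ONTrack.R on your own files, it can help to check what is in <home_dir>. The sketch below assumes fastq files with exactly four lines per record. For each BC<numbers>.fastq it reports the read count and whether the matching fasta file exists. check_samples and count_reads are hypothetical helpers. The 200-read threshold mentioned in the note applies after contaminants removal, so the raw count shown here is only an upper bound.

```shell
# Count reads in a fastq file, assuming 4 lines per record (no wrapped sequences).
count_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical helper: summarise each BC<numbers>.fastq found in directory $1.
check_samples() {
  for cs_fq in "$1"/BC[0-9]*.fastq; do
    [ -e "$cs_fq" ] || { echo "no BC<numbers>.fastq files in $1" >&2; return 1; }
    cs_sample=$(basename "$cs_fq" .fastq)
    cs_n=$(count_reads "$cs_fq")
    if [ -e "$1/$cs_sample.fasta" ]; then cs_fasta=present; else cs_fasta=missing; fi
    cs_note=""
    if [ "$cs_n" -lt 200 ]; then cs_note=" (fewer than 200 reads)"; fi
    echo "$cs_sample: $cs_n reads, fasta $cs_fasta$cs_note"
  done
}
```

A missing fasta file can be generated from the fastq with seqtk seq -A, which is installed in ONTrack_env.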

Launch_MinION_mobile_lab.sh

Usage: Launch_MinION_mobile_lab.sh <fast5_dir>

Note: modify config_MinION_mobile_lab.R before running; the script runs the full pipeline from raw fast5 files to consensus sequences.

Input

  • <fast5_dir>: directory containing raw fast5 files

Outputs (saved in <fast5_dir>_analysis/analysis):

  • <"sample_name".contigs.fasta>: polished consensus sequence in fasta format
  • <"sample_name".blastn.txt>: blast analysis of consensus sequence against NCBI nt database (if do_blast_flag variable is set to 1 in config_MinION_mobile_lab.R)
  • <"sample_name">: directory including intermediate files

Outputs (saved in <fast5_dir>_analysis/qc):

  • Read length distributions and pycoQC report

Outputs (saved in <fast5_dir>_analysis/basecalling):

  • Temporary files for basecalling

Outputs (saved in <fast5_dir>_analysis/preprocessing):

  • Temporary files for demultiplexing, filtering based on read length and adapter trimming
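Given the output layout above, the consensus sequences of all samples can be gathered into a single file once a run has finished. This is a minimal sketch: collect_consensus is a hypothetical helper, and it prefixes each fasta header with the sample name, since the header format written by the pipeline is not assumed here.

```shell
# Hypothetical helper: merge all <sample>.contigs.fasta files produced for
# <fast5_dir> ($1) into one fasta file ($2), tagging headers with the sample name.
collect_consensus() {
  cc_dir=${1%/}_analysis/analysis   # strip a trailing slash before appending
  cc_out=$2
  : > "$cc_out"                     # create or empty the output file
  for cc_f in "$cc_dir"/*.contigs.fasta; do
    [ -e "$cc_f" ] || { echo "no consensus files in $cc_dir" >&2; return 1; }
    cc_sample=$(basename "$cc_f" .contigs.fasta)
    awk -v s="$cc_sample" \
      '/^>/ { print ">" s " " substr($0, 2); next } { print }' "$cc_f" >> "$cc_out"
  done
}
```

For example, collect_consensus /path/to/fast5_reads all_consensus.fasta reads from /path/to/fast5_reads_analysis/analysis.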

Auxiliary scripts

In the following, auxiliary scripts run either by ONTrack.R or by Launch_MinION_mobile_lab.sh are listed. These scripts should not be called directly.

MinION_mobile_lab.R

Note: script run by Launch_MinION_mobile_lab.sh.

config_MinION_mobile_lab.R

Note: configuration script, must be modified before running Launch_MinION_mobile_lab.sh or ONTrack.R.

subsample_fast5.sh

Note: script run by MinION_mobile_lab.R if do_subsampling_flag variable is set to 1 in config_MinION_mobile_lab.R.

remove_long_short.pl

Note: script run by MinION_mobile_lab.R for removing reads shorter than mean - 2*sd and longer than mean + 2*sd.
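The length filter described above can be illustrated with a short awk sketch. This is not the Perl script itself. It assumes fastq input with four lines per record and uses the population standard deviation, which may differ from what remove_long_short.pl computes. filter_by_length is a hypothetical name.

```shell
# Keep only reads whose length lies within mean +/- 2*sd of all read lengths.
# The file is read twice: first to compute the statistics, then to filter.
filter_by_length() {
  awk '
    NR == FNR {                      # first pass: accumulate length statistics
      if (FNR % 4 == 2) { n++; len = length($0); sum += len; sumsq += len * len }
      next
    }
    FNR == 1 {                       # start of second pass: derive the bounds
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0           # guard against rounding below zero
      lo = mean - 2 * sqrt(var)
      hi = mean + 2 * sqrt(var)
    }
    FNR % 4 == 1 { header = $0 }
    FNR % 4 == 2 { bases = $0; keep = (length(bases) >= lo && length(bases) <= hi) }
    FNR % 4 == 3 { plus = $0 }
    FNR % 4 == 0 && keep { print header; print bases; print plus; print $0 }
  ' "$1" "$1"
}
```

For example, filter_by_length BC01.fastq > BC01_filtered.fastq writes the retained reads to a new file.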

decONT.sh

Note: script run by ONTrack.R for clustering reads at 70% identity and keeping only reads in the most abundant cluster, if do_clustering_flag variable is set to 1 in config_MinION_mobile_lab.R.
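To illustrate the idea of keeping only the most abundant cluster, the sketch below extracts the read IDs of the largest cluster from a clustering table in .uc format. It assumes the table was produced by vsearch, for instance with vsearch --cluster_fast reads.fasta --id 0.7 --uc clusters.uc. The commands actually used inside decONT.sh may differ, and largest_cluster_reads is a hypothetical helper.

```shell
# Print the IDs of the reads belonging to the largest cluster in a .uc file.
# Columns used: 1 = record type (S, H or C), 2 = cluster number,
# 3 = cluster size (C records only), 9 = read label.
largest_cluster_reads() {
  awk -F '\t' '
    NR == FNR {                      # first pass: find the largest cluster
      if ($1 == "C" && $3 > max) { max = $3; best = $2; found = 1 }
      next
    }
    # second pass: print the seed (S) and the hits (H) of that cluster
    found && ($1 == "S" || $1 == "H") && $2 == best { print $9 }
  ' "$1" "$1"
}
```

The resulting ID list could then be passed to seqtk subseq to pull the corresponding reads out of the fastq file.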

Checking scripts

Sanger_check.sh

Usage: Sanger_check.sh <consensus dir> <sanger dir>

Note: Activate the virtual environment with source activate ONTrack_env before running; sample name should contain the sample id (e.g. BC01)

Inputs:

  • <consensus dir>: directory containing files "sample_name".contigs.fasta obtained with the ONTrack pipeline
  • <sanger dir>: directory containing fasta files reference_"sample_name".fasta obtained with Sanger sequencing

Outputs (saved in <consensus dir>):

  • <results_"sample_name".txt>: file including alignment of MinION consensus sequence to corresponding Sanger sequence
  • <Sanger_check_report.txt>: file including overall alignment statistics and number of uncertain nucleotides in Sanger sequences

Calculate_mapping_rate.sh

Usage: Calculate_mapping_rate.sh <reads> <draft reads> <consensus sequence>

Note: Activate the virtual environment with source activate ONTrack_env before running.

Inputs:

  • <reads>: MinION reads in fastq or fasta format
  • <draft reads>: MinION reads in fastq or fasta format used for creating draft consensus sequence, after contaminants removal
  • <consensus sequence>: polished consensus sequence in fasta format

Output (saved in current directory):

  • <"sample_name"_report_mapping_rate.txt>: mapping rate statistics

Calculate_error_rate.sh

Usage: Calculate_error_rate.sh <reads> <reference>

Note: Activate the virtual environment with source activate ONTrack_env before running.

Inputs:

  • <reads>: MinION reads in fastq or fasta format
  • <reference>: Sanger sequence corresponding to MinION reads

Outputs:

  • <"sample_name"_error_rate_stats.txt>: error rate statistics

Contaminants inspection analysis

When the mapping rate of all reads from a sample falls outside the 95%-100% range, you may want either to check whether a predominant contaminant is present, or to rescue the consensus sequence of your sample, if Blast analysis shows that the consensus sequence from the most abundant cluster does not come from the species you intended to sequence. In these cases, you can retrieve the reads that don't map to your consensus sequence and run the ONTrack pipeline again on those reads alone. Remember to set the do_clustering_flag variable to 1 in the config_MinION_mobile_lab.R file. As an example, the following code retrieves the unmapped reads for sample BC01 and saves them to the contaminants_analysis folder.

SAMPLE_NAME=BC01
ANALYSIS_DIR=/path/to/fast5_reads_analysis/analysis
PIPELINE_DIR=/path/to/ONTrack

source activate ONTrack_env

cd $ANALYSIS_DIR
$PIPELINE_DIR"/Calculate_mapping_rate.sh" $SAMPLE_NAME".fastq" $SAMPLE_NAME"/"$SAMPLE_NAME"_decont.fastq" $SAMPLE_NAME".contigs.fasta"

mkdir -p contaminants_analysis

samtools view -f4 -b $SAMPLE_NAME"_reads_on_contig.bam" > "contaminants_analysis/"$SAMPLE_NAME"_unmapped.bam"
bedtools bamtofastq -i "contaminants_analysis/"$SAMPLE_NAME"_unmapped.bam" -fq "contaminants_analysis/"$SAMPLE_NAME".fastq"
seqtk seq -A "contaminants_analysis/"$SAMPLE_NAME".fastq" > "contaminants_analysis/"$SAMPLE_NAME".fasta"
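Once the unmapped reads have been extracted, a rough mapping rate can be recomputed from read counts alone. The sketch below assumes fastq files with four lines per record and that each unmapped read appears once in the extracted file. count_reads and mapping_rate_pct are hypothetical helpers, and the authoritative figures remain those reported by Calculate_mapping_rate.sh.

```shell
# Count reads in a fastq file, assuming 4 lines per record.
count_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Percentage of mapped reads, given the total ($1) and unmapped ($2) read counts.
mapping_rate_pct() {
  awk -v total="$1" -v unmapped="$2" 'BEGIN {
    if (total > 0)
      printf "%.2f\n", 100 * (total - unmapped) / total
    else
      print "NA"
  }'
}

SAMPLE_NAME=BC01
# Only report when both files exist and are non-empty in the current directory
if [ -s "$SAMPLE_NAME.fastq" ] && [ -s "contaminants_analysis/$SAMPLE_NAME.fastq" ]; then
  total=$(count_reads "$SAMPLE_NAME.fastq")
  unmapped=$(count_reads "contaminants_analysis/$SAMPLE_NAME.fastq")
  echo "$SAMPLE_NAME mapping rate: $(mapping_rate_pct "$total" "$unmapped")%"
fi
```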

Meta-barcoding analysis (experimental)

Although the ONTrack pipeline is not intended for analysing meta-barcoding samples, you might be interested in sorting out sequences coming from different species and running the ONTrack pipeline on them separately. The MetatONTrack.sh script reproduces what the EPI2ME 16S workflow does, blasting each read against an NCBI-downloaded database (e.g. 16S Bacterial), and afterwards saving sets of reads matching the different species to separate files. You can then run the ONTrack.R script on them, for obtaining a more accurate consensus sequence (set do_clustering_flag to 0 in config_MinION_mobile_lab.R). This feature is experimental, and has only been tested on a pool of 7 samples with 80% maximum sequence identity based on pairwise alignment of Sanger sequences.

MetatONTrack.sh

Usage: MetatONTrack.sh <fastq reads> <min num reads>

Note: Activate the virtual environment with source activate ONTrack_env before running. Set DB variable to an NCBI Blast-indexed database inside the script.

Inputs:

  • <fastq reads>: MinION fastq reads from a meta-barcoding experiment
  • <min num reads>: minimum number of reads supporting the identification of a species

Outputs:

  • <MetatONTrack_output>: directory containing fastq and fasta files for running the ONTrack.R script
  • <MetatONTrack_output_logs>: directory containing txt files storing read IDs corresponding to each species, a "sample_name"_Blast_species_counts.txt file storing the number of reads supporting each species, a "sample_name"_Blast_genera_counts.txt file storing the number of reads supporting each genus, and some other temporary files
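The role of <min num reads> can be illustrated with a small sketch. It assumes a hypothetical two-column table with one line per read, holding the read ID and the species of its best Blast hit separated by a tab. The actual intermediate files written by MetatONTrack.sh are not described here and may look different. species_with_support is a hypothetical helper.

```shell
# Hypothetical input ($1): one line per read, "read_id<TAB>species".
# Prints "count<TAB>species" for every species supported by at least $2 reads,
# sorted by decreasing read count.
species_with_support() {
  awk -F '\t' -v min="$2" '
    { n[$2]++ }
    END { for (s in n) if (n[s] >= min) print n[s] "\t" s }
  ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Species that fall below the threshold are dropped, which limits the number of read sets created from sporadic or spurious Blast hits.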

Citation

If this tool is useful for your work, please consider citing our manuscript.

Maestri S, Cosentino E, Paterno M, Freitag H, Garces JM, Marcolungo L, Alfano M, Njunjić I, Schilthuizen M, Slik F, Menegon M, Rossato M, Delledonne M. A Rapid and Accurate MinION-Based Workflow for Tracking Species Biodiversity in the Field. Genes. 2019; 10(6):468.

For further information and insights into pipeline development, please have a look at my doctoral thesis.

Maestri, S (2021). Development of novel bioinformatic pipelines for MinION-based DNA barcoding (Doctoral thesis, Università degli Studi di Verona, Verona, Italy). Retrieved from https://iris.univr.it/retrieve/handle/11562/1042782/205364/.

Side notes

As a real-life Pokédex, the workflow described in our manuscript will facilitate tracking biodiversity in remote and biodiversity-rich areas. For instance, during a Taxon Expedition to Borneo, our analysis confirmed the novelty of a beetle species named after Leonardo DiCaprio.

Last but not least, special thanks to Davide Canevazzi (davidecanevazzi) and Luca Marcolungo (Liukvr) for helping me out in setting up and debugging the pipeline.

License:GNU General Public License v3.0