sachingadakh / NanoCaller

Variant calling tool for long-read sequencing data

NanoCaller

NanoCaller is a computational method that integrates long reads in a deep convolutional neural network to detect SNPs and indels from long-read sequencing data. NanoCaller uses long-range haplotype structure to generate predictions for each SNP candidate site by considering pileup information from other candidate sites that share reads with it. It then phases the reads and, for each indel candidate site, locally realigns each set of phased reads as well as the set of all reads to generate indel calls, and builds consensus sequences to predict the indel sequence.
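
As a rough illustration of what "other candidate sites sharing reads" means, the toy sketch below finds, for one candidate position, the other candidate positions covered by at least one read that also covers it. This is not NanoCaller's implementation: the function name, the half-open read intervals, and the example coordinates are all assumptions made for illustration, and the real method works on pileup information rather than bare intervals.

```python
def sites_sharing_reads(target, candidates, reads):
    """Return candidate positions (other than `target`) that are covered by
    at least one read which also covers `target`.

    `reads` is a list of hypothetical (start, end) alignments, half-open.
    """
    # Reads that span the target site.
    spanning = [(start, end) for start, end in reads if start <= target < end]
    # Candidates that fall inside any of those spanning reads.
    return sorted(
        pos
        for pos in candidates
        if pos != target and any(start <= pos < end for start, end in spanning)
    )


# Toy data: three long reads and four candidate SNP sites.
reads = [(100, 5000), (3000, 12000), (20000, 30000)]
candidates = [150, 4000, 11000, 25000]

neighbours = sites_sharing_reads(4000, candidates, reads)
print(neighbours)  # [150, 11000]
```

The site at 25000 is excluded because no read covering position 4000 reaches it, so it carries no shared haplotype information for that site in this toy setup.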

NanoCaller is distributed under the MIT License by Wang Genomics Lab.

Latest Updates

v3.0.0 (June 7 2022) : A major API update with a single entry point for running NanoCaller. Major changes to the parallelization routine: GNU parallel is no longer used for whole genome variant calling.

v2.0.0 (Feb 2 2022) : A major update to the API and installation instructions, with the release of a bioconda recipe for NanoCaller. Added support for indel calling in cases of poor or non-existent phasing.

v1.0.0 (Aug 8 2021) : First post-production release with citeable DOI: DOI

v0.4.1 (Aug 3 2021) : Fixed a bug causing slower runtime in whole genome variant calling mode.

v0.4.0 (June 2 2021) : Added NanoCaller models trained on ONT reads basecalled with Guppy v4.2.2 and Bonito v0.30, as well as R10.3 reads. Added new NanoCaller models trained with long CCS reads (15-20kb library selection). Improved indel calling with rolling window for candidate selection which helps with indels in low complexity regions.

Installation

NanoCaller can be installed using Docker or Conda. The easiest way to install is from the bioconda channel:

conda install -c bioconda nanocaller

or using Docker:

VERSION="3.0.0"
docker pull genomicslab/nanocaller:${VERSION}

Please refer to Installation for instructions regarding installing NanoCaller through other methods.

Usage

General usage of NanoCaller is described in Usage. Some quick usage examples:

  • NanoCaller --bam YOU_BAM --ref YOU_REF --cpu 10 will run NanoCaller on the whole genome using 10 parallel processes.
  • NanoCaller --bam YOU_BAM --ref YOU_REF --cpu 10 --regions chr22:20000000-21000000 chr21 will run NanoCaller on chr21 and chr22:20000000-21000000 only.
  • NanoCaller --bam YOU_BAM --ref YOU_REF --cpu 10 --mode snps will only call SNPs.
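
When NanoCaller is driven from a script, the quick examples above can be assembled programmatically. The sketch below is a minimal helper that uses only the flags shown in those examples (--bam, --ref, --cpu, --regions, --mode). The function name build_nanocaller_command is hypothetical and not part of NanoCaller, and the sketch only builds the argument list without running anything.

```python
import shlex


def build_nanocaller_command(bam, ref, cpu=None, regions=None, mode=None):
    """Assemble a NanoCaller argument list (hypothetical helper, not part of NanoCaller)."""
    cmd = ["NanoCaller", "--bam", bam, "--ref", ref]
    if cpu is not None:
        cmd += ["--cpu", str(cpu)]
    if regions:
        # As in the example above, several regions follow a single --regions flag.
        cmd += ["--regions", *regions]
    if mode is not None:
        cmd += ["--mode", mode]
    return cmd


cmd = build_nanocaller_command(
    "YOU_BAM",
    "YOU_REF",
    cpu=10,
    regions=["chr22:20000000-21000000", "chr21"],
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run(cmd, check=True) once NanoCaller is installed; keeping the arguments as a list avoids shell quoting problems with file paths.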

For a comprehensive case study of variant calling on Nanopore reads, see ONT Case Study, which describes an end-to-end variant calling pipeline with NanoCaller: aligning the FASTQ files of HG002, calling variants with NanoCaller, and evaluating performance on various genomic regions.

Trained models

Trained models for ONT data, CLR data and HIFI data can be found here. These models are trained on chr1-22 of the genomes stated below, unless mentioned otherwise.

You can specify SNP and indel models using --snp_model and --indel_model parameters with a model name from tables below. For instance, if you want to use 'ONT-HG002_bonito' SNP model and 'ONT-HG002' indel model, use the following command:

NanoCaller --snp_model ONT-HG002_bonito --indel_model ONT-HG002

SNP Models

| Model Name | Sequencing Technology | Genome | Coverage | Benchmark | Basecaller |
|---|---|---|---|---|---|
| ONT-HG001 | ONT R9.4.1 | HG001 | 55 | v3.3.2 | Guppy4.2.2 |
| ONT-HG001_GP2.3.8 | ONT R9.4.1 | HG001 | 34 | v3.3.2 | Guppy2.3.8 |
| ONT-HG001_GP2.3.8-4.2.2 | ONT R9.4.1 | HG001 | 45 | v3.3.2 | Guppy (2.3.8 + 4.2.2) |
| ONT-HG001-4_GP4.2.2 | ONT R9.4.1 | HG001-4 | 69 | v3.3.2 (HG001) + v4.2.1 (HG002-4) | Guppy4.2.2 |
| ONT-HG002 | ONT R9.4.1 | HG002 | 47 | v4.2.1 | Guppy4.2.2 |
| ONT-HG002_GP4.2.2_v3.3.2 | ONT R9.4.1 | HG002 | 47 | v3.3.2 | Guppy4.2.2 |
| ONT-HG002_GP2.3.4_v3.3.2 | ONT R9.4.1 | HG002 | 53 | v3.3.2 | Guppy2.3.4 |
| ONT-HG002_GP2.3.4_v4.2.1 | ONT R9.4.1 | HG002 | 53 | v4.2.1 | Guppy2.3.4 |
| ONT-HG002_bonito | ONT R9.4.1 | HG002 (chr1-21) | 51 | v4.2.1 | Bonito v0.30 |
| ONT-HG002_r10.3 | ONT R10.3 | HG002 (chr1-21) | 32 | v4.2.1 | Guppy4.0.11 |
| CCS-HG001 | PacBio CCS | HG001 | 57 | v3.3.2 | - |
| CCS-HG002 | PacBio CCS | HG002 | 56 | v4.2.1 | - |
| CCS-HG001-4 | PacBio CCS | HG001-4 | 55 | v3.3.2 (HG001) + v4.2.1 (HG002-4) | Guppy4.2.2 |
| CLR-HG002 | PacBio CLR | HG002 | 58 | v4.2.1 | - |
| NanoCaller1 | ONT R9.4.1 | HG001 | 34 | v3.3.2 | Guppy2.3.8 |
| NanoCaller2 | ONT R9.4.1 | HG002 | 53 | v3.3.2 | Guppy2.3.4 |
| NanoCaller3 | PacBio CLR | HG003 | 28 | v3.3.2 | - |

Indel Models

| Model Name | Sequencing Technology | Genome | Coverage | Benchmark | Basecaller |
|---|---|---|---|---|---|
| ONT-HG001 | ONT R9.4.1 | HG001 | 55 | v3.3.2 | Guppy4.2.2 |
| ONT-HG002 | ONT R9.4.1 | HG002 | 47 | v4.2.1 | Guppy4.2.2 |
| CCS-HG001 | PacBio CCS | HG001 | 57 | v3.3.2 | - |
| CCS-HG002 | PacBio CCS | HG002 | 56 | v4.2.1 | - |
| NanoCaller1 | ONT R9.4.1 | HG001 | 34 | v3.3.2 | Guppy2.3.8 |
| NanoCaller3 | PacBio CCS | HG001 | 29 | v3.3.2 | - |
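
Because the SNP table lists more models than the indel table, a name that is valid for --snp_model is not necessarily valid for --indel_model (for example, ONT-HG002_bonito appears only as an SNP model). The sketch below checks a chosen pair against the names copied from the two tables above. The helper check_models is hypothetical, and an installed NanoCaller release may accept names that are not listed here.

```python
# Model names copied from the SNP and indel tables above.
SNP_MODELS = {
    "ONT-HG001", "ONT-HG001_GP2.3.8", "ONT-HG001_GP2.3.8-4.2.2",
    "ONT-HG001-4_GP4.2.2", "ONT-HG002", "ONT-HG002_GP4.2.2_v3.3.2",
    "ONT-HG002_GP2.3.4_v3.3.2", "ONT-HG002_GP2.3.4_v4.2.1",
    "ONT-HG002_bonito", "ONT-HG002_r10.3", "CCS-HG001", "CCS-HG002",
    "CCS-HG001-4", "CLR-HG002", "NanoCaller1", "NanoCaller2", "NanoCaller3",
}
INDEL_MODELS = {
    "ONT-HG001", "ONT-HG002", "CCS-HG001", "CCS-HG002",
    "NanoCaller1", "NanoCaller3",
}


def check_models(snp_model, indel_model):
    """Return a list of problems; an empty list means both names are in the tables."""
    problems = []
    if snp_model not in SNP_MODELS:
        problems.append(f"unknown SNP model: {snp_model}")
    if indel_model not in INDEL_MODELS:
        problems.append(f"unknown indel model: {indel_model}")
    return problems


print(check_models("ONT-HG002_bonito", "ONT-HG002"))   # []
print(check_models("ONT-HG002", "ONT-HG002_bonito"))   # ['unknown indel model: ONT-HG002_bonito']
```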

Citing NanoCaller

Please cite: Ahsan, M.U., Liu, Q., Fang, L. et al. NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks. Genome Biol 22, 261 (2021). https://doi.org/10.1186/s13059-021-02472-2.
