bcgsc / ntedit_sealer_protocol

Efficient targeted error resolution and automated finishing of long-read genome assemblies

ntEdit+Sealer Assembly Finishing Protocol

An automated protocol for finishing long-read genome assemblies using short reads. ntEdit polishes the draft assembly and flags erroneous regions, then Sealer fills assembly gaps and erroneous sequence regions flagged by ntEdit. The protocol is implemented as a Makefile pipeline.

[Figure: ntEdit+Sealer protocol flowchart]

Dependencies

  • GNU Make
  • Python 3
  • ntHits v0.0.1+
  • ntEdit v1.3.5+
  • ABySS v2.3.2+ (includes Sealer and ABySS-Bloom)

Installation

The ntEdit+Sealer dependencies are available from Conda:

conda install -c bioconda nthits ntedit abyss

All dependencies are also available from Homebrew:

brew install brewsci/bio/nthits ntedit abyss

This repository, which contains the Makefile pipeline and additional scripts, can be cloned from GitHub:

git clone https://github.com/bcgsc/ntedit_sealer_protocol.git

To run the protocol, ensure that all dependencies are on your PATH.
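
One way to confirm this before launching a run is to probe each tool with `command -v`. The sketch below is illustrative only: the function name `check_deps` is hypothetical, and the executable names (`ntedit`, `nthits`, `abyss-sealer`, `abyss-bloom`) are assumptions about what your installation provides, so adjust the list if yours differ.

```shell
#!/bin/sh
# Hypothetical helper: report any command that cannot be found on PATH.
# Returns 0 if every name resolves, 1 otherwise.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Assumed executable names; adjust to match your installation.
check_deps make python3 ntedit nthits abyss-sealer abyss-bloom \
    || echo "Install the missing tools or add them to PATH."
```

The check only tests that each name resolves; it does not verify the minimum versions listed under Dependencies.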

Example Command

For example, to run the pipeline on the draft long-read assembly draft-assembly.fa with the short-read files reads_1.fq.gz and reads_2.fq.gz, using k-mer lengths 80, 65 and 50 and an ABySS-Bloom Bloom filter size of 5G:

ntedit-sealer finish seqs=draft-assembly.fa reads='reads_1.fq.gz reads_2.fq.gz' k='80 65 50' b=5G

The corrected, finished assembly is written to a file with the suffix .ntedit_edited.prepd.sealer_scaffold.fa.
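
Because the pipeline may leave other files alongside the final one, it can help to locate the finished assembly by its suffix. The following is a minimal sketch; `find_finished` is a hypothetical helper name, and the only thing taken from the protocol is the suffix itself.

```shell
#!/bin/sh
# Hypothetical helper: print every finished assembly in a directory
# (default: current directory), identified by the documented suffix.
find_finished() {
    dir=${1:-.}
    for f in "$dir"/*.ntedit_edited.prepd.sealer_scaffold.fa; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

find_finished .
```

If nothing is printed, the run has probably not completed; the earlier steps would need to be checked.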

Help Page

Usage: ntedit-sealer finish [OPTION=VALUE]

General options:
seqs			Draft assembly name [seqs]. File must have .fa extension
reads			Read file(s). All files must have .fq.gz extension. Must be separated by spaces and surrounded by quotes
k			K-mer sizes. List must be descending, separated by spaces and surrounded by quotes
t			Number of threads [8]
time			If True, will log the time for each step [False]

ntEdit options:
X			Ratio of number of kmers in the k subset that should be missing in order to attempt fix (higher=stringent) [0.5]
Y			Ratio of number of kmers in the k subset that should be present to accept an edit (higher=stringent) [0.5]

ABySS-bloom options:
b			Bloom filter size (e.g. 100M)

Sealer options:
L			Length of flanks to be used as pseudoreads [100]
P			Maximum alternate paths to merge; use 'nolimit' for no limit [10]

Notes:
 - Pass all parameter list values (reads, k) as space-separated values surrounded by quotation marks, e.g. k='80 65 50'
 - Ensure that all input files are in the current working directory, making soft-links if needed
 - K-mer lengths will be used in the order they are provided. Ensure that they are sorted in descending order (largest to smallest)

Running ntedit-sealer help prints the help documentation.
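
The notes in the help page (descending k-mer order, input files in the working directory) can be checked before a run. The sketch below is an assumption-laden illustration rather than part of the protocol: `is_descending` is a hypothetical helper, it treats repeated values as an error (strictly descending), and the `ln -s` path in the comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the space-separated integers
# in $1 are in strictly descending order.
is_descending() {
    prev=
    for k in $1; do
        if [ -n "$prev" ] && [ "$k" -ge "$prev" ]; then
            return 1
        fi
        prev=$k
    done
    return 0
}

k_list='80 65 50'
if is_descending "$k_list"; then
    echo "k list OK: $k_list"
else
    echo "k list must be sorted largest to smallest: $k_list" >&2
fi

# Inputs stored elsewhere can be soft-linked into the working directory,
# e.g. (placeholder path):
#   ln -s /path/to/reads_1.fq.gz .
```

Running the check first avoids starting a long job with a k list that the pipeline would use in the wrong order.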

Citing ntEdit-Sealer, ntEdit and Sealer


Thank you for your Stars and for using, developing and promoting this free software!

If you use ntEdit-Sealer, ntEdit or Sealer in your research, please cite:

ntEdit+Sealer: Efficient Targeted Error Resolution and Automated Finishing of Long-Read Genome Assemblies

ntEdit+Sealer: Efficient targeted error resolution and automated finishing of long-read genome assemblies.
Li JX, Coombe L, Wong J, Birol I, Warren RL.
Curr. Protocols. 2022. 2:e442.

ntEdit: Scalable Genome Sequence Polishing

ntEdit: Scalable genome sequence polishing.
Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, Birol I.
Bioinformatics. 2019. Nov 1;35(21):4430-4432. doi: 10.1093/bioinformatics/btz400.

Sealer: A Scalable Gap-closing Application for Finishing Draft Genomes

Sealer: A scalable gap-closing application for finishing draft genomes.
Paulino D*, Warren RL*, Vandervalk BP, Raymond A, Jackman SD, Birol I.
BMC Bioinformatics. 2015. 16:230.

License


ntEdit-Sealer Copyright (c) 2015-2022 British Columbia Cancer Agency Branch. All rights reserved.

ntEdit and Sealer are released under the GNU General Public License v3

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

For commercial licensing options, please contact Patrick Rebstein (prebstein@bccancer.bc.ca).
