zhaokai2014 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate HiFi reads

Getting Started

# Install hifiasm (requires g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make
# Assembly
./hifiasm -o NA12878.asm -t 32 NA12878.fq.gz

Introduction

Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. Unlike most existing assemblers, hifiasm starts from an uncollapsed genome, so it can preserve as much haplotype information as possible.

For non-trio assembly, the input to hifiasm is PacBio HiFi reads in fasta/fastq format, and its outputs consist of:

  1. Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information, including somatic mutations and recurrent sequencing errors.
  2. Haplotype-resolved processed unitig graph without small bubbles (prefix.p_utg.gfa). Small bubbles may be caused by somatic mutations or noise in the data, and do not reflect real haplotype information.
  3. Primary assembly contig graph (prefix.p_ctg.gfa). This graph collapses different haplotypes.
  4. Alternate assembly contig graph (prefix.a_ctg.gfa). This graph consists of all assemblies that are discarded from the primary contig graph.
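
The GFA outputs listed above store each unitig or contig on an S (segment) line. A minimal sketch for pulling those sequences out as FASTA, assuming the segment name is in the second field and the sequence in the third; the function name gfa_to_fasta and the tiny demo file are hypothetical and only serve the illustration:

```shell
# Convert GFA segment (S) lines to FASTA.
# Field 2 is taken as the sequence name, field 3 as the sequence itself.
gfa_to_fasta() {
    awk '/^S/ { print ">" $2; print $3 }' "$1"
}

# Made-up two-segment GFA file, for demonstration only.
demo_gfa=$(mktemp)
printf 'S\tctg1\tACGT\nS\tctg2\tGGCC\nL\tctg1\t+\tctg2\t+\t0M\n' > "$demo_gfa"

# Prints the two segments as FASTA records; the L (link) line is skipped.
gfa_to_fasta "$demo_gfa"

rm -f "$demo_gfa"
```

On a real run this would be called as gfa_to_fasta prefix.p_ctg.gfa > prefix.p_ctg.fa, with prefix replaced by the value given to -o.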

For trio assembly, the input to hifiasm is PacBio HiFi reads in fasta/fastq format, together with the paternal and maternal trio indexes generated by yak count (see https://github.com/lh3/yak). The outputs consist of:

  1. Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information.

  2. Phased paternal/haplotype1 contig graph (prefix.hap1.p_ctg.gfa). This graph keeps the phased paternal/haplotype1 assembly.

  3. Phased maternal/haplotype2 contig graph (prefix.hap2.p_ctg.gfa). This graph keeps the phased maternal/haplotype2 assembly.

In addition, hifiasm outputs three binary files that save all overlap information (prefix.ec.bin, prefix.ovlp.reverse.bin, prefix.ovlp.source.bin). With these files, hifiasm can skip the time-consuming all-to-all overlap calculation and go straight to assembly. This is helpful when you want to tune an assembly over multiple runs with different parameters.
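
Before rerunning with different parameters, it can be useful to confirm that the three overlap files from an earlier run are still present. The following is a user-side sketch, not part of hifiasm; the function name has_overlap_cache is hypothetical, and the file suffixes are the ones listed above:

```shell
# Return success only if all three overlap files exist for the given prefix.
has_overlap_cache() {
    prefix=$1
    for suffix in ec.bin ovlp.reverse.bin ovlp.source.bin; do
        [ -f "$prefix.$suffix" ] || return 1
    done
    return 0
}

# NA12878.asm is the example prefix used elsewhere in this README.
if has_overlap_cache NA12878.asm; then
    echo "overlap files found: a rerun can load them from disk"
else
    echo "overlap files missing: the next run computes overlaps from scratch"
fi
```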

Hifiasm is a standalone, lightweight assembler that needs no external libraries other than zlib. For large genomes, it can generate a high-quality assembly in a few hours. Hifiasm has been tested on human, butterfly, rice and Drosophila. In particular, hifiasm is able to assemble the 26.5Gb California redwood genome in a few days. The results are as follows:

Dataset  GSize   Cov  Asm options       CPU time  Wall time  RAM   unitig/contig N50[1]
Redwood  26.5Gb  x23  -k 40 -t 64 -r 2  7274h30m  141h30m    512G  1.7Mb/1.9Mb

[1] The unitig N50 is the N50 of the assembly graph with haplotype information (i.e., with bubbles), while the contig N50 is the N50 of the haplotype-collapsed assembly (i.e., without bubbles).
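
For readers unfamiliar with the metric, N50 is the length L such that sequences of length L or longer cover at least half of the total assembly length. A minimal sketch of that calculation, assuming one sequence length per line on standard input; the function name n50 is hypothetical and this is not how hifiasm itself reports the figures in the table:

```shell
# Print the N50 of a list of sequence lengths read from stdin (one per line).
# Lengths are sorted in descending order and accumulated until the running
# sum reaches half of the total length.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                sum += len[i]
                if (sum * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Lengths 6+5 = 11 reach half of the total 20, so this prints 5.
printf '%s\n' 2 3 4 5 6 | n50
```

Lengths can be taken from a GFA file that stores sequences on its S lines, for example with awk '/^S/ { print length($3) }' prefix.p_ctg.gfa | n50.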

Usage

For HiFi read assembly, a typical command line looks like:

./hifiasm -o NA12878.asm -t 32 NA12878.fq.gz

where NA12878.fq.gz contains the input reads and -o specifies the prefix of the output files. In this example, all output files are written to NA12878.asm.*. -t specifies the number of CPU threads. On its first run, hifiasm saves all overlaps to disk so that later runs can skip the time-consuming all-to-all overlap calculation: when overlap files from a previous run are present, hifiasm loads them from disk and proceeds directly to assembly. To ignore the precomputed overlaps, specify -i.

Please note that some older HiFi reads may contain short adapters. To improve assembly quality, remove the adapters with -z as follows:

./hifiasm -o butterfly.asm -t 42 -z 20 butterfly.fq.gz

In this example, hifiasm will remove 20 bases from both ends of each read.
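
To make the effect of trimming both ends concrete, the sketch below applies the same kind of trimming to an uncompressed four-line FASTQ stream. It is an illustration only and not hifiasm's internal code; the function name trim_ends and the toy read are hypothetical:

```shell
# Drop n bases from each end of every read in a 4-line-per-record FASTQ stream.
# Even-numbered lines hold the sequence and its quality string, so both are
# trimmed identically; header and "+" lines pass through unchanged.
trim_ends() {
    awk -v n="$1" '
        NR % 2 == 0 { print substr($0, n + 1, length($0) - 2 * n); next }
        { print }'
}

# Toy 8-base read trimmed by 2 bases per end: sequence becomes CCGG, quality IIHH.
printf '@r1\nAACCGGTT\n+\nIIIIHHHH\n' | trim_ends 2
```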

For trio assembly, first generate the paternal and maternal trio indexes with yak count (see https://github.com/lh3/yak):

./yak count -k31 -b37 -t16 -o mat.yak mat.fq.gz
./yak count -k31 -b37 -t16 -o pat.yak pat.fq.gz

and then run hifiasm as follows:

./hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878_1.fq.gz NA12878_2.fq.gz

Getting Help

For a detailed description of the options, see man ./hifiasm.1. The -h option of hifiasm also prints a brief description of each option. If you have further questions, please raise an issue on the issue page.

Limitations and future work

  1. The running time and memory usage should be further reduced.

  2. The N50 should be further improved.

License: MIT License