# Install hifiasm (requiring g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make
# Assembly
./hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. Unlike most existing assemblers, hifiasm starts from an uncollapsed genome, which lets it keep as much haplotype information as possible.
For non-trio assembly, the input to hifiasm is PacBio HiFi reads in FASTA/FASTQ format, and its outputs consist of:
- Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information, including somatic mutations and recurrent sequencing errors.
- Haplotype-resolved processed unitig graph without small bubbles (prefix.p_utg.gfa). Small bubbles might be caused by somatic mutations or noise in the data, neither of which is real haplotype information.
- Primary assembly contig graph (prefix.p_ctg.gfa). This graph collapses different haplotypes.
- Alternate assembly contig graph (prefix.a_ctg.gfa). This graph consists of all assemblies that are discarded from the primary contig graph.
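Downstream tools often expect FASTA rather than GFA. A minimal sketch of extracting the sequences from any of the graphs above, assuming the standard GFA 1 layout in which each S line carries the segment name in column 2 and the sequence in column 3 (the function name gfa_to_fasta is hypothetical):

```shell
# Convert the S (segment) lines of a GFA file into FASTA records.
# Assumes tab-separated GFA 1: column 2 = segment name, column 3 = sequence.
gfa_to_fasta() {
    awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

# Example: extract the primary contigs of a run started with -o NA12878.asm
# gfa_to_fasta NA12878.asm.p_ctg.gfa > NA12878.asm.p_ctg.fa
```

Link lines and header lines are skipped, so only the sequences themselves reach the FASTA file.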
For trio assembly, the input to hifiasm is PacBio HiFi reads in FASTA/FASTQ format, together with the paternal and maternal trio indexes generated by yak count (see https://github.com/lh3/yak). The outputs consist of:
- Haplotype-resolved raw unitig graph in GFA format (prefix.r_utg.gfa). This graph keeps all haplotype information.
- Phased paternal/haplotype1 contig graph (prefix.hap1.p_ctg.gfa). This graph keeps the phased paternal/haplotype1 assembly.
- Phased maternal/haplotype2 contig graph (prefix.hap2.p_ctg.gfa). This graph keeps the phased maternal/haplotype2 assembly.
In addition, hifiasm outputs three binary files that save all overlap information (prefix.ec.bin, prefix.ovlp.reverse.bin, prefix.ovlp.source.bin). With these files, hifiasm can skip the time-consuming all-to-all overlap calculation step and proceed directly to assembly. This is helpful when you want to optimize an assembly over multiple rounds of experiments with different parameters.
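Before a rerun, it can be useful to check whether those files are present for a given prefix. A small sketch follows; the function name has_overlap_cache is hypothetical, and the sketch assumes all three files are needed for reuse, which the text above does not state explicitly:

```shell
# Succeed only if all three overlap files exist for the given output prefix.
# File suffixes are taken from the list above; requiring all three is an assumption.
has_overlap_cache() {
    prefix=$1
    for suffix in ec.bin ovlp.reverse.bin ovlp.source.bin; do
        [ -f "$prefix.$suffix" ] || return 1
    done
    return 0
}

# Example: report what a rerun with -o NA12878.asm is likely to do.
# if has_overlap_cache NA12878.asm; then
#     echo "overlap files found: a rerun can load them from disk"
# else
#     echo "overlap files missing: a rerun computes overlaps again"
# fi
```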
Hifiasm is a standalone and lightweight assembler that needs no external libraries other than zlib. For large genomes, it can generate a high-quality assembly in a few hours. Hifiasm has been tested on human, butterfly, rice and Drosophila. In particular, hifiasm is able to assemble the 26.5Gb genome of the California redwood tree in a few days. The results are as follows:
Dataset | GSize | Cov | Asm options | CPU time | Wall time | RAM | unitig/contig N50[1] |
---|---|---|---|---|---|---|---|
[Redwood] | 26.5Gb | x23 | -k 40 -t 64 -r 2 | 7274h30m | 141h30m | 512G | 1.7Mb/1.9Mb |
[1] The unitig N50 is the N50 of the assembly graph with haplotype information (i.e., with bubbles), while the contig N50 is the N50 of the haplotype-collapsed assembly (i.e., without bubbles).
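For reference, N50 is the length L such that sequences of length L or longer cover at least half of the total assembled bases. A minimal sketch of computing it from the S lines of a GFA file, assuming each S line stores the full sequence in column 3 (the function name gfa_n50 is hypothetical, and the figures in the table may have been computed with a different tool):

```shell
# Print the N50 of the segments in a GFA file.
# Assumes tab-separated S lines with the sequence itself in column 3.
gfa_n50() {
    awk -F'\t' '$1 == "S" { print length($3) }' "$1" |
        sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 if (NR == 0) exit 1          # no segments found
                 for (i = 1; i <= NR; i++) {
                     sum += len[i]            # running total, longest first
                     if (2 * sum >= total) { print len[i]; exit }
                 }
             }'
}

# Example: compare a unitig graph with the primary contig graph.
# gfa_n50 NA12878.asm.p_utg.gfa
# gfa_n50 NA12878.asm.p_ctg.gfa
```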
For HiFi read assembly, a typical command line looks like:
./hifiasm -o NA12878.asm -t 32 NA12878.fq.gz
where NA12878.fq.gz is the input reads and -o specifies the prefix of the output files. In this example, all output files can be found at NA12878.asm.*. The -t option specifies the number of CPU threads. Note that on its first run, hifiasm saves all overlaps to disk, so later runs can skip the time-consuming all-to-all overlap calculation: once the overlap information is available from a previous run, hifiasm loads it from disk and proceeds directly to assembly. If you want to ignore the pre-computed overlap information, please specify -i.
Please note that some old HiFi reads may contain short adapters. To improve the assembly quality, adapters should be removed with -z as follows:
./hifiasm -o butterfly.asm -t 42 -z 20 butterfly.fq.gz
In this example, hifiasm will remove 20 bases from both ends of each read.
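To see what such trimming does to individual reads, the effect can be imitated on a plain FASTQ stream. This is only an illustration of the behavior described above, not hifiasm's own code; the function name trim_ends is hypothetical and the sketch assumes strict four-line FASTQ records:

```shell
# Drop n characters from both ends of every sequence and quality line.
# Assumes 4-line FASTQ records, so lines 2 and 4 of each record are even-numbered.
trim_ends() {
    awk -v n="$1" 'NR % 2 == 0 { $0 = substr($0, n + 1, length($0) - 2 * n) } { print }'
}

# Example: preview the first trimmed record of the butterfly reads.
# gzip -dc butterfly.fq.gz | trim_ends 20 | head -n 4
```

Reads no longer than twice the trim length come out empty in this sketch; how hifiasm itself treats such reads is not covered here.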
For trio assembly, the paternal and maternal trio indexes should first be generated with yak count (see https://github.com/lh3/yak):
./yak count -k31 -b37 -t16 -o mat.yak mat.fq.gz
./yak count -k31 -b37 -t16 -o pat.yak pat.fq.gz
and then run hifiasm as follows:
./hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878_1.fq.gz NA12878_2.fq.gz
For a detailed description of the options, please see man ./hifiasm.1. The -h option of hifiasm also provides a brief description of the options. If you have further questions, please raise an issue at the issue page.
- The running time and memory usage should be further reduced.
- The N50 should be further improved.