cartoonist / kseqpp

Fast FASTA/Q parser and writer (C++ re-implementation of kseq library)

Is kseqpp faster than just using std::iostream?

HatakeKakaxi opened this issue

This library implements both reading and writing of FASTQ files. I wonder whether it is faster than just using std::iostream?

The short answer is "yes".

For I/O, the library relies internally on unbuffered, low-level read/write system calls and manages the underlying buffer itself. This is quite similar to what the buffered streams in <iostream> do, but it is focused on FASTA/Q parsing and writing rather than on providing generic I/O streams. This design is mainly shaped by 'zlib' and by the original 'kseq'. It also makes it easier to implement other features, such as non-blocking write operations and custom file compression algorithms (gzip is supported by default).
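To make the "raw read calls plus a self-managed buffer" idea concrete, here is a minimal sketch. It is not kseq++'s actual code: the `BufferedReader` class, its `getline` method, and the 64 KiB default capacity are hypothetical names and values chosen for illustration, and it assumes a POSIX system and a plain file descriptor (no zlib, no retry on EINTR).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>
#include <unistd.h>  // POSIX read(), ssize_t

// Hypothetical illustration only; not a class from kseq++.
class BufferedReader {
public:
  explicit BufferedReader(int fd, std::size_t capacity = 64 * 1024)
      : fd_(fd), buf_(capacity) {
    assert(fd >= 0 && capacity > 0);
  }

  // Reads one line (without the trailing '\n') into `line`.
  // Returns false once the input is exhausted and no bytes were consumed.
  bool getline(std::string& line) {
    line.clear();
    bool got_any = false;
    while (begin_ < end_ || refill()) {
      got_any = true;
      const char* start = buf_.data() + begin_;
      const std::size_t avail = end_ - begin_;
      // Scan the buffered bytes for the delimiter in one call.
      const void* nl = std::memchr(start, '\n', avail);
      if (nl != nullptr) {
        const std::size_t len =
            static_cast<std::size_t>(static_cast<const char*>(nl) - start);
        line.append(start, len);
        begin_ += len + 1;  // skip the newline itself
        return true;
      }
      // No newline yet: keep what we have and fetch the next chunk.
      line.append(start, avail);
      begin_ = end_;
    }
    return got_any;
  }

private:
  // Refills the buffer with a single read() system call.
  bool refill() {
    if (eof_) return false;
    const ssize_t n = ::read(fd_, buf_.data(), buf_.size());
    if (n <= 0) {  // end of file or error; this sketch treats both as EOF
      eof_ = true;
      return false;
    }
    begin_ = 0;
    end_ = static_cast<std::size_t>(n);
    return true;
  }

  int fd_;
  std::vector<char> buf_;
  std::size_t begin_ = 0;
  std::size_t end_ = 0;
  bool eof_ = false;
};
```

A caller would open a file with `open()`, construct `BufferedReader reader(fd);` and loop over `reader.getline(line)`. Each system call fetches a large chunk, and the per-line work is a `memchr` scan plus one `append`, without the locale and stream-state handling that a generic stream carries.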

There is a crude parser that I wrote based on <iostream> (reading the input file using ifstream) that you can find in benchmark/kseq++_bench.cpp. You can compile it by passing -DBUILD_BENCHMARKING=on to cmake. Running it on my laptop (MacBook Pro M1, 2020) using dataset_A.fa (from the dataset mentioned in the README) gives me this result:

$ ./build/benchmark/kseq++-bench /tmp/seqkit-benchmark-data/dataset_A.fa /tmp/output_A.fq
=== READ TESTS ===
[gzread] 0.342 sec
[ks_getc] 3.01 sec
[ks_getuntil] 1.68 sec
[gzgetc] 7.03 sec
[gzgets] 0.983 sec
[fgets] 1.88 sec
[kstream] 1.55 sec
[seqan] 3.65 sec
[kseq] 1.08 sec
[kseq++] 1.08 sec
[ifstream] 3.75 sec
[kseq++/read_all] 1.4 sec
[seqan/readRecords] 5.91 sec

kseq++ is ~3.5x faster than the iostream-based implementation ([kseq++] 1.08 sec vs. [ifstream] 3.75 sec), and it is as fast as kseq while providing a higher-level API.
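For readers who want a feel for what an <iostream>-based FASTA reader looks like, here is a rough sketch. It is not the code in benchmark/kseq++_bench.cpp: the `FastaStats` struct and the `count_fasta` function are hypothetical names, and the sketch only counts records and bases instead of building full records.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper types for illustration; not part of kseq++.
struct FastaStats {
  std::size_t records = 0;  // number of '>' header lines seen
  std::size_t bases = 0;    // total length of all sequence lines
};

// Counts FASTA records and bases using std::getline on a generic stream.
FastaStats count_fasta(std::istream& in) {
  FastaStats stats;
  std::string line;
  while (std::getline(in, line)) {
    // Tolerate Windows line endings.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    if (line.empty()) continue;
    if (line[0] == '>') {
      ++stats.records;
    } else {
      stats.bases += line.size();
    }
  }
  return stats;
}
```

It can be called on a `std::ifstream` opened on a FASTA file, or on a `std::istringstream` for a quick check. Timing a function like this against the library on the same input is one way to reproduce the kind of comparison shown above, though the numbers will depend on the machine, the compiler flags, and the dataset.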

By the way, if you find any issues in the benchmark implementations, I would really appreciate it if you reported them.