rob-p / kseqpp

Fast FASTA/Q parser and writer (C++ re-implementation of kseq library)

kseq++

kseq++ is a C++11 re-implementation of kseq by Heng Li. The goal of the re-implementation is to provide a better API and better resource management while preserving kseq's flexibility and performance. Like the original kseq, this parser is based on a generic stream buffer and works with different file types. However, instead of C macros, it uses C++ templates. The RAII-style class KStream is the main class; it can be constructed with the make_kstream family of functions or by calling its constructor directly (C++17). It takes the file object/pointer (which can be of any type), the corresponding read/write function, and the opening mode (mode::in or mode::out). In contrast to kseq, there is no need to specify the types, since they are inferred by the compiler. Each record is stored in a KSeq object.
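As a rough illustration of how templates let the compiler infer the file and function types, the sketch below wraps an arbitrary handle and read function behind a small buffered line reader. The names MiniStream, make_ministream, MemSource and mem_read are hypothetical and are not part of kseq++; the real KStream is considerably more elaborate. The only assumption about the read function is that it has the same shape as gzread, i.e. it takes the handle, a buffer and a length, and returns the number of bytes read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical buffered reader: TFile is any handle type, TRead any callable
// shaped like `int read(TFile, void* buf, unsigned len)` (as gzread is).
template <typename TFile, typename TRead>
class MiniStream {
 public:
  MiniStream(TFile f, TRead r) : file_(std::move(f)), read_(std::move(r)) {}

  // Reads one line (without the trailing '\n'); returns false at end of input.
  bool getline(std::string& line) {
    line.clear();
    bool got_any = false;
    for (;;) {
      if (begin_ == end_) {  // buffer exhausted: refill through read_
        int n = read_(file_, buf_, static_cast<unsigned>(sizeof(buf_)));
        if (n <= 0) return got_any;
        begin_ = 0;
        end_ = static_cast<std::size_t>(n);
      }
      got_any = true;
      char c = buf_[begin_++];
      if (c == '\n') return true;
      line.push_back(c);
    }
  }

 private:
  TFile file_;
  TRead read_;
  char buf_[16];  // deliberately tiny so that refills are exercised
  std::size_t begin_ = 0, end_ = 0;
};

// Factory function: both template arguments are deduced from the call.
template <typename TFile, typename TRead>
MiniStream<TFile, TRead> make_ministream(TFile f, TRead r) {
  return MiniStream<TFile, TRead>(std::move(f), std::move(r));
}

// In-memory stand-in for a file, so the sketch needs no external library.
struct MemSource {
  const char* data;
  std::size_t size;
  std::size_t pos;
};

int mem_read(MemSource* src, void* buf, unsigned len) {
  assert(src->pos <= src->size);
  std::size_t n = src->size - src->pos;
  if (n > len) n = len;
  std::memcpy(buf, src->data + src->pos, n);
  src->pos += n;
  return static_cast<int>(n);
}
```

With these definitions, `auto ms = make_ministream(&src, mem_read);` compiles without naming either type, which is the same convenience that make_kstream offers over kseq's macro-based declarations.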

It inherits all features from kseq (quoting from kseq homepage):

  • Parse both FASTA and FASTQ format, and even a mixture of FASTA and FASTQ records in one file.
  • Seamlessly adapt to gzipped compressed file when used with zlib.
  • Support multi-line FASTQ.
  • Work on a stream with an internal stream buffer.

In addition, it provides:

  • simpler and more readable API
  • RAII-style memory management
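To make the first and third features concrete, here is a minimal sketch of a reader that accepts a mixture of FASTA and FASTQ records and tolerates multi-line sequences and qualities. It is not the kseq++ parser: it works on a std::istream rather than on a generic stream buffer, and Record and read_record are hypothetical names introduced for illustration only.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record type; kseq++ stores records in its own KSeq class.
struct Record {
  std::string name, comment, seq, qual;
};

// Reads the next FASTA or FASTQ record from `in`; returns false when no
// complete record can be read. Sequence and quality may span several lines.
bool read_record(std::istream& in, Record& rec) {
  const int eof = std::istream::traits_type::eof();
  std::string line;
  // Skip anything before the next header line ('>' for FASTA, '@' for FASTQ).
  while (std::getline(in, line)) {
    if (!line.empty() && (line[0] == '>' || line[0] == '@')) break;
  }
  if (!in) return false;

  rec = Record{};
  // Header: the name runs up to the first blank, the rest is the comment.
  std::size_t sp = line.find_first_of(" \t");
  rec.name = line.substr(1, sp == std::string::npos ? sp : sp - 1);
  if (sp != std::string::npos) rec.comment = line.substr(sp + 1);

  // Sequence lines run until '+' (FASTQ separator) or the next header.
  for (int c = in.peek(); c != eof && c != '>' && c != '@' && c != '+';
       c = in.peek()) {
    std::getline(in, line);
    rec.seq += line;
  }
  if (in.peek() != '+') return true;  // FASTA record: no quality string

  std::getline(in, line);  // consume the '+' line
  // Quality lines run until the quality is as long as the sequence; this
  // length check is what allows '@' or '>' to appear inside quality strings.
  while (rec.qual.size() < rec.seq.size() && std::getline(in, line)) {
    rec.qual += line;
  }
  return rec.qual.size() == rec.seq.size();
}
```

Calling read_record in a loop on a std::istringstream or std::ifstream yields one record per call, with an empty qual for FASTA records.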

The library also comes with a FASTA/Q writer. Like the reader, it handles mixed multi-line FASTA and FASTQ records and gzip compression. The writer is multi-threaded: the actual write function is called in a separate thread in order to hide the I/O latency.
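The idea of hiding I/O latency behind a second thread can be sketched with two buffers: the caller fills one while a worker thread writes the other. The class below is only an illustration of that technique under the assumption of a single producer thread; AsyncWriter and its members are hypothetical and do not reflect how kseq++ implements its writer.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical writer that hands full buffers to a background thread, so the
// caller can keep formatting records while the slow write call is running.
// Assumes put() and flush() are only ever called from one thread.
class AsyncWriter {
 public:
  using Sink = std::function<void(const std::string&)>;

  explicit AsyncWriter(Sink sink, std::size_t capacity = 64)
      : sink_(std::move(sink)), capacity_(capacity),
        worker_(&AsyncWriter::run, this) {}

  ~AsyncWriter() {
    flush();  // hand over whatever is left
    {
      std::lock_guard<std::mutex> lock(mtx_);
      done_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  void put(const std::string& data) {
    front_ += data;
    if (front_.size() >= capacity_) flush();
  }

  // Hands the front buffer to the worker, waiting if it is still busy.
  void flush() {
    if (front_.empty()) return;
    std::unique_lock<std::mutex> lock(mtx_);
    cv_.wait(lock, [this] { return !pending_; });
    back_.swap(front_);  // back_ is empty here, so front_ becomes empty
    pending_ = true;
    lock.unlock();
    cv_.notify_all();
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lock(mtx_);
    for (;;) {
      cv_.wait(lock, [this] { return pending_ || done_; });
      if (!pending_) return;  // done_ is set and nothing is left to write
      std::string chunk;
      chunk.swap(back_);
      lock.unlock();
      sink_(chunk);  // the slow call happens outside the lock
      lock.lock();
      pending_ = false;
      cv_.notify_all();
    }
  }

  Sink sink_;
  std::size_t capacity_;
  std::string front_, back_;
  std::mutex mtx_;
  std::condition_variable cv_;
  bool pending_ = false;
  bool done_ = false;
  std::thread worker_;  // declared last so all other members exist first
};
```

The sink can be any callable, for example a lambda that forwards to write or gzwrite. Chunks reach the sink in the order they were handed over, because there is only one worker and one pending buffer at a time.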

Higher-level API

Apart from the KStream class, this library provides another level of abstraction that hides most of the details and offers a very simple API on top of KStream for working with sequence files: SeqStreamIn for reading and SeqStreamOut for writing a sequence file. To avoid imposing unwanted external dependencies (e.g. zlib) on users of the core library, the SeqStream classes are defined in a separate header file (seqio.h).

Reading a sequence file

These examples read FASTA/Q records one by one from either a compressed or an uncompressed file.

Using SeqStreamIn:

#include <iostream>
#include "seqio.h"

using namespace klibpp;

int main(int argc, char* argv[])
{
  KSeq record;
  SeqStreamIn iss("file.dat");
  while (iss >> record) {
    std::cout << record.name << std::endl;
    if (!record.comment.empty()) std::cout << record.comment << std::endl;
    std::cout << record.seq << std::endl;
    if (!record.qual.empty()) std::cout << record.qual << std::endl;
  }
}

Using KStream:

#include <iostream>
#include <zlib.h>
#include "kseq++.h"

using namespace klibpp;

int main(int argc, char* argv[])
{
  KSeq record;
  gzFile fp = gzopen("file.dat", "r");
  auto ks = make_kstream(fp, gzread, mode::in);
  // auto ks = KStream(fp, gzread, mode::in);  // C++17
  // auto ks = KStreamIn(fp, gzread);  // C++17
  while (ks >> record) {
    std::cout << record.name << std::endl;
    if (!record.comment.empty()) std::cout << record.comment << std::endl;
    std::cout << record.seq << std::endl;
    if (!record.qual.empty()) std::cout << record.qual << std::endl;
  }
  gzclose(fp);
}

Alternatively, records can be fetched in chunks and stored in a std::vector<KSeq>.

Using SeqStreamIn:

#include <iostream>
#include "seqio.h"

using namespace klibpp;

int main(int argc, char* argv[])
{
  SeqStreamIn iss("file.dat");
  auto records = iss.read();
  // auto records = iss.read(100);  // read a chunk of 100 records
}

Using KStream:

#include <iostream>
#include <zlib.h>
#include "kseq++.h"

using namespace klibpp;

int main(int argc, char* argv[])
{
  gzFile fp = gzopen("file.dat", "r");
  auto ks = make_ikstream(fp, gzread);
  auto records = ks.read();  // fetch all the records
  // auto records = ks.read(100);  // read a chunk of 100 records
  gzclose(fp);
}

Writing a sequence file

These examples write FASTA/Q records to an uncompressed file.

Using SeqStreamOut:

#include <iostream>
#include "seqio.h"

using namespace klibpp;

int main(int argc, char* argv[])
{
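  SeqStreamIn iss("input.dat");  // hypothetical source of the records to write
  auto records = iss.read();
  SeqStreamOut oss("file.dat");
  for (KSeq const& r : records) oss << r;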
}

Using KStream:

#include <iostream>
#include <fcntl.h>   // open, O_WRONLY
#include <unistd.h>  // write, close
#include "kseq++.h"

using namespace klibpp;

int main(int argc, char* argv[])
{
  int fd = open("file.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);  // create or truncate the output file
  auto ks = make_kstream(fd, write, mode::out);
  // auto ks = KStreamOut(fd, write);  // C++ 17
  // ...
  for (KSeq const& r : records) ks << r;
  ks << kend;
  close(fd);
}

When a record is written to a file, the sequence and quality strings can be wrapped at a fixed line length. The default wrapping length is 60 bp and can be customised with the KStream::set_wraplen method.
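For readers unfamiliar with line wrapping in FASTA/Q files, the following sketch shows what wrapping at a given length amounts to. The function wrap is a hypothetical helper written for this illustration; it is not the routine that KStream uses internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: breaks `s` into lines of at most `width` characters,
// each terminated by '\n'. An empty string yields no lines at all.
std::string wrap(const std::string& s, std::size_t width = 60) {
  assert(width > 0);
  std::string out;
  out.reserve(s.size() + s.size() / width + 1);
  for (std::size_t pos = 0; pos < s.size(); pos += width) {
    out.append(s, pos, width);  // copies min(width, remaining) characters
    out.push_back('\n');
  }
  return out;
}
```

For example, wrap("ACGTACGTAC", 4) produces the three lines ACGT, ACGT and AC.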


NOTE

The buffer is flushed to the file when the KStream object goes out of scope. If the file is closed before that happens, as in the KStream example above, ks << kend must be called before closing the file to ensure that no data is lost.

There is no need to write kend to the stream if using SeqStreamOut.
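The flush-on-destruction behaviour described in this note is the usual RAII pattern for buffered output. A minimal sketch of the pattern, using a hypothetical BufferedOut class rather than KStream itself, could look like this:

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical buffered writer: data reaches the sink only on flush(), which
// the destructor calls automatically when the object goes out of scope.
class BufferedOut {
 public:
  explicit BufferedOut(std::function<void(const std::string&)> sink)
      : sink_(std::move(sink)) {}
  ~BufferedOut() { flush(); }

  // Copying would make two objects flush the same data, so it is disabled.
  BufferedOut(const BufferedOut&) = delete;
  BufferedOut& operator=(const BufferedOut&) = delete;

  void put(const std::string& s) { buf_ += s; }

  // Explicit flush, analogous in spirit to writing kend to a KStream.
  void flush() {
    if (buf_.empty()) return;
    sink_(buf_);
    buf_.clear();
  }

 private:
  std::function<void(const std::string&)> sink_;
  std::string buf_;
};
```

If the underlying file is closed while such an object is still alive, the destructor's flush arrives too late, which is why an explicit flush is needed before closing the file in that case.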


Formatting

The default behaviour is to write a record in FASTQ format if it has quality information. Otherwise, i.e. when the quality string is empty, the record will be written in FASTA format. So, the output might be a mixture of FASTQ and FASTA records. However, the output format can be forced by using format::fasta and format::fastq modifiers. For example:

out << format::fasta << fastq_record;
out << another_record;  // all other calls after this will also be in FASTA format.

will write a FASTQ record in FASTA format. These modifiers affect all writes after them until another modifier is used. The format::mix modifier reverts the behaviour to default.


NOTE

Writing a FASTA record in FASTQ format throws an exception unless the record is empty (a record with empty sequence and quality string).
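The formatting rules above can be summarised in a few lines of code. The sketch below is an assumption-laden model of those rules, not the library's implementation: SeqRecord, Format and format_record are hypothetical names, line wrapping is ignored, and std::invalid_argument merely stands in for whatever exception type kseq++ actually throws.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical types; the real library uses KSeq and the format:: modifiers.
struct SeqRecord {
  std::string name, seq, qual;
};
enum class Format { mix, fasta, fastq };

// Renders a record as text according to the rules described above.
std::string format_record(const SeqRecord& r, Format fmt = Format::mix) {
  // Default (mix): FASTQ if the record carries qualities, FASTA otherwise.
  bool as_fastq = (fmt == Format::fastq) ||
                  (fmt == Format::mix && !r.qual.empty());
  // A non-empty record without qualities cannot be written as FASTQ.
  if (as_fastq && r.qual.empty() && !r.seq.empty())
    throw std::invalid_argument("record has no quality string");
  if (as_fastq) return "@" + r.name + "\n" + r.seq + "\n+\n" + r.qual + "\n";
  return ">" + r.name + "\n" + r.seq + "\n";  // qualities are dropped
}
```

Forcing Format::fasta on a FASTQ record simply discards the quality string, whereas forcing Format::fastq on a FASTA record with a non-empty sequence is an error.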


Benchmark

Datasets

For this benchmark, I re-used the sequence files from the SeqKit benchmark: seqkit-benchmark-data.tar.gz

file          format  type  num_seqs   sum_len        min_len  avg_len       max_len
dataset_A.fa  FASTA   DNA   67,748     2,807,643,808  56       41,442.5      5,976,145
dataset_B.fa  FASTA   DNA   194        3,099,750,718  970      15,978,096.5  248,956,422
dataset_C.fq  FASTQ   DNA   9,186,045  918,604,500    100      100           100

Platform

  • CPU: Intel® Xeon® CPU E3-1241 v3 @ 3.50GHz, 4 cores, 8 threads
  • RAM: DDR3 1600 MHz, 16352 MB
  • HDD: Seagate Desktop HDD 500GB, 16MB Cache, SATA-3
  • OS: Debian GNU/Linux 9.4 (stretch), Linux 4.9.91-1-amd64-smp
  • Compiler: GCC 6.3.0, compiled with optimisation level 3 (-O3)

Result (for kseq++ v0.1.4)

Reading all records

file          kseq++  kseq    SeqAn   kseq++/read*  SeqAn/readRecords**
dataset_A.fa  2.35 s  2.5 s   2.92 s  3.52 s        4.94 s
dataset_B.fa  2.66 s  2.8 s   3.34 s  3.74 s        9.82 s
dataset_C.fq  2.56 s  2.46 s  2.66 s  4.56 s        11.8 s

* storing all records in std::vector.

** storing all records in seqan::StringSet< seqan::CharString >.

Writing all records

file          kseq++/plain  kseq++/gzipped  SeqAn/plain
dataset_A.fa  2.3 s         866 s           2.29 s
dataset_B.fa  2.19 s        849 s           2.33 s
dataset_C.fq  1.94 s        365 s           2.24 s

License: MIT License

