Question: DNA string compressing

jianshu93 opened this issue 2 years ago

Dear seqtk author,

I am writing to ask whether DNA strings over {A,G,C,T} (and possibly N) can be stored in compressed form when sequences are read and kept in memory, for example as kmers or minimizers. By default a character in C takes 1 byte, but a valid DNA base has only 4 possible values rather than 256 (2^8), so each base can be encoded in 2 bits, 1/4 the size of a regular character. Is this already implemented in kseq.h? I ask because many kmer counting and minimizer counting tools are built on kseq.h. With a huge number of kmers or minimizers, the difference in memory consumption between 2 bits and 1 byte per base could be substantial.

Thanks,

Jianshu

Heng Li · Answer 1 · Sat Sep 03 2022 11:17:49 GMT+0800 (China Standard Time)

Read the source code of kmer counters (e.g. this) to see how this is handled.
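
For illustration only, here is a minimal sketch of the 2-bit packing idea described in the question. It is not taken from kseq.h or from any particular kmer counter; kseq.h itself is a FASTA/FASTQ parser that hands back sequences as ordinary char strings, so packing like this would be done by the tool that consumes them. The function names (base_to_code, pack_kmer, unpack_kmer) and the A=0, C=1, G=2, T=3 mapping are assumptions made for this example. Because 2 bits cannot represent N, the sketch simply rejects any kmer containing a non-ACGT base.

```c
#include <stdint.h>

/* Hypothetical helper: map a base to a 2-bit code (A=0, C=1, G=2, T=3).
 * Any other character, including N, returns 4 to signal "not encodable". */
static unsigned base_to_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 4;
    }
}

/* Pack a kmer of length k (1 <= k <= 32) into one 64-bit word.
 * The first base ends up in the most significant used bits.
 * Returns 0 on success, -1 if k is out of range or a base is not A/C/G/T. */
static int pack_kmer(const char *seq, int k, uint64_t *out)
{
    uint64_t x = 0;
    if (k < 1 || k > 32) return -1;
    for (int i = 0; i < k; ++i) {
        unsigned c = base_to_code(seq[i]);
        if (c > 3) return -1;   /* N or other ambiguous base */
        x = (x << 2) | c;       /* append 2 bits per base */
    }
    *out = x;
    return 0;
}

/* Decode a packed kmer back into text; buf must hold at least k + 1 bytes. */
static void unpack_kmer(uint64_t x, int k, char *buf)
{
    for (int i = k - 1; i >= 0; --i) {
        buf[i] = "ACGT"[x & 3]; /* lowest 2 bits hold the last base */
        x >>= 2;
    }
    buf[k] = '\0';
}
```

With this layout a kmer of up to 32 bases fits in a single uint64_t (8 bytes) instead of up to 32 bytes of characters, which is the 4x saving the question refers to. Longer sequences would need an array of words, and a real tool would also have to decide how to handle N, for example by skipping any kmer that contains one.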