lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats

Question: DNA string compressing

jianshu93 opened this issue

Dear seqtk author,

I am writing to ask whether DNA strings over {A,G,C,T} (and possibly N) can be compressed when sequences, k-mers, or minimizers are read and stored in memory. By default a character in C takes 1 byte, but a valid DNA base has only 4 possibilities instead of 256 (2^8), so each base can be packed into 2 bits, 1/4 the size of a regular character. Is this already implemented in kseq.h? I ask because I have seen many k-mer counting and minimizer counting tools that are based on kseq.h. When there is a huge number of k-mers or minimizers, the difference in memory consumption between 2 bits and 1 byte per base could be huge.

Thanks,

Jianshu

Read the source code of kmer counters (e.g. this) to see how this is handled.
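
As a rough illustration of the 2-bit idea discussed in the question, here is a minimal sketch in C. It is not taken from seqtk, kseq.h, or any particular k-mer counter; the names `base_to_2bit`, `pack_kmer`, and `unpack_kmer` are hypothetical, and the A=0, C=1, G=2, T=3 mapping is just one common convention. Since 2 bits cannot represent N, this sketch simply rejects any k-mer that contains an ambiguous base, which is one possible way to handle it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a base to a 2-bit code (A=0, C=1, G=2, T=3).
 * Anything else, including N, maps to 4 to signal "not encodable". */
static inline int base_to_2bit(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return 4;
    }
}

/* Hypothetical helper: pack the k bases starting at s into one 64-bit
 * integer, 2 bits per base, first base in the highest-order position.
 * Requires 1 <= k <= 32. Returns 0 on success, or -1 if the k-mer
 * contains a base other than A/C/G/T (in which case *out is untouched). */
static int pack_kmer(const char *s, int k, uint64_t *out)
{
    uint64_t x = 0;
    assert(k >= 1 && k <= 32);
    for (int i = 0; i < k; ++i) {
        int c = base_to_2bit(s[i]);
        if (c > 3) return -1;          /* ambiguous base such as N */
        x = (x << 2) | (uint64_t)c;    /* append 2 bits on the right */
    }
    *out = x;
    return 0;
}

/* Hypothetical helper: decode a packed k-mer back into text.
 * buf must have room for k + 1 bytes (k bases plus the terminator). */
static void unpack_kmer(uint64_t x, int k, char *buf)
{
    for (int i = k - 1; i >= 0; --i) {
        buf[i] = "ACGT"[x & 3];        /* lowest 2 bits hold the last base */
        x >>= 2;
    }
    buf[k] = '\0';
}
```

For example, `pack_kmer("ACGT", 4, &x)` stores 0x1B (binary 00 01 10 11) in `x`, and `unpack_kmer(x, 4, buf)` recovers "ACGT". With this layout a k-mer of up to 32 bases fits in 8 bytes instead of up to 32, which is the 4x saving the question describes.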