K-mer counting builds a histogram of the length-k substrings (k-mers) that occur in an input string S. Algorithmically this looks like a simple task, but modern sequencing machines generate very large amounts of data in a short time, which turns even a simple histogram into a performance challenge. In recent years, the performance of k-mer counting algorithms has improved significantly, and there has been much interest in using graphics processing units (GPUs) for the task. The purpose of this research is to analyze different algorithms for counting k-mer occurrences in a sequence under different k settings, and then to optimize and speed up one of those algorithms using GPUs.
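To make the definition concrete, here is a minimal sequential sketch of k-mer counting in C. It is not taken from this repository: the names `base_code`, `pack_kmer` and `count_kmers` are hypothetical, and it assumes a DNA alphabet packed at 2 bits per base with a directly indexed table of 4^k counters. That table is only practical for small k (the sketch caps it at 12); for larger k such as 32, counters typically use a hash table instead.

```c
#include <stdint.h>
#include <stdlib.h>

/* Map a nucleotide to a 2-bit code; return -1 for anything else (e.g. 'N'). */
static int base_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default:            return -1;
    }
}

/* Pack the first k bases of kmer into an integer (2 bits per base).
 * Assumes all k characters are valid nucleotides. */
uint64_t pack_kmer(const char *kmer, int k)
{
    uint64_t x = 0;
    for (int i = 0; i < k; ++i)
        x = (x << 2) | (uint64_t)base_code(kmer[i]);
    return x;
}

/* Count every k-mer of seq into a newly allocated table of 4^k counters,
 * indexed by the packed k-mer. Windows containing a non-ACGT character
 * are skipped. Returns NULL if k is outside 1..12 or allocation fails.
 * The caller frees the table. */
uint32_t *count_kmers(const char *seq, int k)
{
    if (k < 1 || k > 12)
        return NULL;

    size_t n_slots = (size_t)1 << (2 * k);   /* 4^k possible k-mers */
    uint64_t mask = (uint64_t)n_slots - 1;   /* keeps the last k bases */
    uint32_t *counts = calloc(n_slots, sizeof *counts);
    if (counts == NULL)
        return NULL;

    uint64_t kmer = 0;  /* rolling 2-bit encoding of the current window */
    int valid = 0;      /* consecutive valid bases seen, capped at k */

    for (size_t i = 0; seq[i] != '\0'; ++i) {
        int c = base_code(seq[i]);
        if (c < 0) {            /* ambiguous base: restart the window */
            valid = 0;
            kmer = 0;
            continue;
        }
        kmer = ((kmer << 2) | (uint64_t)c) & mask;
        if (valid < k)
            ++valid;
        if (valid == k)
            counts[kmer]++;
    }
    return counts;
}
```

For example, after `uint32_t *c = count_kmers("ACGTACGT", 3);` the count for `ACG` can be read as `c[pack_kmer("ACG", 3)]`. The rolling update touches each input character once, which is why the work per base stays constant regardless of k.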
Source code repository: lh3/kmer-cnt
This repository contains the source code of all the different implementations that were used and experimented with for this research.
git clone https://github.com/im-mou/gpu-kmer-counter
cd gpu-kmer-counter
make
wget https://github.com/lh3/kmer-cnt/releases/download/v0.1/M_abscessus_HiSeq_10M.fa.gz
./parse-data ./M_abscessus_HiSeq_10M.fa.gz
By default, all of these scripts run the code with a k-mer length of 32. To experiment with a different k, edit the corresponding Slurm file and uncomment the line with the desired k-mer length.
sbatch ./slurm_scripts/slurm-kc-c1-fast.sub
sbatch ./slurm_scripts/slurm-cuda-fast.sub
sbatch ./slurm_scripts/slurm-cuda-dumb.sub
The two final implementations that work properly and produce correct output, one sequential and one parallel, are:
- kc-c1-fast.c
- cuda-fast.cu