Improve k-mer filter
johnlees opened this issue · comments
Improving on #4. This is left for future work, for now.
The current general hash table approach is a reasonable first trade-off between accuracy (it's totally accurate), fast (it's quite slow, but could be worse) and memory use (~1Gb -> will fit on a raspberry pi). As this filtering is the main resource user for read sketching, but changing it will not change sketch results, this is a potential area for future improvement.
Possible improvements include:
- Having different filter versions which can be changed depending on available resources
- For speed but more memory, a minimum perfect hash function could be tried (probably needs k limited < 31?, and some loss in accuracy if collision detection isn't work it. Masking the first half of the 64-bit hash would be one option). Discussion on gramtools suggest this may not be the best.
- Using sparse_map for memory restricted applications
- Using an approximate filter, e.g. khmer's count-min or squeakr's CQF
Other references:
- A rolling hash (like the one already included) is probably effective (see https://www.biostars.org/p/8219/)
- There is some useful discussion here: iqbal-lab-org/gramtools#71
- And the jellyfish paper: https://academic.oup.com/bioinformatics/article/27/6/764/234905
nthash may simplify this.
It might even be worth trying this instead of rollinghash in the main function, but this choice needs to be finalized now as it will change sketch results
nthash (6a4a4a2) looks to be slightly quicker on both tests, and has similar memory use. This might be due to by poor seqio.hpp move code for the next and out, or just that they integrated all of this with c strings anyway.
But whatever the case I will switch to this, especially as:
- Any speed boost is welcome
- The code becomes a lot simpler
- Multiple hashes can be generated trivially and efficiently, which could come in handy with a future filter replacement
rollinghash
Command being timed: "./sketch_test 12754_5#71 12754_5#71.contigs_velvet.fa 12754_5#72 12754_5#72.contigs_velvet.fa rfiles.txt"
User time (seconds): 73.21
System time (seconds): 0.69
Percent of CPU this job got: 363%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.35
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 48844
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 34150
Voluntary context switches: 1307
Involuntary context switches: 177
Swaps: 0
File system inputs: 0
File system outputs: 39040
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Command being timed: "./read_test 12673_8_26 12673_8#26.fastq.gz 12754_4_66 12754_4#66_1.fastq.gz 12754_4#66_2.fastq.gz"
User time (seconds): 316.31
System time (seconds): 9.38
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:26.13
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1522004
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1573477
Voluntary context switches: 47
Involuntary context switches: 334
Swaps: 0
File system inputs: 392920
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
nthash
Command being timed: "./sketch_test 12754_5#71 12754_5#71.contigs_velvet.fa 12754_5#72 12754_5#72.contigs_velvet.fa rfiles.txt"
User time (seconds): 64.84
System time (seconds): 0.54
Percent of CPU this job got: 363%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.99
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 49308
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 34260
Voluntary context switches: 1786
Involuntary context switches: 204
Swaps: 0
File system inputs: 0
File system outputs: 39104
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Command being timed: "./read_test 12673_8_26 12673_8#26.fastq.gz 12754_4_66 12754_4#66_1.fastq.gz 12754_4#66_2.fastq.gz"
User time (seconds): 252.22
System time (seconds): 10.66
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:23.58
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1522152
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1569866
Voluntary context switches: 63
Involuntary context switches: 180
Swaps: 0
File system inputs: 1606592
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0