bacpop / pp-sketchlib

Library of sketching functions used by PopPUNK

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Improve k-mer filter

johnlees opened this issue · comments

Improving on #4. This is left for future work, for now.

The current general hash table approach is a reasonable first trade-off between accuracy (it's totally accurate), fast (it's quite slow, but could be worse) and memory use (~1Gb -> will fit on a raspberry pi). As this filtering is the main resource user for read sketching, but changing it will not change sketch results, this is a potential area for future improvement.

Possible improvements include:

  • Having different filter versions which can be changed depending on available resources
  • For speed but more memory, a minimum perfect hash function could be tried (probably needs k limited < 31?, and some loss in accuracy if collision detection isn't work it. Masking the first half of the 64-bit hash would be one option). Discussion on gramtools suggest this may not be the best.
  • Using sparse_map for memory restricted applications
  • Using an approximate filter, e.g. khmer's count-min or squeakr's CQF

Other references:

nthash may simplify this.

It might even be worth trying this instead of rollinghash in the main function, but this choice needs to be finalized now as it will change sketch results

nthash (6a4a4a2) looks to be slightly quicker on both tests, and has similar memory use. This might be due to by poor seqio.hpp move code for the next and out, or just that they integrated all of this with c strings anyway.

But whatever the case I will switch to this, especially as:

  • Any speed boost is welcome
  • The code becomes a lot simpler
  • Multiple hashes can be generated trivially and efficiently, which could come in handy with a future filter replacement
        rollinghash
        
        
        Command being timed: "./sketch_test 12754_5#71 12754_5#71.contigs_velvet.fa 12754_5#72 12754_5#72.contigs_velvet.fa rfiles.txt"
        User time (seconds): 73.21
        System time (seconds): 0.69
        Percent of CPU this job got: 363%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.35
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 48844
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 34150
        Voluntary context switches: 1307
        Involuntary context switches: 177
        Swaps: 0
        File system inputs: 0
        File system outputs: 39040
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
        
        Command being timed: "./read_test 12673_8_26 12673_8#26.fastq.gz 12754_4_66 12754_4#66_1.fastq.gz 12754_4#66_2.fastq.gz"
        User time (seconds): 316.31
        System time (seconds): 9.38
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 5:26.13
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1522004
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1573477
        Voluntary context switches: 47
        Involuntary context switches: 334
        Swaps: 0
        File system inputs: 392920
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
        
        nthash
        
        Command being timed: "./sketch_test 12754_5#71 12754_5#71.contigs_velvet.fa 12754_5#72 12754_5#72.contigs_velvet.fa rfiles.txt"
        User time (seconds): 64.84
        System time (seconds): 0.54
        Percent of CPU this job got: 363%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.99
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 49308
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 34260
        Voluntary context switches: 1786
        Involuntary context switches: 204
        Swaps: 0
        File system inputs: 0
        File system outputs: 39104
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
        
        Command being timed: "./read_test 12673_8_26 12673_8#26.fastq.gz 12754_4_66 12754_4#66_1.fastq.gz 12754_4#66_2.fastq.gz"
        User time (seconds): 252.22
        System time (seconds): 10.66
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:23.58
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1522152
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 1569866
        Voluntary context switches: 63
        Involuntary context switches: 180
        Swaps: 0
        File system inputs: 1606592
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Closed in 7029c0c