A fast & memory-efficient data structure that cat tell if it saw a string before
In case you need a fast, memory-efficient data structure that would remember whether or not it had seen a string before, for example you are running through a list of filename with possible duplicates that you do not want to double-process, or through long list of emails/GUIDs with possible duplicate that you wanna deal with, here is the module for you.
The Node.js/c++ binding is 2/3 times faster, and uses 6 times less memory than a JavaScript implementation. The pure c++ is even faster and uses 8 times less memory than the nodejs binding, that is 200% faster than pure JavaScript, and 50 times (5000%) more memory-efficient*. Check the Time Table for details.
* For some reason arguments passed from JavaScript to c++ lingers in memory, if you have a way to set them free let me know
You will need google sparsehash, under Ubuntu you can do sudo apt-get install sparsehash
.
Check the test/time-binding.js file for an example usage.
var elephant = require('elephant')
var obj = new elephant.Elephant();
console.log(obj.memorize("hello")); // false: first time to be seen, increases stats_added by 1
console.log(obj.memorize("hello")); // true: seen before, increases stats_duplicates by 1
console.log(obj.remember("hello")); // true: remembered, told before to memorize it
console.log(obj.remember("hi")); // false: not remembered, as has not been told to memorize it
console.log(obj.remember("hi")); // false: still not remembered, of course
console.log(obj.stats()); // { added: 1, duplicates: 1 }
console.log('Success, because you can read this : )');
bool .memorize("string")
tells the elephant to memorize a string, returnstrue
if the string was seen before, andfalse
otherwise.bool .remember("string")
ask the elephant whether it remembers a string (has seen, aka told to memorize that string before), returnstrue
if the string was seen before, andfalse
otherwise.object .stats()
returns some stats.
Numbers between (brackets) refer to the difference between current cell value and the one to its left.
Command being timed | "build/time-native 5000000" | "node test/time-binding.js 5000000" | "node test/time-node.js 5000000" |
---|---|---|---|
User time (seconds) | 10.27 | 14.6 (142%) | 22.18 (152%) |
System time (seconds) | 0 | 0.1 (Infinity%) | 0.41 (410%) |
Percent of CPU this job got | 99% | 100% | 100% |
Elapsed (wall clock) time (h:mm:ss or m:ss) | 0:10.31 | 0:14.65 | 0:22.53 |
Average shared text size (kbytes) | 0 | 0 (=) | 0 (=) |
Average unshared data size (kbytes) | 0 | 0 (=) | 0 (=) |
Average stack size (kbytes) | 0 | 0 (=) | 0 (=) |
Average total size (kbytes) | 0 | 0 (=) | 0 (=) |
Maximum resident set size (kbytes) | 44576 | 364368 (817%) | 2188960 (601%) |
Average resident set size (kbytes) | 0 | 0 (=) | 0 (=) |
Major (requiring I/O) page faults | 0 | 0 (=) | 0 (=) |
Minor (reclaiming a frame) page faults | 2997 | 23614 (788%) | 167480 (709%) |
Voluntary context switches | 1 | 14790 (1479000%) | 19914 (135%) |
Involuntary context switches | 872 | 1253 (144%) | 2008 (160%) |
Swaps | 0 | 0 (=) | 0 (=) |
File system inputs | 0 | 0 (=) | 0 (=) |
File system outputs | 0 | 0 (=) | 0 (=) |
Socket messages sent | 0 | 0 (=) | 0 (=) |
Socket messages received | 0 | 0 (=) | 0 (=) |
Signals delivered | 0 | 0 (=) | 0 (=) |
Page size (bytes) | 4096 | 4096 (=) | 4096 (=) |
Exit status | 0 | 0 (=) | 0 (=) |
- Add more hashes options, besides the default 32-bit MurmurHash3, probably FNV-1 32-bit & 64-bit..
© 2013 Hasan Arous. All rights reserved.
Used libraries are copyrighted to their owners, as seen in files comments.