aularon / node-elephant

A fast & memory-efficient data structure that cat tell if it saw a string before

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Elephant Logo

A fast & memory-efficient data structure that cat tell if it saw a string before

Summary

In case you need a fast, memory-efficient data structure that would remember whether or not it had seen a string before, for example you are running through a list of filename with possible duplicates that you do not want to double-process, or through long list of emails/GUIDs with possible duplicate that you wanna deal with, here is the module for you.

The Node.js/c++ binding is 2/3 times faster, and uses 6 times less memory than a JavaScript implementation. The pure c++ is even faster and uses 8 times less memory than the nodejs binding, that is 200% faster than pure JavaScript, and 50 times (5000%) more memory-efficient*. Check the Time Table for details.

* For some reason arguments passed from JavaScript to c++ lingers in memory, if you have a way to set them free let me know

Prerequisites

You will need google sparsehash, under Ubuntu you can do sudo apt-get install sparsehash.

Usage

Check the test/time-binding.js file for an example usage.

Example

var elephant = require('elephant')

var obj = new elephant.Elephant();

console.log(obj.memorize("hello"));	// false: first time to be seen, increases stats_added by 1
console.log(obj.memorize("hello"));	// true: seen before, increases stats_duplicates by 1
console.log(obj.remember("hello"));	// true: remembered, told before to memorize it
console.log(obj.remember("hi"));	// false: not remembered, as has not been told to memorize it
console.log(obj.remember("hi"));	// false: still not remembered, of course

console.log(obj.stats());			// { added: 1, duplicates: 1 }

console.log('Success, because you can read this : )');

API

  • bool .memorize("string") tells the elephant to memorize a string, returns true if the string was seen before, and false otherwise.
  • bool .remember("string") ask the elephant whether it remembers a string (has seen, aka told to memorize that string before), returns true if the string was seen before, and false otherwise.
  • object .stats() returns some stats.

Time Table

Numbers between (brackets) refer to the difference between current cell value and the one to its left.

Command being timed"build/time-native 5000000""node test/time-binding.js 5000000""node test/time-node.js 5000000"
User time (seconds)10.2714.6 (142%)22.18 (152%)
System time (seconds)00.1 (Infinity%)0.41 (410%)
Percent of CPU this job got99%100%100%
Elapsed (wall clock) time (h:mm:ss or m:ss)0:10.310:14.650:22.53
Average shared text size (kbytes)00 (=)0 (=)
Average unshared data size (kbytes)00 (=)0 (=)
Average stack size (kbytes)00 (=)0 (=)
Average total size (kbytes)00 (=)0 (=)
Maximum resident set size (kbytes)44576364368 (817%)2188960 (601%)
Average resident set size (kbytes)00 (=)0 (=)
Major (requiring I/O) page faults00 (=)0 (=)
Minor (reclaiming a frame) page faults299723614 (788%)167480 (709%)
Voluntary context switches114790 (1479000%)19914 (135%)
Involuntary context switches8721253 (144%)2008 (160%)
Swaps00 (=)0 (=)
File system inputs00 (=)0 (=)
File system outputs00 (=)0 (=)
Socket messages sent00 (=)0 (=)
Socket messages received00 (=)0 (=)
Signals delivered00 (=)0 (=)
Page size (bytes)40964096 (=)4096 (=)
Exit status00 (=)0 (=)

Roadmap

  • Add more hashes options, besides the default 32-bit MurmurHash3, probably FNV-1 32-bit & 64-bit..

Copyright & License

© 2013 Hasan Arous. All rights reserved.

Used libraries are copyrighted to their owners, as seen in files comments.

Mozilla Public License Version 2.0

About

A fast & memory-efficient data structure that cat tell if it saw a string before


Languages

Language:C++ 72.3%Language:JavaScript 22.9%Language:C 4.8%