messense / jieba-rs

The Jieba Chinese Word Segmentation Implemented in Rust

Survey whether DARTS (double-array trie) could replace SwissTable for the dictionary.

MnO2 opened this issue

The double-array trie is adopted in HanLP for Chinese segmentation. It has good properties and is worth benchmarking against the hashbrown implementation of SwissTable.

I did naive benchmarking with the PATRICIA trie and critbit trie implementations that I could find on crates.io, but they are much slower than hashbrown.

As for reference implementations, rust-darts is three years old without updates and probably no longer compiles.

We used radix_trie before: a6a9542

radix_trie's implementation uses Box<TrieNode<K, V>>, which I suppose results in lots of memory allocations and therefore slows it down. https://github.com/michaelsproul/rust_radix_trie/blob/master/src/trie_node.rs

From a brief reading, DARTS looks like a different kind of implementation that only requires array index accesses. It's worth some investment of my time to play with it, given that other Chinese segmenter implementations depend on it.
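For context, the classic double-array transition looks roughly like this (a minimal sketch with illustrative names, not rust-darts' actual code). Each step is two array reads and a comparison, with no pointer chasing:

// One transition of a double-array trie (Aoe's scheme): from state `s` on
// input code `c`, the candidate next state is base[s] + c, and it is valid
// only if check[] points back at `s`.
fn step(base: &[usize], check: &[usize], s: usize, c: usize) -> Option<usize> {
    let next = base[s] + c;
    if next < check.len() && check[next] == s {
        Some(next)
    } else {
        None
    }
}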

Sequence Trie supports prefix_iter: https://github.com/michaelsproul/rust_sequence_trie/blob/master/src/lib.rs#L443

            // Scan ever-longer fragments of `sentence` starting at `byte_start`,
            // looking each one up in the dictionary until a prefix is missing.
            while i < word_count {
                if let Some(freq) = self.dict.get(wfrag).map(|x| x.0) {
                    if freq > 0 {
                        tmplist.push(i);
                    }
                    i += 1;
                    // Extend the fragment by one more character for the next lookup.
                    wfrag = if i + 1 < word_count {
                        let byte_end = char_indices[i + 1];
                        &sentence[byte_start..byte_end]
                    } else {
                        &sentence[byte_start..]
                    };
                } else {
                    break;
                }
            }

For this section we don't really need to iterate all the way to word_count: we only need to iterate over the prefixes of the given substring that exist in the dictionary, which some of the trie crates support directly. A rough sketch of that shape follows.
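This is a minimal sketch, assuming rust-darts' DoubleArrayTrie type and the common_prefix_search signature quoted later in this thread; the interpretation of the returned (usize, usize) pairs as (value, matched byte length) is an assumption for illustration.

// Sketch: collect DAG edge ends with a single prefix walk instead of one
// hash lookup per fragment length. The pair interpretation is assumed.
fn dag_ends(dict: &darts::DoubleArrayTrie, sentence: &str, byte_start: usize) -> Vec<usize> {
    let mut ends = Vec::new();
    // One call walks the trie along the substring and stops by itself as
    // soon as no dictionary word extends the current prefix -- no need to
    // loop all the way to word_count.
    if let Some(matches) = dict.common_prefix_search(&sentence[byte_start..]) {
        for (_value, len) in matches {
            ends.push(byte_start + len);
        }
    }
    ends
}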

Sorry for the inconvenience.
https://github.com/andelf/rust-darts is very outdated. I'll try to fix it.

UPDATE:

https://github.com/andelf/rust-darts is now ready for Rust nightly.

I've updated the dependencies and fixed the build errors, so it's now OK to run some tests. Some old Rust coding style (from 3 years ago) may affect performance.

@andelf Wow, that's fast. :-) I was thinking that it would be alright for me to update it, as long as the license is not an issue; the code is self-explanatory and basically follows the style of the C++ implementation of darts. Thanks for the update!

I did naive micro-benchmarking with criterion, and here are the results.

dat prefix search       time:   [269.82 ns 273.11 ns 277.59 ns]
                        change: [-2.5627% -1.0033% +0.6578%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

hashbrown prefix search time:   [258.18 ns 263.23 ns 272.87 ns]
                        change: [-1.2466% +0.3418% +1.9128%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

dat match found         time:   [6.0955 ns 6.1448 ns 6.2088 ns]
                        change: [+2.9946% +4.1448% +5.2864%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

hashbrown match found   time:   [14.951 ns 14.971 ns 14.992 ns]
                        change: [-3.6464% -2.1758% -0.8860%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  10 (10.00%) high severe

dat match not found fast fail
                        time:   [3.7179 ns 3.7259 ns 3.7369 ns]
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  10 (10.00%) high severe

hashbrown match not found fast fail
                        time:   [7.0072 ns 7.0414 ns 7.0985 ns]
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  11 (11.00%) high severe

dat match not found slow fail
                        time:   [22.866 ns 22.891 ns 22.918 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe

hashbrown match not found slow fail
                        time:   [7.1389 ns 7.1988 ns 7.2707 ns]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

hashbrown is really, really fast: even for prefix search, where I expected DARTS to win, hashbrown still runs slightly faster.

The reason might be that Chinese segmentation hits a lot of slow-fail cases, where hashbrown has steady performance while DARTS runs slower when the match fails in the latter part of the string.

The current implementation of DARTS is not at an advantage in memory footprint either: it takes about 20 MB (22,486,408 bytes), while hashbrown holds about 917,504 elements, each of which is roughly 8 bytes of hash value plus the (key, value) pair for the record itself. It's not apples to apples, but with a rough estimate that counts (8 bytes of hash value + key length) as part of the table storage, DARTS would only win when the average (key length + string length) is greater than 16 bytes.

There are still optimizations that could be done to the DARTS implementation, for example:

  1. Using UTF-8 for the key, that is, treating the key as bytes rather than Unicode scalar values (illustrated below).
  2. Suffix compression.

But I don't think they would make DARTS significantly outperform the hashbrown implementation in the Chinese segmentation use case. The space compression may still be helpful, since it would reduce the size of the table and help DARTS fit into cache.
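To make optimization 1 concrete, here is a small illustration (standard library only) of the two key encodings:

// Transitioning per UTF-8 byte keeps the trie alphabet at 256 symbols,
// while Unicode scalar values span up to 0x10FFFF, which inflates the
// base/check arrays.
fn main() {
    let word = "东湖";
    let scalars: Vec<u32> = word.chars().map(|c| c as u32).collect();
    let bytes: Vec<u8> = word.bytes().collect();
    println!("scalars: {:x?}", scalars); // [4e1c, 6e56]
    println!("bytes:   {:x?}", bytes);   // six bytes, each in 0..=255
}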

This observation goes against my intuition and surprises me. The hashbrown implementation is really powerful and hard to beat.

For reference, the code snippets are here (DA and HASHMAP are the pre-built DARTS and hashbrown dictionaries):

fn bench_dat_prefix_search() {
    DA.common_prefix_search("东湖高新技术开发区").unwrap();
}

fn bench_hashbrown_prefix_search() {
    let sentence: &str = "东湖高新技术开发区";
    let char_indices: Vec<usize> = sentence.char_indices().map(|x| x.0).collect();

    let word_count = char_indices.len();
    for (k, &byte_start) in char_indices.iter().enumerate() {
        let mut i = k;
        let mut wfrag = if k + 1 < char_indices.len() {
            &sentence[byte_start..char_indices[k + 1]]
        } else {
            &sentence[byte_start..]
        };

        while i < word_count {
            if HASHMAP.contains_key(wfrag) {
                // do nothing, we only measure the lookup
            }

            i += 1;
            // extend the fragment by one more character, as in the real scan
            wfrag = if i + 1 < word_count {
                &sentence[byte_start..char_indices[i + 1]]
            } else {
                &sentence[byte_start..]
            };
        }
    }
}

fn bench_dat_match_found() {
    DA.exact_match_search("我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。");
}

fn bench_hashbrown_match_found() {
    HASHMAP.contains_key("我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。");
}

fn bench_dat_match_not_found_slow_fail() {
    DA.exact_match_search("东湖高新技术开发区abcdef");
}

fn bench_hashbrown_match_not_found_slow_fail() {
    HASHMAP.contains_key("东湖高新技术开发区abcdef");
}

fn bench_dat_match_not_found_fast_fail() {
    DA.exact_match_search("abcdef东湖高新技术开发区");
}

fn bench_hashbrown_match_not_found_fast_fail() {
    HASHMAP.contains_key("abcdef东湖高新技术开发区");
}
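
For completeness, this is roughly how the functions above get hooked into criterion's measurement loop (a sketch; the actual harness may differ):

use criterion::{criterion_group, criterion_main, Criterion};

// Wire each benchmark body into criterion; b.iter runs it many times
// and collects the timing distributions reported above.
fn benches(c: &mut Criterion) {
    c.bench_function("dat prefix search", |b| b.iter(bench_dat_prefix_search));
    c.bench_function("hashbrown prefix search", |b| b.iter(bench_hashbrown_prefix_search));
    c.bench_function("dat match found", |b| b.iter(bench_dat_match_found));
    c.bench_function("hashbrown match found", |b| b.iter(bench_hashbrown_match_found));
}

criterion_group!(dict_benches, benches);
criterion_main!(dict_benches);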

With a slight modification in rust-darts, the prefix search time is cut in half: allocate the result vector in one batch.

    pub fn common_prefix_search(&self, key: &str) -> Option<Vec<(usize, usize)>> {
        let mut result = Vec::with_capacity(10);
        // ...

dat prefix search       time:   [100.46 ns 101.38 ns 102.65 ns]
                        change: [-34.176% -30.643% -27.245%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

hashbrown prefix search time:   [262.81 ns 263.76 ns 264.92 ns]
                        change: [-39.026% -35.025% -30.948%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  11 (11.00%) high severe

I rewrote the code and ran preliminary bench tests locally; the results look promising once this pull request is merged: andelf/rust-darts#12

The performance boost is roughly 25% on small inputs, and an ad-hoc run on weicheng shows about a 30% boost as well.

jieba cut no hmm        time:   [10.057 us 10.157 us 10.265 us]
                        change: [-3.0766% -2.0162% -0.9084%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

jieba with darts cut no hmm time:   [7.5082 us 7.5655 us 7.6285 us]
                        change: [+3.9420% +5.1029% +6.3200%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

I suggest that after we merge the pull request for prefix-iter, we publish rust-darts to crates.io so that jieba-rs doesn't rely on local builds. What do you think @andelf?

For reference, with aho-corasick the performance is as follows.

jieba cut no hmm        time:   [11.168 us 11.224 us 11.287 us]
                        change: [-14.370% -9.3553% -4.1684%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

jieba static cut no hmm time:   [8.4451 us 8.5658 us 8.7150 us]
                        change: [-10.733% -6.1712% -1.8499%] (p = 0.01 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild
  6 (6.00%) high severe

I put my optimized code in my branch (it still needs to wait for rust-darts): https://github.com/MnO2/jieba-rs/blob/jieba-static/src/unstable/mod.rs

➜  jieba-rs git:(jieba-static) ✗ ./target/release/weicheng
Jieba Elapsed: 6611 ms
Jieba Unstable Elapsed: 5072 ms

Now it runs slightly faster than the C++ version on the weicheng task.

➜  build git:(master) ✗ ./load_test
process [100 %]
Cut: [5.521 seconds]time consumed.
process [100 %]
Extract: [0.710 seconds]time consumed.

@MnO2 I'd rather not use bincode to serialize/deserialize data to/from disk; it's not a format designed with backward compatibility in mind. If we're going to distribute a binary model, I'd use protobuf or flatbuffers.

bincode-org/bincode#221 (comment)

Good point on backward compatibility; support in other languages would also be a problem. Let me lay out the whole context, since it actually involves a few engineering design trade-offs.

A double-array trie, though it can support dynamic insertion and deletion (i.e. out-of-order insertion), is more complicated to implement that way.
The current implementation makes the big assumption that the dictionary is passed in lexicographical order, which makes the trie more efficient to build and easier to implement. The downside is that it takes about 30s to sort and build the trie even on my Core i7 MacBook, and we have to hide the add_word API from users. So there are a few approaches, each with its pros and cons:

i. Use a serialization format to persist the sorted order and the memory layout of the Rust data structure, so it can be loaded and deserialized very quickly. We could build it for users, or users could build it themselves.
ii. Require the text version of the dictionary to be sorted, or abort the program and tell the user the dictionary has to be sorted. This would be slightly slower than a binary format, but it skips the sorting step and would therefore probably take only a few seconds to load. That should be OK for most use cases.
iii. Use an alternative like aho-corasick (https://github.com/BurntSushi/aho-corasick); in my tests its performance is roughly the same as DARTS, and the implementation effort is roughly the same as well. The library's interface accepts an array of &str and builds the NFA for you; it doesn't require sorted input, and it takes a few seconds on my i7 MacBook to load (see the sketch after this list). The benefit is that we don't need to ship another binary serialization file. However, it is also a static-dictionary approach and we would need to drop add_word.
iv. Implement dynamic insertion for DARTS. This one needs more time, since we would need to read the other implementations to understand how to implement it.
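
To illustrate option iii, here is a minimal sketch against the aho-corasick crate (using the 1.x API, where new returns a Result; the word list is a stand-in for the real dictionary):

use aho_corasick::AhoCorasick;

fn main() {
    // Stand-in dictionary words; the real list would come from dict.txt.
    let words = ["东湖", "高新", "技术", "开发", "开发区"];
    // Build the automaton; no sorted input required.
    let ac = AhoCorasick::new(words).expect("valid patterns");
    // Overlapping search reports every dictionary word inside the sentence.
    for m in ac.find_overlapping_iter("东湖高新技术开发区") {
        println!("{} at bytes {}..{}", words[m.pattern().as_usize()], m.start(), m.end());
    }
}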

Supposing we'd like to keep a serialized format, let me put everything on the table first to make sure we understand the landscape.

Uber wrote a thorough blog post before: https://eng.uber.com/trip-data-squeeze/

And here is the benchmark I ran on my laptop:
https://github.com/erickt/rust-serialization-benchmarks

running 19 tests
test capnp_deserialize                 ... bench:         254 ns/iter (+/- 41) = 1763 MB/s
test capnp_deserialize_packed          ... bench:         491 ns/iter (+/- 46) = 686 MB/s
test capnp_populate                    ... bench:         434 ns/iter (+/- 29)
test capnp_serialize                   ... bench:          24 ns/iter (+/- 4) = 18666 MB/s
test capnp_serialize_packed            ... bench:         345 ns/iter (+/- 43) = 976 MB/s
test clone                             ... bench:       1,118 ns/iter (+/- 169) = 468 MB/s
test flatbuffers_deserialize           ... bench:           0 ns/iter (+/- 0) = 472000 MB/s
test flatbuffers_populate_with_args    ... bench:         483 ns/iter (+/- 65)
test flatbuffers_populate_with_builder ... bench:         455 ns/iter (+/- 76)
test flatbuffers_serialize             ... bench:           0 ns/iter (+/- 0) = 472000 MB/s
test rmp_serde_deserialize             ... bench:       1,644 ns/iter (+/- 116) = 174 MB/s
test rmp_serde_serialize               ... bench:         247 ns/iter (+/- 10) = 1161 MB/s
test rust_bincode_deserialize          ... bench:       1,330 ns/iter (+/- 158) = 300 MB/s
test rust_bincode_serialize            ... bench:         155 ns/iter (+/- 13) = 2580 MB/s
test rust_protobuf_deserialize         ... bench:         479 ns/iter (+/- 13) = 597 MB/s
test rust_protobuf_populate            ... bench:       1,299 ns/iter (+/- 205)
test rust_protobuf_serialize           ... bench:         439 ns/iter (+/- 67) = 651 MB/s
test serde_json_deserialize            ... bench:       2,100 ns/iter (+/- 132) = 288 MB/s
test serde_json_serialize              ... bench:       1,193 ns/iter (+/- 205) = 507 MB/s

The downside of protobuf or flatbuffers is that we need to maintain the interface (schema) file, but they definitely run faster than the formats that don't require one. That might be OK for users of the library, since the schema is hidden from them.
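For illustration, a hypothetical sketch of a protobuf-backed model using the prost crate; the message name and fields are made up, but the two flat arrays are essentially the whole DARTS model, so the schema stays tiny:

use prost::Message;

// Hypothetical message for shipping a built double-array trie.
#[derive(Clone, PartialEq, prost::Message)]
pub struct DatModel {
    #[prost(int32, repeated, tag = "1")]
    pub base: Vec<i32>,
    #[prost(int32, repeated, tag = "2")]
    pub check: Vec<i32>,
}

fn roundtrip(model: &DatModel) -> DatModel {
    // encode_to_vec / decode are prost's standard (de)serialization entry points.
    let bytes = model.encode_to_vec();
    DatModel::decode(bytes.as_slice()).expect("well-formed protobuf")
}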

With the above in mind, one thing is for sure: we need to examine the support for add_word. Should we provide two versions of Jieba, one of which supports the flexibility of add_word? Right now I can't think of a use case where this flexibility is a hard requirement: a search engine could just restart the server process, and a mobile client could just reload the app. Would it be too restrictive to only provide the version with a static dictionary? Or should we maintain both and let the user choose? But that may result in code duplication.

Latest benchmark:

rust

➜  jieba-rs git:(jieba-static) ✗ ./target/release/weicheng
Jieba Elapsed: 4506 ms
Jieba Unstable Elapsed: 4174 ms

cpp

➜  build git:(master) ✗ ./load_test
process [100 %]
Cut: [5.920 seconds]time consumed.

With this change: andelf/rust-darts#19
The index-building time for DARTS has been reduced to 5s on my i7 MacBook (2017).

Latest darts performance from this branch.

We can conclude that it definitely improves performance, by a ballpark figure of about 200ms on the weicheng test case, from removing a few memory allocations.

The conclusion for this issue is clear. What remains is whether we want to sacrifice API flexibility for speed, or provide both and let users choose, if we want to include DARTS today. The ideal way, for sure, is to take it slow and implement dynamic insertion and deletion for DARTS, so that we don't have to trade off add_word and suggest_freq.

jieba cut no hmm        time:   [6.4286 us 6.4485 us 6.4711 us]
                        change: [-3.2185% -2.3098% -1.4338%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

jieba cut with hmm      time:   [9.0924 us 9.1665 us 9.2776 us]
                        change: [-21.688% -20.537% -19.336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
➜  jieba-rs git:(master) ./target/release/weicheng
3957ms
➜  jieba-rs git:(darts) ./target/release/weicheng
3768ms