messense / jieba-rs

The Jieba Chinese Word Segmentation Implemented in Rust

Using more sophisticated techniques for DAG

MnO2 opened this issue

Right now the DAG is Vec<SmallVec<[usize; 5]>>, so its layout in memory is basically "[usize; 5], [usize; 5], ..., [usize; 5]" (plus metadata). I think there is still room for a slight improvement if we allocate it in one chunk, that is, Vec::with_capacity(num_of_nodes * percentile(0.9, len_of_common_prefix)).

Since the dictionary is static, we could precompute statistics from the dictionary to find percentile(0.9, len_of_common_prefix).
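
As a rough illustration of that precomputation, here is a minimal sketch of a nearest-rank percentile over per-node counts. It assumes that len_of_common_prefix means the number of dictionary matches (outgoing edges) found at a position, which is one reading of the issue. The function name and signature are hypothetical and not part of jieba-rs.

```rust
/// Nearest-rank percentile: the smallest sample such that at least a
/// fraction `p` of all samples are less than or equal to it.
/// Returns `None` for empty input or when `p` is outside `0.0..=1.0`.
/// (Hypothetical helper, not an existing jieba-rs function.)
pub fn percentile(p: f64, samples: &[usize]) -> Option<usize> {
    if samples.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(p * n), clamped into 1..=n so that p = 0.0 yields the minimum.
    let n = sorted.len();
    let rank = ((p * n as f64).ceil() as usize).clamp(1, n);
    Some(sorted[rank - 1])
}
```

The result could then serve as the per-node slot count when sizing the single allocation, for example percentile(0.9, &edge_counts), where edge_counts is a hypothetical list of counts gathered once from the dictionary.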

If a node's adjacency list is longer than percentile(0.9, len_of_common_prefix), we could borrow the linear probing technique from hash tables and do a linear search for the overflow entries. Since they are adjacent in memory, this would probably give a better cache hit rate.
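
To make the proposal concrete, here is a minimal sketch of a flat DAG with a fixed number of home slots per node and linear probing for overflow. It is only one possible reading of the idea, not the jieba-rs implementation: the FlatDag type, its methods, and the choice to tag each slot with its owning node are all assumptions made for illustration.

```rust
/// Hypothetical flat DAG: one contiguous allocation holding `stride`
/// home slots per node. Edges that do not fit in a node's home slots
/// spill into the following slots via linear probing.
pub struct FlatDag {
    stride: usize,
    /// `Some((owner, target))` for an occupied slot, `None` for a free one.
    slots: Vec<Option<(usize, usize)>>,
}

impl FlatDag {
    /// Allocates `num_nodes * stride` slots in a single chunk.
    pub fn new(num_nodes: usize, stride: usize) -> Self {
        FlatDag { stride, slots: vec![None; num_nodes * stride] }
    }

    /// Adds the edge `node -> target`. Returns `false` if `node` is out
    /// of range or every slot is already taken.
    pub fn add_edge(&mut self, node: usize, target: usize) -> bool {
        let len = self.slots.len();
        let home = node * self.stride;
        if home >= len {
            return false;
        }
        // Probe forward from the home slot, wrapping around at the end.
        for offset in 0..len {
            let idx = (home + offset) % len;
            if self.slots[idx].is_none() {
                self.slots[idx] = Some((node, target));
                return true;
            }
        }
        false
    }

    /// Collects the targets of `node` by scanning from its home slot
    /// until the first free slot, skipping entries owned by other nodes.
    pub fn edges(&self, node: usize) -> Vec<usize> {
        let len = self.slots.len();
        let home = node * self.stride;
        let mut out = Vec::new();
        if home >= len {
            return out;
        }
        for offset in 0..len {
            match self.slots[(home + offset) % len] {
                Some((owner, target)) if owner == node => out.push(target),
                Some(_) => {} // spilled entry of another node
                None => break, // end of the probe run
            }
        }
        out
    }
}
```

In this sketch each slot carries its owner, which costs extra memory compared with a bare usize; a real implementation would likely pack that information more tightly. Entries are never removed, so the usual linear probing invariant holds: every edge of a node lies in the run of occupied slots that starts at its home slot.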

Some of these ideas come from the SwissTable talk:
https://www.youtube.com/watch?v=ncHmEUmJZf4&t=8s

Merged.