chenkovsky / poplar-trie

C++17 library of associative arrays with string keys based on a dynamic path-decomposed trie

Poplar-trie

Poplar-trie is a C++17 library of associative arrays with string keys based on a dynamic path-decomposed trie (DynPDT), described in the paper Practical implementation of space-efficient dynamic keyword dictionaries, published in SPIRE 2017 [paper] [slide]. The implementation in this library has been enhanced since the conference version.

A description of the technical details is in preparation.

Implementation overview

Poplar-trie implements an associative array that maps key strings to values of any type and supports dynamic updates, like std::map<std::string,V>. The underlying data structure is the DynPDT.

One property of the DynPDT is that its edge labels are drawn from an integer set larger than the one-byte alphabet of an ordinary trie, so it is important that a child can be found in constant time. Poplar-trie achieves this with hash-based trie implementations, provided by the following two classes:

  • HashTriePR is a plain representation of a hash table.
  • HashTrieCR is a compact representation of a hash table based on m-Bonsai.
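
To illustrate why a hash table keeps child lookup independent of the alphabet size, here is a minimal sketch of a hash-based trie. It is not the code of HashTriePR or HashTrieCR: the class HashTrieSketch, the constant kNil, the hash mixer, and the growth policy are all assumptions made for this example. Every edge is stored in one open-addressing table under the key (parent id, edge symbol), so finding a child takes expected constant time even when symbols are large integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not Poplar-trie's implementation.
constexpr std::uint64_t kNil = ~std::uint64_t{0};  // marks "no such child"

class HashTrieSketch {
  struct Slot {
    std::uint64_t parent = 0;
    std::uint64_t symbol = 0;
    std::uint64_t child = kNil;  // kNil means the slot is empty
  };

 public:
  HashTrieSketch() : slots_(16) {}

  std::uint64_t root() const { return 0; }

  // Returns the child reached from `parent` via `symbol`, or kNil.
  std::uint64_t find_child(std::uint64_t parent, std::uint64_t symbol) const {
    return slots_[locate(slots_, parent, symbol)].child;
  }

  // Returns the existing child, or creates a new node and returns its id.
  std::uint64_t add_child(std::uint64_t parent, std::uint64_t symbol) {
    std::size_t i = locate(slots_, parent, symbol);
    if (slots_[i].child != kNil) return slots_[i].child;
    if ((num_edges_ + 1) * 2 > slots_.size()) {  // keep load factor <= 1/2
      grow();
      i = locate(slots_, parent, symbol);
    }
    assert(slots_[i].child == kNil);
    slots_[i].parent = parent;
    slots_[i].symbol = symbol;
    slots_[i].child = next_id_;
    ++num_edges_;
    return next_id_++;
  }

 private:
  static std::size_t mix(std::uint64_t parent, std::uint64_t symbol) {
    std::uint64_t h = parent * 0x9E3779B97F4A7C15ULL + symbol;
    h ^= h >> 32;
    h *= 0xD6E8FEB86659FD93ULL;
    h ^= h >> 32;
    return static_cast<std::size_t>(h);
  }

  // Linear probing: returns the slot holding the edge, or the first empty slot.
  static std::size_t locate(const std::vector<Slot>& slots,
                            std::uint64_t parent, std::uint64_t symbol) {
    const std::size_t mask = slots.size() - 1;  // size is a power of two
    std::size_t i = mix(parent, symbol) & mask;
    while (slots[i].child != kNil &&
           (slots[i].parent != parent || slots[i].symbol != symbol)) {
      i = (i + 1) & mask;
    }
    return i;
  }

  // Doubles the table and reinserts every stored edge.
  void grow() {
    std::vector<Slot> bigger(slots_.size() * 2);
    for (const Slot& s : slots_) {
      if (s.child != kNil) bigger[locate(bigger, s.parent, s.symbol)] = s;
    }
    slots_.swap(bigger);
  }

  std::vector<Slot> slots_;
  std::size_t num_edges_ = 0;
  std::uint64_t next_id_ = 1;  // id 0 is the root
};
```

Calling add_child(trie.root(), 1000) creates a node, and a later find_child(trie.root(), 1000) returns the same id. This sketch stores three 64-bit fields per slot; a compact representation in the style of m-Bonsai would reduce that space, which is not shown here.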

Another property is that each node of the trie has a string label, so pointers to these labels have to be stored. The library provides three methods for managing them:

  • LabelStorePM simply stores all pointers to string labels.
  • LabelStoreEM embeds short string labels directly into the space reserved for their pointers.
  • LabelStoreGM reduces the overhead by grouping pointers in the same manner as sparsehash.
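
The grouping idea mentioned for LabelStoreGM can be sketched as follows. This is an illustration of the general sparsehash-style technique, not the library's code: the class GroupedLabelStoreSketch, the group size of 64, and the use of std::string instead of raw pointers are assumptions made for this example. Slots are split into groups; each group keeps a 64-bit occupancy bitmap plus a dense array holding only the labels that exist, and the position of a label in the dense array is the number of set bits below its slot.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch, not Poplar-trie's implementation.
class GroupedLabelStoreSketch {
  struct Group {
    std::uint64_t bitmap = 0;         // bit i is set if slot i holds a label
    std::vector<std::string> labels;  // only the occupied slots, in slot order
  };

 public:
  explicit GroupedLabelStoreSketch(std::size_t num_slots)
      : groups_((num_slots + 63) / 64) {}

  // Stores the label of `slot`, overwriting any previous label.
  void set(std::size_t slot, const std::string& label) {
    assert(slot < groups_.size() * 64);
    Group& g = groups_[slot / 64];
    const unsigned bit = static_cast<unsigned>(slot % 64);
    const std::size_t rank = rank_below(g.bitmap, bit);
    if ((g.bitmap >> bit) & 1) {
      g.labels[rank] = label;
    } else {
      g.labels.insert(g.labels.begin() + static_cast<std::ptrdiff_t>(rank), label);
      g.bitmap |= std::uint64_t{1} << bit;
    }
  }

  // Returns the label of `slot`, or nullptr if the slot is empty.
  const std::string* get(std::size_t slot) const {
    assert(slot < groups_.size() * 64);
    const Group& g = groups_[slot / 64];
    const unsigned bit = static_cast<unsigned>(slot % 64);
    if (!((g.bitmap >> bit) & 1)) return nullptr;
    return &g.labels[rank_below(g.bitmap, bit)];
  }

 private:
  // Number of occupied slots strictly below `bit` (a population count).
  static std::size_t rank_below(std::uint64_t bitmap, unsigned bit) {
    const std::uint64_t lower = bitmap & ((std::uint64_t{1} << bit) - 1);
    return std::bitset<64>(lower).count();
  }

  std::vector<Group> groups_;
};
```

An empty slot costs one bit in this layout instead of a full pointer, at the price of a population count on every access and an element shift on insertion. This trade-off is in line with the slower but smaller results reported for the *G variants in the benchmarks.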

Class Map implements the associative array and takes HashTrie* and LabelStore* classes as template arguments, which gives six possible combinations. They are easy to use because poplar.hpp provides the following aliases:

  • MapPP = Map + HashTriePR + LabelStorePM (fastest)
  • MapPE = Map + HashTriePR + LabelStoreEM
  • MapPG = Map + HashTriePR + LabelStoreGM
  • MapCP = Map + HashTrieCR + LabelStorePM
  • MapCE = Map + HashTrieCR + LabelStoreEM
  • MapCG = Map + HashTrieCR + LabelStoreGM (smallest)

All of these share the template argument t_lambda = 16. This parameter depends on the lengths of the given strings. According to previous experimental results, the default value 16 works well for natural language words, while 32 or 64 works well for long strings such as URLs.
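
The alias scheme can be pictured with a small self-contained mock. Everything in the namespace sketch below is a stand-in written for this illustration: the empty policy structs, the template parameter lists, and the position of the lambda parameter are assumptions and may differ from the real declarations in poplar.hpp. The point is only that an alias template fixes the trie and label-store combination while leaving the value type and the lambda value open.

```cpp
#include <cstdint>
#include <type_traits>

namespace sketch {  // hypothetical stand-ins, not the declarations in poplar.hpp

struct HashTriePR {};
struct HashTrieCR {};
template <typename Value> struct LabelStorePM { using value_type = Value; };
template <typename Value> struct LabelStoreGM { using value_type = Value; };

// The map is parameterized by a trie policy, a label-store policy, and lambda.
template <typename Trie, typename LabelStore, std::uint64_t Lambda>
struct Map {
  using trie_type = Trie;
  using value_type = typename LabelStore::value_type;
  static constexpr std::uint64_t lambda = Lambda;
};

// Aliases fix the policy combination; lambda defaults to 16.
template <typename Value, std::uint64_t Lambda = 16>
using MapPP = Map<HashTriePR, LabelStorePM<Value>, Lambda>;
template <typename Value, std::uint64_t Lambda = 16>
using MapCG = Map<HashTrieCR, LabelStoreGM<Value>, Lambda>;

}  // namespace sketch

// Default lambda for short keys such as natural language words.
using WordMap = sketch::MapPP<int>;
// A larger lambda for long keys such as URLs.
using UrlMap = sketch::MapCG<int, 32>;

static_assert(WordMap::lambda == 16, "default lambda");
static_assert(UrlMap::lambda == 32, "overridden lambda");
static_assert(std::is_same<UrlMap::trie_type, sketch::HashTrieCR>::value,
              "MapCG uses the compact trie");
```

If the real aliases accept the lambda value as a trailing template argument in the same way, switching between the default and a larger value would be a one-line change at the point where the map type is named.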

Build instructions

You can download and compile Poplar-trie with the following commands.

$ git clone https://github.com/kampersanda/poplar-trie.git
$ cd poplar-trie
$ mkdir build
$ cd build
$ cmake .. -DPOPLAR_USE_POPCNT=ON
$ make
$ make install

The library requires C++17, so use g++ 7.0 or later, or clang 4.0 or later. The commands above also require CMake 3.8 or later. Passing -DPOPLAR_USE_POPCNT=ON enables the SSE4.2 POPCNT instruction.

Easy example

The following code is a simple example of inserting and searching for key-value pairs.

#include <iostream>
#include <string>
#include <vector>

#include <poplar.hpp>

int main() {
  std::vector<std::string> keys = {
    "Aoba", "Yun", "Hajime", "Hihumi",
    "Kou", "Rin", "Hazuki", "Umiko", "Nene"
  };
  // Convert once so the loops compare int with int (no signed/unsigned mismatch).
  const int num_keys = static_cast<int>(keys.size());

  poplar::MapPP<int> map;

  try {
    // Insert each key and write its value through the returned pointer.
    for (int i = 0; i < num_keys; ++i) {
      int* ptr = map.update(keys[i]);
      *ptr = i + 1;
    }
    // Look up every inserted key and check its value.
    for (int i = 0; i < num_keys; ++i) {
      const int* ptr = map.find(keys[i]);
      if (ptr == nullptr || *ptr != i + 1) {
        return 1;
      }
      std::cout << keys[i] << ": " << *ptr << std::endl;
    }
    // A key that was never inserted must not be found.
    {
      const int* ptr = map.find("Hotaru");
      if (ptr != nullptr) {
        return 1;
      }
      std::cout << "Hotaru: " << -1 << std::endl;
    }
  } catch (const poplar::Exception& ex) {
    std::cerr << ex.what() << std::endl;
    return 1;
  }

  std::cout << "# of keys is " << map.size() << std::endl;

  return 0;
}

The output will be

Aoba: 1
Yun: 2
Hajime: 3
Hihumi: 4
Kou: 5
Rin: 6
Hazuki: 7
Umiko: 8
Nene: 9
Hotaru: -1
# of keys is 9

Benchmarks

The main advantage of Poplar-trie is its high space efficiency, as can be seen in the following results.

The experiments were carried out on an Intel Xeon E5 CPU @ 3.5 GHz with 32 GB of RAM, running Mac OS X 10.12. The code was compiled with Apple LLVM version 8 (clang-8) at optimization level -O3. The dictionaries were constructed by inserting all page titles from Japanese Wikipedia (32.3 MiB) in random order. The value type is int. The maximum resident set size during construction was measured with the /usr/bin/time command. Insertion time was measured using std::chrono::duration_cast, and search time was measured by looking up the same strings.

Implementation  Space (MiB)  Insertion (µs / key)  Search (µs / key)
MapPP           80.4         0.68                  0.48
MapPE           75.6         0.91                  0.57
MapPG           47.2         1.71                  0.80
MapCP           65.5         0.81                  0.54
MapCE           61.6         1.00                  0.61
MapCG           42.3         1.62                  0.85
JudySL          72.7         0.73                  0.49
hat-trie        74.5         0.97                  0.25
cedarpp         94.7         0.69                  0.42
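
The per-key timing described in the setup above can be reproduced in outline with std::chrono. The following is a minimal sketch and not the benchmark code used for the table: the names micros_per_key, Timing, and benchmark_std_map are made up for this example, and std::map stands in for the dictionary so that the snippet compiles without Poplar-trie. Shuffling the keys and measuring the resident set size are left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: average time in microseconds of calling `op` once per key.
template <typename Op>
double micros_per_key(const std::vector<std::string>& keys, Op op) {
  if (keys.empty()) return 0.0;
  const auto start = std::chrono::steady_clock::now();
  for (const std::string& key : keys) op(key);
  const auto elapsed = std::chrono::steady_clock::now() - start;
  const auto nanos =
      std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
  return static_cast<double>(nanos) / 1000.0 / static_cast<double>(keys.size());
}

struct Timing {
  double insert_us;  // microseconds per inserted key
  double search_us;  // microseconds per searched key
  std::size_t hits;  // number of successful searches
};

// Inserts all keys, then searches for the same keys, timing both phases.
Timing benchmark_std_map(const std::vector<std::string>& keys) {
  std::map<std::string, int> dict;  // stand-in for the dictionary under test
  int next_value = 0;
  Timing t{};
  t.insert_us = micros_per_key(
      keys, [&](const std::string& key) { dict[key] = ++next_value; });
  t.search_us = micros_per_key(
      keys, [&](const std::string& key) { t.hits += dict.count(key); });
  assert(t.hits == keys.size());  // every inserted key must be found again
  return t;
}
```

Measuring in nanoseconds and dividing afterwards avoids losing precision when a single operation takes well under a microsecond, which is the range reported in the table.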

Todo

  • Support the deletion operation
  • Add comments to the code
  • Write the API documentation

Related work

  • compact_sparse_hash is an efficient implementation of a compact associative array with integer keys.
  • mBonsai is the original implementation of succinct dynamic tries.
  • tudocomp includes many dynamic trie implementations for LZ factorization.

Special thanks

Thanks to Dr. Dominik Köppl, I was able to create the bijective hash function in bijective_hash.hpp.

License: MIT License

