memgraph / mgcxx

mgcxx (experimental)

A collection of C++ wrappers around non-C++ libraries. The list includes:

  • full-text search enabled by tantivy

Requirements:

  • cmake 3.15+
  • rustup toolchain 1.75.0+

How to build and test?

mkdir build && cd build
cmake ..
make && ctest

text_search

TODOs

  • Polish & test all error messages
  • Write unit / integration tests to compare STRING vs JSON field search query syntax.
  • Figure out the right search syntax for a property graph
  • Add some notion of pagination
  • Add some notion of backwards compatibility -> some help to the user
  • How to:
    • search all properties
    • fuzzy search:
        // let term = Term::from_field_text(data_field, &input.search_query);
        // let query = FuzzyTermQuery::new(term, 2, true);
  • Add Github Actions
  • Add benchmarks:
    • Test the tradeoff between searching STRING vs JSON TEXT. What does the query look like?
    • Search direct field vs JSON, FAST vs SLOW, String vs CxxString
    • MATCH (n) RETURN count(n), n.deleted;
    • search of a specific property value
    • benchmark (add|retrieve simple|complex, filtering, aggregations).
    • search of all properties
    • Benchmark (search by GID to get document_id + fetch document by document_id) vs (fetch document by document_id) on 100M nodes + 100M edges
      • Note: DocAddress is composed of two u32 values, but the SegmentOrdinal is tied to the Searcher -> is it possible/wise to cache the address (SegmentId is a UUID)?
        • A searcher per transaction -> cache DocAddress inside Memgraph's ElementAccessors?
  • Implement a stress test that adds to and searches the same index concurrently, plus a large dataset generator.
  • Consider implementing a panic! handler that prevents the outside (host) process from crashing (optionally).
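
The panic! handler item above could be approached with std::panic::catch_unwind, which stops a panic at the wrapper boundary and turns it into an error value. The following is only a minimal sketch under assumptions: WrapperError, guard and risky_search are hypothetical names for illustration and are not part of mgcxx or tantivy.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical error type handed back to the caller instead of a crash.
#[derive(Debug, PartialEq)]
pub struct WrapperError(pub String);

/// Runs `f` and converts a panic into `Err` instead of unwinding further.
pub fn guard<T>(f: impl FnOnce() -> T) -> Result<T, WrapperError> {
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|payload| {
        // Panic payloads are usually a `&'static str` or a `String`.
        let msg = if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        };
        WrapperError(msg)
    })
}

/// Hypothetical stand-in for a search call that may panic internally.
pub fn risky_search(query: &str) -> usize {
    if query.is_empty() {
        panic!("empty search query");
    }
    query.len()
}
```

Wrapping each exported function body in something like guard(|| ...) would keep a panic from reaching the C++ caller. Note that catch_unwind only works when the crate is built with unwinding panics; with panic = "abort" the process still terminates. The default panic hook also still prints the panic message to stderr.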

NOTEs

  • if a field doesn't get specified in the schema, it's ignored
  • TEXT means the field will be tokenized and indexed (required to be able to search)
  • Tantivy add_json_object accepts serde_json::map::Map<String, serde_json::value::Value>
  • C++ text-search API is snake case because it's implemented in Rust
  • Committing (writing to disk) after every single document will be expensive. In a standard OLTP workload that's the common case -> introduce some form of batching.
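
The batching idea from the last note can be sketched as a thin layer that buffers writes and commits once every N documents. This is an illustration only: IndexSink, BatchingWriter and CountingSink are hypothetical names, and the trait is a simplified stand-in rather than tantivy's actual writer API.

```rust
/// Hypothetical stand-in for an index writer; not tantivy's actual API.
pub trait IndexSink {
    fn add(&mut self, doc: String);
    fn commit(&mut self);
}

/// Forwards documents to the sink and commits once per `batch_size` documents.
pub struct BatchingWriter<S: IndexSink> {
    sink: S,
    batch_size: usize,
    pending: usize,
}

impl<S: IndexSink> BatchingWriter<S> {
    pub fn new(sink: S, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        Self { sink, batch_size, pending: 0 }
    }

    /// Adds a document; commits only when the batch is full.
    pub fn add(&mut self, doc: String) {
        self.sink.add(doc);
        self.pending += 1;
        if self.pending >= self.batch_size {
            self.flush();
        }
    }

    /// Commits any uncommitted documents.
    pub fn flush(&mut self) {
        if self.pending > 0 {
            self.sink.commit();
            self.pending = 0;
        }
    }

    /// Flushes the remainder and returns the underlying sink.
    pub fn into_inner(mut self) -> S {
        self.flush();
        self.sink
    }
}

/// In-memory sink that counts commits, for demonstration only.
#[derive(Default)]
pub struct CountingSink {
    pub docs: Vec<String>,
    pub commits: usize,
}

impl IndexSink for CountingSink {
    fn add(&mut self, doc: String) {
        self.docs.push(doc);
    }
    fn commit(&mut self) {
        self.commits += 1;
    }
}
```

With a batch size of 3, adding 7 documents and then calling into_inner results in 3 commits instead of 7. The tradeoff is that documents added since the last commit are neither durable nor visible to searches until the next flush, so a real implementation would likely also flush on a timer or at transaction boundaries.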

Resources

License: MIT License