yuyoyuppe / ckvs

concurrent KV-store with a B+-tree index

Building

Currently the library has only been tested on Win10 with the msvc v142 x64/x86 toolsets.

  • $ git clone https://github.com/yuyoyuppe/ckvs/
  • $ cd ckvs
  • $ git submodule update --init --recursive

Note: <arch> in the following lists could be x86 or x64.

Windows

  • Get premake5 and add it to your PATH
  • Open msvc developer prompt
  • $ vcpkg install boost:<arch>-windows
  • $ premake5 vs2017 // no vs2019 support yet :<
  • $ cd build
  • $ msbuild ckvs.sln /p:Configuration=Release /p:Platform=<arch> /p:PlatformToolset=v142
  • $ cd bin_<arch>

Linux

  • Get premake5 and add it to your PATH
  • Install boost (aptitude/pacman)
  • $ premake5 gmake2
  • $ cd build
  • $ make
  • $ cd bin_<arch>

Now you can run property_tests. Note that you can change the temporary storage paths to test on different drive types.

Design

Requirements

  • Concurrent hierarchical key-value store supporting uint32, uint64, float, double, string and blob types, capable of handling TBs of data
  • Blobs and strings can be of arbitrary length
  • CRUD operations on key-value pairs (currently insert/remove/find)
  • Key-expiration support
  • x86/x64 support
  • C++17

B+-tree

Intro

The B+-tree serves as an index, since it guarantees a very small number of random lookups on an HDD. Therefore, even when the number of keys is huge, e.g. for a B+-tree of order 1024 with 100 million int64 keys, only <1% (~3500) of the nodes are non-leaf nodes, meaning that we can always keep them in RAM with a tiny footprint of ~11MB.

Concurrency

There have been many research papers on concurrent B*-like trees, and an excellent overview is available here. Some of the newer lock-free variants require a nontrivial amount of effort to implement and can yield non-optimal results for on-disk DBs, see figures 13-14. The classic choice is to use a lock-coupling scheme: perform the "crab-walk" node-locking while traversing the tree, releasing the safe nodes above. For example, to insert some value X into a tree of height 2:

                               Root A 
                              +-------------------------+
                              | 1. Write-lock A         |
                              | 2. select appr. child   |
                              +------+------------------+
       Internal B                    |
      +-------------------------+    |
      | 3. Write-lock B         +<---+
      | 4. if safe, unlock A    |
      +-----------------------+-+
                              |
                              v Leaf C
                           +--+----------------------+
                           | 5. Write-lock C         |
                           | 6. If safe, unlock A, B |
                           +-------------------------+


We can also optimistically assume that node C won't overflow after the insertion and take read-locks along the way. However, if that assumption turns out to be wrong, we have to restart the traversal from the root, this time taking write-locks.
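For illustration, here's a minimal C++ sketch of the pessimistic lock-coupling descent using std::shared_mutex as the per-node latch. The node layout and helper names are invented for the example and are not ckvs's actual types.

#include <cstddef>
#include <cstdint>
#include <memory>
#include <shared_mutex>
#include <vector>

constexpr size_t max_keys = 4; // tiny order, just for illustration

struct node {
  std::shared_mutex                  latch;
  bool                               is_leaf = true;
  std::vector<uint64_t>              keys;
  std::vector<std::unique_ptr<node>> children; // children.size() == keys.size() + 1 for internals

  bool safe_for_insert() const noexcept { return keys.size() < max_keys; }

  node * child_for(const uint64_t key) noexcept
  {
    size_t i = 0;
    while(i < keys.size() && key >= keys[i])
      ++i;
    return children[i].get();
  }
};

// Pessimistic descent: write-lock every node on the path, releasing all
// ancestors as soon as the freshly locked node is "safe" (cannot split).
// Returns the still-locked path; the last element is the target leaf.
std::vector<node *> descend_for_insert(node * root, const uint64_t key)
{
  std::vector<node *> held;
  root->latch.lock();
  held.push_back(root);
  for(node * cur = root; !cur->is_leaf;)
  {
    node * child = cur->child_for(key);
    child->latch.lock();     // "crab walk": grab the child before letting go of the parents
    if(child->safe_for_insert())
    {
      for(node * n : held)
        n->latch.unlock();   // every ancestor is now unaffected by this insert
      held.clear();
    }
    held.push_back(child);
    cur = child;
  }
  return held; // caller inserts into held.back(), then unlocks everything still in `held`
}

The optimistic variant would take latch.lock_shared() on the way down and only restart with exclusive locks if the leaf turns out to be full.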

Implementation notes

  • Using an appropriate locking primitive is important, e.g. locking boost::upgrade_mutex is ~10x slower than std::shared_mutex (msvc v142, Win 10, i7-6700), but the latter requires additional support
  • Distributing the children of two nodes without allocating an additional node of greater arity requires very careful index manipulation
  • Recursing along the height is fine, since the height is far too small to cause a stack overflow
  • Storing parent handles would potentially require updating thousands of nodes on rebalancing, so instead we track them during each root-to-leaf traversal
  • Unionizing/templating the node type instead of using virtual dispatch allows heavy inlining, zero-copy serialization and policy customization (see the sketch below)
  • To implement slotted values that can be loaded/saved without a (de)serialization phase, we need a new concept: an assign/compare proxy, which the bptree obtains from the locked_node_t interface. It then performs all operations on _slots/_keys through that proxy. The proxies hold a slotted_storage pointer (to be able to transfer the actual data between pages) and a kv-store pointer, so they can request additional pages on overflow
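As a rough illustration of the unionized/templated-node idea (not ckvs's actual node<> definition; names and field widths are assumptions), a compile-time config baked into the node type keeps every accessor inlinable and the whole struct trivially copyable:

#include <cstddef>
#include <cstdint>
#include <type_traits>

enum class node_kind : uint8_t { Internal, Leaf, Root, RootLeaf };

// Compile-time policy instead of virtual dispatch: order, key and handle types
// are baked into the node layout, so accessors inline and the whole struct can
// be copied to/from a page byte-for-byte.
template <typename Config>
struct node
{
  using key_t    = typename Config::key_t;
  using handle_t = typename Config::handle_t; // child page id for internals, value slot for leaves

  node_kind kind;
  uint16_t  n_keys;
  key_t     keys[Config::order];
  handle_t  handles[Config::order + 1];

  bool is_leaf() const noexcept { return kind == node_kind::Leaf || kind == node_kind::RootLeaf; }
  bool has_capacity() const noexcept { return n_keys < Config::order; }
};

struct default_config
{
  static constexpr size_t order = 1022;
  using key_t    = uint64_t;
  using handle_t = uint64_t;
};

// No vtable pointer and no hidden state: safe to memcpy into a page buffer.
static_assert(std::is_trivially_copyable_v<node<default_config>>);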

OS interaction

IO

To minimize IO overhead, B-tree nodes and the various metadata structures are all sized as multiples of the system page size and are manipulated asynchronously through an event loop. While memory-mapped IO seems like an obvious first choice for the task, it lacks explicit control over flushing (e.g. PrefetchVirtualMemory is only a suggestion, lacks an Evict counterpart, is a Windows-only API, etc.). Since the software knows exactly which pages should be flushed and loaded, it's reasonable to disable OS file caching and use direct IO. The awesome llfio library is used to abstract platform details like OVERLAPPED etc.
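As a rough idea of what llfio hides on Windows (a hand-rolled sketch, not how ckvs actually issues IO), direct overlapped writes boil down to opening the file unbuffered and passing explicit offsets:

#include <windows.h>
#include <cstdint>

// Unbuffered ("direct") overlapped IO: the OS file cache is bypassed, so every
// transfer must be sector-aligned and a multiple of the sector size -- sizing
// pages as multiples of the system page size satisfies this.
HANDLE open_direct(const wchar_t * path)
{
  return ::CreateFileW(path,
                       GENERIC_READ | GENERIC_WRITE,
                       0,
                       nullptr,
                       OPEN_ALWAYS,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED,
                       nullptr);
}

// Issue an asynchronous page write at an explicit file offset.
bool write_page_async(HANDLE file, const void * page, const uint32_t page_size,
                      const uint64_t offset, OVERLAPPED & ov)
{
  ov.Offset     = static_cast<DWORD>(offset & 0xffffffffu);
  ov.OffsetHigh = static_cast<DWORD>(offset >> 32);
  if(::WriteFile(file, page, page_size, nullptr, &ov))
    return true;                               // completed synchronously
  return ::GetLastError() == ERROR_IO_PENDING; // queued; completion delivered via the event loop
}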

Memory management, serialization

Strict memory budgeting and preallocating everything beforehand alleviates most allocation problems at the cost of reduced flexibility. However, having data structures that are tweakable at compile time makes it possible to generate the required permutation set and hot-load it later.

To keep binary compatibility between stores on x86/x64, all arithmetic types are named explicitly, serialized structures do not contain any pointers and are carefully aligned. Endianness is also statically asserted. There's a peculiar issue of type punning leading to undefined behavior in C++, which can be alleviated in various ways. Keeping the structures is_trivially_copyable/constructible and reinterpret_cast'ing them from a standard-blessed type seems to be the way to go.
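A minimal sketch of this discipline (the page_header struct and its fields are invented for illustration): fixed-width members only, compile-time layout checks, and a well-defined read out of a raw page buffer. The memcpy shown here is one common alternative to the reinterpret_cast approach mentioned above; compilers elide the copy.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Fixed-width members only, no pointers: the same bytes mean the same thing
// on x86 and x64 builds.
struct page_header
{
  char     magic[4]; // "ckvs"
  uint32_t page_kind;
  uint64_t payload_size;
};

static_assert(std::is_trivially_copyable_v<page_header>);
static_assert(std::is_trivially_default_constructible_v<page_header>);
static_assert(sizeof(page_header) == 16 && alignof(page_header) == 8);

// Reading the struct out of a raw buffer via memcpy avoids the type-punning UB
// of dereferencing a reinterpret_cast'ed pointer directly.
page_header read_header(const std::byte * page)
{
  page_header h;
  std::memcpy(&h, page, sizeof h);
  return h;
}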

Page cache

  • Hash table of fixed capacity with bucket-level spinlocking (see the sketch below)
  • Page descriptors are linked into per-bucket lists; their lifetime is managed by a lock-free stack
  • Hybrid LRU/random replacement policy
  • Need to be careful not to choose a minuscule capacity, to avoid livelocking (or add callback cleanup from page_lock)
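A toy sketch of the fixed-capacity bucket scheme, assuming a power-of-two bucket count and spinning on std::atomic_flag; the descriptor fields and the find() logic are simplified and not the actual ckvs page cache:

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

struct page_descriptor
{
  uint64_t          page_id = 0;
  page_descriptor * next    = nullptr; // intrusive bucket chain
  std::atomic<int>  pin_count{0};
  // ... dirty flag, pointer to the page frame, etc.
};

// Fixed capacity: buckets never rehash, so a lookup only needs the owning
// bucket's spinlock.
template <size_t BucketCount>
struct page_cache
{
  struct bucket
  {
    std::atomic_flag  lock = ATOMIC_FLAG_INIT;
    page_descriptor * head = nullptr;
  };
  std::array<bucket, BucketCount> buckets;

  page_descriptor * find(const uint64_t page_id)
  {
    bucket & b = buckets[page_id & (BucketCount - 1)];      // power-of-two bucket count assumed
    while(b.lock.test_and_set(std::memory_order_acquire))   // spin
      ;
    page_descriptor * d = b.head;
    while(d && d->page_id != page_id)
      d = d->next;
    if(d)
      d->pin_count.fetch_add(1, std::memory_order_relaxed); // pin before releasing the bucket
    b.lock.clear(std::memory_order_release);
    return d;
  }
};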

Page formats

Index Page

Always at offset 0 in the physical file. Pinned in the page cache. Can be partially mounted (referenced) into some virtual store.

magic sig, 4 bytes
   +      
   v      
+--+--------------------------------------+
| ckvs | name to page id b+-tree root map |
+-----------------------------------------+
             ^
             +
linearized tree of b+-tree roots, <rest of the page> bytes
handles folder structure, e.g.:
- parent_dir
|_ child_ckvs1 -> {tree_root pageID X}
|_ child_ckvs2 -> {tree root pageID Y}
|__|_grandchild_ckvs3 -> {tree root pageID Z}
...

Free Page Stack Page

When a new page is required or a cleaned one needs releasing, the free page stack pages come to the rescue! They're located at fixed offsets in the file, specified by the extend_granularity setting. A structural sketch follows the diagram below.

number of free page IDs in this page, sizeof(least_unsigned_t<max_pages>) bytes
   +      
   v      
+--+--------------------------------------------------+
|  N | free_page_id1, free_page_id2,... free_page_idN |
+----------------------------+------------------------+
                             ^
        free page ids array, <rest of the page> bytes
                 
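A structural sketch matching the diagram (the field widths and page size are assumptions, and uint32_t stands in for least_unsigned_t<max_pages>):

#include <cstddef>
#include <cstdint>
#include <iterator>

constexpr size_t page_size = 16 * 1024;
using page_id_t            = uint32_t; // stand-in for least_unsigned_t<max_pages>

struct free_page_stack_page
{
  page_id_t n_free;                                          // number of IDs currently stored
  page_id_t ids[(page_size - sizeof(page_id_t)) / sizeof(page_id_t)];

  bool push(const page_id_t freed)                           // a page was released
  {
    if(n_free == std::size(ids))
      return false;                                          // this stack page is full
    ids[n_free++] = freed;
    return true;
  }
  bool pop(page_id_t & out)                                  // a new page is required
  {
    if(n_free == 0)
      return false;                                          // fall through to the next stack page
    out = ids[--n_free];
    return true;
  }
};

static_assert(sizeof(free_page_stack_page) == page_size);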

B+-tree Node Page

Leaf/Root, 1 byte
   |   # Keys,     If type bit mask > 10 -> not actual key/value, but a slot #id in 
                  the slotted storage, else -> actual key/value, order * 8 bytes each
   |   2 bytes     |       | 
   v       v       v       v
+--+-------+-------+-------+---+
| Kind | nKeys | Keys | Values | <- serialized node<> struct
+------------------------------+
        +                   /--------------------------slotted storage-----------------------\
        v                   |          2 bytes -v  2 bytes end_offset + 2 bytes size each    |
+-----------------------------------------------+-------------+--------V---------------------+
| B+-tree node | node types | slots_end_offset | nUsed slots  | slot descS | slot values     |
+---------------------+--------------+------------------------+ ------------------+----------+
  type bit mask, -----^              ^               binary/string slot values ---^          
 4 bits per key                      |
  00: uint64             offset from the page end at which to prepend new slot value, 2 bytes
  01: float
  10: double
  11: blob
 100: string
101X: overflown string/blob -> value in the slotted page is overflow_page_id + slot_id
111X: overflown huge string/blob -> value is the Huge Type, value in the storage = huge_page_id
                                                                                 + size in pages

Overflow Page

Used when a string/blob is smaller than page_size but can't fit in the embedded storage on the B+-tree Node Page. Created on demand: a Node Page's mutators first check at most overflow_tolerance existing overflow pages referenced from the embedded slotted storage, and allocate a new one if no slot could be obtained.

+-----------------------------------------------------------------------------------------------+
| same slotted storage structure as on B+-tree Node Page, but with a capacity of an entire page |
+-----------------------------------------------------------------------------------------------+

Huge Page

+---------------------------------------------------------------------------------------------+
| contiguous binary data, bulk-allocated from the Free Page Stack, so max size                |
|  of huge binary object = sizeof(page) * (extend_granularity - 1)                            |
+---------------------------------------------------------------------------------------------+

Expiration design

  1. Extend the type mask to also indicate whether the key/value in the slotted storage is appended with a timestamp
  • pros: smaller size overhead when the % of expiring values is small
  • cons: we're forced to lazily remove expired values and need a GC to compact
  2. A separate B+-tree: keys are timestamps, values are the ordinary keys.
  • pros: fast traversal from leaf to leaf (since leaves store a next-sibling pointer)
  • cons: potentially big overhead from duplicating huge keys

Perf tests

  • All times are wall clock time.
  • Unless specified otherwise, in all tests below the B+-tree order = 1022, page size = 16KB, and all keys/values are randomly shuffled uint64_t
  • Compiled without ASSERTS define at solution scope
  • Using DDR4 2133 RAMdisk. Peak witnessed throughput = ~600MB/s, typical = ~350MB/s
  • Note that having 10M+ values in a single physical store seems to choke it at page_cache_capacity = 32768 (should this be the case? investigate why - cache thrashing/bad page settings/too much memcpying etc.) and much earlier with an even smaller capacity

Synthetic paged_file vs single-threaded ofstream random write test

Writing ~10K pages of size 16KB, async has 8 threads:
===HDD==
async completed in 1.976 sec
ofstream completed 33.468 sec
===SSD==
async completed in 1.206 sec
ofstream completed 0.784 sec
===RAM==
async completed in 1.133 sec
ofstream completed 0.541 sec

Insert 10M, remove 5M, 8 threads, query another 5M, page_cache_capacity = 32768

Insert 10M uint64 key/values, split randomly across 8 threads while concurrently removing odd keys, then save to a file (in RAM), reopen it and find all even keys in 1 thread:

$ property_tests.exe
OK -- 14.174sec

Remove 5M out of 5M, hot cache, page_cache_capacity = 32768

removed in 6.925 sec

Remove 5M out of 5M, cold cache, page_cache_capacity = 32768

removed in 7.321 sec

Query 1M from a cold cache, 8 threads, page_cache_capacity = 8192

Running test #0 ...
queried in 0.78 sec
OK -- 2.146sec
^ including single-threaded insert before the query threads are launched

Query 1M from a cold cache, 8 threads, page_cache_capacity = 32768

queried in 0.74 sec
OK -- 2.514sec

Query 5M from a cold cache, 8 threads, page_cache_capacity = 32768

queried in 4.951 sec
OK -- 13.751sec

Query 10M from a cold cache, 8 threads, page_cache_capacity = 32768

$ property_tests.exe
queried in 9.953 sec

Query 20M from a cold cache, 8 threads, page_cache_capacity = 8192

$ property_tests.exe
queried in 658.798 sec
OK -- 1152.19sec

Query 20M from a cold cache, 8 threads, page_cache_capacity = 32768

$ property_tests.exe
Running test #0 ...
queried in 80.643 sec
OK -- 121.129sec

todo: more? mixed load tests? 80/20? bigger cache sizes (relax the power-of-2 page_cache capacity restriction)?

License: MIT