cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

db: flush/compact memtables to in-memory sstables

jbowens opened this issue

For various reasons, we may accumulate many memtables within the flushable queue:

  • With the current design for #3230, a temporary disk stall may cause unbounded growth of queued memtables: the stalled disk prevents flushes while writes continue.
  • In stores with high write throughput and high flush utilization, multiple memtables may queue while memtables lower in the flushable queue are still being flushed.
  • In #1421 we consider deliberately postponing flushes to allow more memtables to queue. This improves the shape of L0 sublevels and reduces write amplification by allowing additional data to be elided before it is written to L0 (e.g., raft log truncation, intent resolution, and overwritten expiration leases can all significantly reduce the volume of data that reaches L0, and the more data batched within a flush, the larger the benefit).
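To illustrate why batching more memtables into one flush lets more data be elided, here is a minimal sketch in Go. The `entry` type and `flushTogether` function are hypothetical and are not Pebble's internal key format or flush code. The sketch assumes unique sequence numbers and no open snapshot that needs an older version, and it models only overwrites, not tombstones or range deletions.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical, simplified memtable record.
// It is NOT Pebble's internal key representation.
type entry struct {
	key   string
	seq   uint64 // larger sequence number means newer write
	value string
}

// flushTogether merges the entries of several queued memtables and keeps
// only the newest version of each key, sorted by key. It assumes that no
// snapshot requires an older version to be preserved.
func flushTogether(memtables ...[]entry) []entry {
	newest := make(map[string]entry)
	for _, m := range memtables {
		for _, e := range m {
			if cur, ok := newest[e.key]; !ok || e.seq > cur.seq {
				newest[e.key] = e
			}
		}
	}
	out := make([]entry, 0, len(newest))
	for _, e := range newest {
		out = append(out, e)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].key < out[j].key })
	return out
}

func main() {
	// Two queued memtables; the lease key is overwritten in the second one.
	m1 := []entry{{"lease/r1", 1, "expires=10"}, {"raft/r1/1", 2, "entry-1"}}
	m2 := []entry{{"lease/r1", 3, "expires=20"}, {"lease/r1", 5, "expires=30"}}

	separate := len(flushTogether(m1)) + len(flushTogether(m2))
	together := len(flushTogether(m1, m2))
	fmt.Println("entries written when flushed separately:", separate)
	fmt.Println("entries written when flushed together:", together)
}
```

Flushing each memtable on its own can only elide versions that are shadowed within that memtable, whereas a combined flush can also drop versions shadowed by a later memtable in the queue.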

However, memtables use significant memory. Keys and values are stored verbatim and uncompressed, even though we expect roughly half of writes (the raft log) never to be read. Each KV pair also carries at least 32 bytes of arenaskl.Node overhead, plus additional overhead for nodes whose skiplist towers exceed the minimum height.
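As a rough back-of-the-envelope sketch of that overhead, the Go snippet below computes a lower bound on memtable memory from the 32-byte per-node figure above. The function names and the example key and value sizes are hypothetical, and the estimate ignores taller skiplist towers, alignment, and arena slack.

```go
package main

import "fmt"

// nodeOverhead is the minimum per-entry arenaskl.Node overhead cited in
// the issue text. Taller skiplist towers add more; this ignores them.
const nodeOverhead = 32

// memtableBytes returns a lower-bound estimate of the memory used by n
// entries with the given average key and value lengths (in bytes).
func memtableBytes(n, avgKeyLen, avgValueLen int) int {
	return n * (nodeOverhead + avgKeyLen + avgValueLen)
}

// overheadFraction returns the share of each entry's footprint that is
// fixed node overhead rather than user key/value bytes.
func overheadFraction(avgKeyLen, avgValueLen int) float64 {
	return float64(nodeOverhead) / float64(nodeOverhead+avgKeyLen+avgValueLen)
}

func main() {
	// Hypothetical workload: one million entries, 20-byte keys, 50-byte values.
	n, k, v := 1_000_000, 20, 50
	fmt.Printf("estimated bytes: %d\n", memtableBytes(n, k, v))
	fmt.Printf("fixed overhead share: %.1f%%\n", 100*overheadFraction(k, v))
}
```

For small keys and values the fixed overhead dominates: with 16-byte keys and 16-byte values, half of each entry's estimated footprint is node overhead.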

@petermattis recently suggested flushing memtables to in-memory sstables. The sstable representation stores the same data more compactly because it reduces the fixed per-key overhead and takes advantage of prefix compression of keys within blocks. If the data blocks are also compressed and the data is compressible, the data is stored even more compactly. The tradeoff is that reads that must consult the table may need to duplicate some of the data, uncompressed, within the block cache. When it comes time to durably flush the state to L0, the sstable(s) may also be copied verbatim to storage. We could additionally consider "compactions" of memtables and in-memory sstables, further delaying the eventual write to L0.
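The following Go sketch shows the general idea behind prefix compression of sorted keys within a block: each key is stored as the number of bytes it shares with the previous key plus its remaining suffix, with a full key stored at periodic restart points. This is a simplified illustration in the style of LevelDB-family block formats, not Pebble's actual block encoder; `prefixEntry`, `encode`, and `decode` are hypothetical names, and the keys are made up.

```go
package main

import "fmt"

// prefixEntry stores a key as the length of the prefix it shares with the
// previous key, plus the remaining suffix bytes.
type prefixEntry struct {
	shared int
	suffix string
}

// sharedPrefixLen returns the length of the common prefix of a and b.
func sharedPrefixLen(a, b string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	for i < n && a[i] == b[i] {
		i++
	}
	return i
}

// encode prefix-compresses sorted keys. Every restartInterval-th key is
// stored in full so that a reader could seek without decoding the whole
// block. restartInterval must be greater than zero.
func encode(keys []string, restartInterval int) []prefixEntry {
	out := make([]prefixEntry, 0, len(keys))
	prev := ""
	for i, k := range keys {
		shared := 0
		if i%restartInterval != 0 {
			shared = sharedPrefixLen(prev, k)
		}
		out = append(out, prefixEntry{shared: shared, suffix: k[shared:]})
		prev = k
	}
	return out
}

// decode reverses encode by rebuilding each key from the previous one.
func decode(entries []prefixEntry) []string {
	out := make([]string, 0, len(entries))
	prev := ""
	for _, e := range entries {
		k := prev[:e.shared] + e.suffix
		out = append(out, k)
		prev = k
	}
	return out
}

func main() {
	keys := []string{"raft/r1/log/0001", "raft/r1/log/0002", "raft/r1/log/0003"}
	entries := encode(keys, 16)

	raw, stored := 0, 0
	for i, k := range keys {
		raw += len(k)
		stored += len(entries[i].suffix)
	}
	fmt.Println("raw key bytes:", raw)
	fmt.Println("suffix bytes stored:", stored)
	fmt.Println("round trip:", decode(entries))
}
```

A real encoder also stores the shared and unshared lengths (typically as varints), so the saving is somewhat smaller than the suffix byte count alone suggests. Keys with long common prefixes, such as consecutive raft log entries, benefit the most.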