ScaleStore

This is the source code for our (Tobias Ziegler, Carsten Binnig and Viktor Leis) published paper at SIGMOD’22: ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. Paper can be found here: Paper Link ACM or Paper Link PDF

Abstract

In this paper, we propose ScaleStore, a novel distributed storage engine that exploits DRAM caching, NVMe storage, and RDMA networking to achieve high performance, cost-efficiency, and scalability at the same time. Using low latency RDMA messages, ScaleStore implements a transparent memory abstraction that provides access to the aggregated DRAM memory and NVMe storage of all nodes. In contrast to existing distributed RDMA designs such as NAM-DB or FaRM, ScaleStore integrates seamlessly with NVMe SSDs, lowering the overall hardware cost significantly. The core of ScaleStore is a distributed caching strategy that dynamically decides which data to keep in memory (and which on SSDs) based on the workload. The caching protocol also provides strong consistency in the presence of concurrent data modifications. In our YCSB-based evaluation, we show that ScaleStore can provide high performance for various types of workloads (read/write-dominated, uniform/skewed) even when the data size is larger than the aggregated memory of all nodes. We further show that ScaleStore can efficiently handle dynamic workload changes and support elasticity.

Citation

@inproceedings{DBLP:conf/sigmod/0001BL22,
  author    = {Tobias Ziegler and
               Carsten Binnig and
               Viktor Leis},
  title     = {ScaleStore: {A} Fast and Cost-Efficient Storage Engine using DRAM,
               NVMe, and {RDMA}},
  booktitle = {{SIGMOD} '22: International Conference on Management of Data, Philadelphia,
               PA, USA, June 12 - 17, 2022},
  pages     = {685--699},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3514221.3526187},
  doi       = {10.1145/3514221.3526187}
}

Setup

Cluster Setup

All experiments were conducted on a 5-node cluster running Ubuntu 18.04.1 LTS, with Linux 4.15.0 kernel. Each node is equipped with two Intel(R) Xeon(R) Gold 5120 CPUs (14 cores), 512 GB main-memory split between both sockets, and four Samsung SSD 980 Pro M.2 1 TB connected via PCIe by one ASRock Hyper Quad M.2 PCIe card. The nodes of the cluster are connected with an InfiniBand network using one Mellanox ConnectX-5 MT27800 NICs (InfiniBand EDR 4x, 100 Gbps) per node.

Mellanox RDMA

We used the following Mellanox OFED installation:

ofed_info

MLNX_OFED_LINUX-5.1-2.5.8.0 (OFED-5.1-2.5.8):
Installed Packages:
-------------------
ii  ar-mgr                                        1.0-0.3.MLNX20200824.g8577618.51258     amd64        Adaptive Routing Manager
ii  dapl2-utils                                   2.1.10.1.mlnx-OFED.51258                amd64        Utilities for use with the DAPL libraries
ii  dpcp                                          1.1.0-1.51258                           amd64        Direct Packet Control Plane (DPCP) is a library to use Devx
ii  dump-pr                                       1.0-0.3.MLNX20200824.g8577618.51258     amd64        Dump PathRecord Plugin
ii  hcoll                                         4.6.3125-1.51258                        amd64        Hierarchical collectives (HCOLL)
ii  ibacm                                         51mlnx1-1.51258                         amd64        InfiniBand Communication Manager Assistant (ACM)
ii  ibdump                                        6.0.0-1.51258                           amd64        Mellanox packets sniffer tool
ii  ibsim                                         0.9-1.51258                             amd64        InfiniBand fabric simulator for management
ii  ibsim-doc                                     0.9-1.51258                             all          documentation for ibsim
ii  ibutils2                                      2.1.1-0.126.MLNX20200721.gf95236b.51258 amd64        OpenIB Mellanox InfiniBand Diagnostic Tools
ii  ibverbs-providers:amd64                       51mlnx1-1.51258                         amd64        User space provider drivers for libibverbs
ii  ibverbs-utils                                 51mlnx1-1.51258                         amd64        Examples for the libibverbs library
ii  infiniband-diags                              51mlnx1-1.51258                         amd64        InfiniBand diagnostic programs
ii  iser-dkms                                     5.1-OFED.5.1.2.5.3.1                    all          DKMS support fo iser kernel modules
ii  isert-dkms                                    5.1-OFED.5.1.2.5.3.1                    all          DKMS support fo isert kernel modules
ii  kernel-mft-dkms                               4.15.1-100                              all          DKMS support for kernel-mft kernel modules
ii  knem                                          1.1.4.90mlnx1-OFED.5.1.2.5.0.1          amd64        userspace tools for the KNEM kernel module
ii  knem-dkms                                     1.1.4.90mlnx1-OFED.5.1.2.5.0.1          all          DKMS support for mlnx-ofed kernel modules
ii  libdapl-dev                                   2.1.10.1.mlnx-OFED.51258                amd64        Development files for the DAPL libraries
ii  libdapl2                                      2.1.10.1.mlnx-OFED.51258                amd64        The Direct Access Programming Library (DAPL)
ii  libibmad-dev:amd64                            51mlnx1-1.51258                         amd64        Development files for libibmad
ii  libibmad5:amd64                               51mlnx1-1.51258                         amd64        Infiniband Management Datagram (MAD) library
ii  libibnetdisc5:amd64                           51mlnx1-1.51258                         amd64        InfiniBand diagnostics library
ii  libibumad-dev:amd64                           51mlnx1-1.51258                         amd64        Development files for libibumad
ii  libibumad3:amd64                              51mlnx1-1.51258                         amd64        InfiniBand Userspace Management Datagram (uMAD) library
ii  libibverbs-dev:amd64                          51mlnx1-1.51258                         amd64        Development files for the libibverbs library
ii  libibverbs1:amd64                             51mlnx1-1.51258                         amd64        Library for direct userspace use of RDMA (InfiniBand/iWARP)
ii  libibverbs1-dbg:amd64                         51mlnx1-1.51258                         amd64        Debug symbols for the libibverbs library
ii  libopensm                                     5.7.3.MLNX20201102.e56fd90-0.1.51258    amd64        Infiniband subnet manager libraries
ii  libopensm-devel                               5.7.3.MLNX20201102.e56fd90-0.1.51258    amd64        Developement files for OpenSM
ii  librdmacm-dev:amd64                           51mlnx1-1.51258                         amd64        Development files for the librdmacm library
ii  librdmacm1:amd64                              51mlnx1-1.51258                         amd64        Library for managing RDMA connections
ii  mlnx-ethtool                                  5.4-1.51258                             amd64        This utility allows querying and changing settings such as speed,
ii  mlnx-iproute2                                 5.6.0-1.51258                           amd64        This utility allows querying and changing settings such as speed,
ii  mlnx-ofed-kernel-dkms                         5.1-OFED.5.1.2.5.8.1                    all          DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils                        5.1-OFED.5.1.2.5.8.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules
ii  mpitests                                      3.2.20-5d20b49.51258                    amd64        Set of popular MPI benchmarks and tools IMB 2018 OSU benchmarks ver 4.0.1 mpiP-3.3 IPM-2.0.6
ii  mstflint                                      4.14.0-3.51258                          amd64        Mellanox firmware burning application
ii  openmpi                                       4.0.4rc3-1.51258                        all          Open MPI
ii  opensm                                        5.7.3.MLNX20201102.e56fd90-0.1.51258    amd64        An Infiniband subnet manager
ii  opensm-doc                                    5.7.3.MLNX20201102.e56fd90-0.1.51258    amd64        Documentation for opensm
ii  perftest                                      4.4+0.5-1                               amd64        Infiniband verbs performance tests
ii  rdma-core                                     51mlnx1-1.51258                         amd64        RDMA core userspace infrastructure and documentation
ii  rdmacm-utils                                  51mlnx1-1.51258                         amd64        Examples for the librdmacm library
ii  sharp                                         2.2.2.MLNX20201102.b26a0fd-1.51258      amd64        SHArP switch collectives
ii  srp-dkms                                      5.1-OFED.5.1.2.5.3.1                    all          DKMS support fo srp kernel modules
ii  srptools                                      51mlnx1-1.51258                         amd64        Tools for Infiniband attached storage (SRP)
ii  ucx                                           1.9.0-1.51258                           amd64        Unified Communication X

SSD

4x 512 GB main-memory split between both sockets, and four Samsung SSD 980 Pro M.2 1 TB connected via PCIe by one ASRock Hyper Quad M.2 PCIe card. All SSDs are used as block device and organized as a RAID 0 via

sudo mdadm --create /dev/md0 --auto md --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Huge pages

We are using huge pages for the memory buffers:

echo N | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

CMake build

To build ScaleStore we use CMake. First we create a build folder in the top level folder of scalestore:

mkdir build
cd build

Afterwards, we can build the executable with either in debug mode with address sanitizers enabled:

cmake -D CMAKE_C_COMPILER=gcc-10 -D CMAKE_CXX_COMPILER=g++-10 -DCMAKE_BUILD_TYPE=Debug -DSANI=On .. && make -j

or in release mode:

cmake -D CMAKE_C_COMPILER=gcc-10 -D CMAKE_CXX_COMPILER=g++-10 -DCMAKE_BUILD_TYPE=Release .. && make -j

Libraries

gflags
lib_aio
ibverbs
tabulate
rdma cm

Run executable

All executables can be found in scalestore/build/frontend. For instance, the follwoing command can be used to run ycsb in a single node setup:

make -j && numactl --membind=0 --cpunodebind=0  ./ycsb -ownIp=172.18.94.80 -nodes=1 -YCSB_all_workloads -worker=20 -YCSB_tuple_count=1000000000 -dramGB=150 -csvFile=singlenode_oom_scalestore_ycsb_zipf.csv  -YCSB_run_for_seconds=60 -ssd_path=/dev/md0 --ssd_gib=400 -pageProviderThreads=4 -YCSB_all_zipf

Configuration

The main configuration file in order to execute ScaleStore can be found in shared-headers/Defs.hpp.

IPs

To configure the servers and their ips the following configuration needs to be adapted:

const std::vector<std::vector<std::string>> NODES{
    {""},                                                                                              // 0 to allow direct offset
    {"172.18.94.80"},                                                                                  // 1
    {"172.18.94.80", "172.18.94.70"},                                                                  // 2
    {"172.18.94.80", "172.18.94.70", "172.18.94.10"},                                                  // 3
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20"},                                  // 4
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20", "172.18.94.40"},                  // 5
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20", "172.18.94.40", "172.18.94.30"},  // 6
};

CPU Cores

We implemented a very simple CoreManager which can be found in (scalestore/backend/threads/CoreManager.hpp). All configurations are hard-coded to fit our servers (2 NUMA nodes) and might need to be adapted to fit yours.

Gflags help

Besides the Defs.hpp file there are gflags parameters. Most of them are stored in backend/ScaleStore/Config.hpp. However, some are attached to the main executable file, e.g. ycsb has the YCSB_tuple_count flag. To see all (custom) gflags parameters and their description one can run:

./exe --help

Paper Benchmarks

The paper benchmark implementations can be found in frontend/ycsb. The distributed experiment runner scripts can be found in distexperiments/experiments. In order to run them please consult the following github page: https://github.com/mjasny/distexprunner

Benchmark Runners

YCSB runner
OLAP scan queries

Tests

consistency checks
TPC-C consistency checks

Known Issues/Bugs

Startup

If you see the following exception at the startup of ScaleStore:

"Consider adjusting BATCH_SIZE and PARTITIONS"
in /home/tziegler/ScaleStore/backend/scalestore/storage/buffermanager/Buffermanager.cpp:62

You would need to change the PARTITIONS and BATCH_SIZE variable in the Defs.hpp file. The reason is that we use a partitioned queue of batches to reduce contention in the free lists and accesses to the latch. To calculate the right number of batches per partition we use.

NUMBER_BATCHES = (DRAM_SIZE / PAGE_SIZE) / PARTITIONS / BATCH_SIZE

Therefore, this may be needed if the DRAM_SIZE is too small or the page size has been changed.

DataManagementLab / ScaleStore