chainone / distributed-data-system-comparison

A detailed comparison among distributed databases/message queues

Preface

While reading the book "Designing Data-Intensive Applications", I was trying to understand the essence of distributed data systems: which techniques are commonly applied in the data systems that we (as developers) interact with every day, and what their pros and cons are.

Though the book mentions many of those data systems, there is no central place that categorizes all the popular distributed data systems by their characteristics. That is why I had the idea of building such a table: it helps me memorize the material, and it may also be valuable to other developers interested in this area.

Caution
The comparison table is neither finished nor guaranteed to be 100% correct; contributions are welcome!

Table Snapshots

Note
Since 1) the table is too wide to be shown properly and 2) GitHub doesn't render adoc very well, table snapshots are placed here for ease of viewing.
Snapshot1
Snapshot2

Detailed comparison table

Data systems compared: MySQL (Master-Slave), HBase, Cassandra, DynamoDB, MongoDB, ElasticSearch, Kafka, RabbitMQ. Cells marked "?" have not been filled in yet.

Replication

Write

  • MySQL (Ma-Sl): Single write to the leader.

  • HBase: Single write to the target region server. The first write needs to look up the location of the meta table in ZooKeeper, then ask the region server that holds the meta table, and then do the actual write; the meta table can be cached in the client.

  • Cassandra: Multi-write; any node can be selected as the coordinator node, which forwards the request to the nodes that hold the target data.

  • DynamoDB: ?

  • MongoDB: Single write to the primary node of the target replica set, then asynchronously replicated to the secondaries; reads from secondaries may return data that does not reflect the state of the data on the primary.

  • ElasticSearch: Write to the primary shard (partition) first, then sync the changed documents to the replica shards asynchronously (? some docs say otherwise).

  • Kafka: The producer (client) uses the "acks" parameter to control how many partition replicas must receive the record before the producer considers the write successful (see the sketch below).

  • RabbitMQ: ?
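As a quick illustration of the Kafka "acks" setting, here is a minimal sketch using the kafka-python client; the broker address and topic name are assumptions for the example, not part of the original table.

[source,python]
----
from kafka import KafkaProducer

# acks="all": the write is considered successful only after all in-sync
# replicas of the partition have received it. acks=1 waits only for the
# leader replica; acks=0 does not wait at all.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    acks="all",
)

# send() returns a future; get() blocks until the broker acknowledges the
# write according to the acks setting above.
future = producer.send("events", b"hello, replicated world")
metadata = future.get(timeout=10)
print(metadata.topic, metadata.partition, metadata.offset)

producer.flush()
producer.close()
----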

Replication Type

  • MySQL (Ma-Sl): Single leader, all data replicated.

  • HBase: Partitioned by key range; writes go to a single region server; replication is handled by HDFS.

  • Cassandra: Leaderless, partitioned; multi-writes to the selected nodes, with read repair when a stale value is detected.

  • DynamoDB: Leaderless.

  • MongoDB: Leader-based within one replica set; all data on the primary needs to be synced to the secondaries.

  • ElasticSearch: Leader-based (primary shard vs replica shards).

  • Kafka: Leader-based (leader replica vs follower replicas). Caution: follower replicas don't serve any requests; they exist merely for backup.

  • RabbitMQ: Leader-based.

Sync/Async Replication

  • MySQL (Ma-Sl): Configurable, default is semi-sync.

  • HBase: Kind of SYNC: write to the WAL (persisted in HDFS) and to the memstore, then return.

  • Cassandra: Configurable: the w in the quorum condition (w + r > n) (see the sketch below).

  • DynamoDB: ?

  • MongoDB: Async.

  • ElasticSearch: Async (? some docs say otherwise).

  • Kafka: Sync; produced messages are considered "committed" once they have been written to the partition on all of its in-sync replicas. However, the producer (client) can control how many partition replicas must receive the record before it considers the write successful.

  • RabbitMQ: Y
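To make the quorum condition concrete, a minimal sketch in plain Python (no driver involved; the replica counts below are made-up example numbers):

[source,python]
----
def quorum_ok(n: int, w: int, r: int) -> bool:
    """Strict quorum: every read overlaps with the latest successful write."""
    return w + r > n

# Typical Cassandra-style setup: replication factor n = 3.
n = 3
print(quorum_ok(n, w=2, r=2))  # True : QUORUM writes + QUORUM reads overlap
print(quorum_ok(n, w=1, r=1))  # False: a read may hit only stale replicas
print(quorum_ok(n, w=3, r=1))  # True : write-all / read-one also overlaps
----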

Failover

  • MySQL (Ma-Sl):
      • Follower: the follower copies the latest snapshot from the leader and then catches up via the binlog.
      • Leader: the follower with the most recent binlog becomes the new leader; the old leader's unreplicated writes are simply discarded.

  • HBase:
      • ZK: if ZooKeeper is down, the whole cluster is down.
      • Master: needs an inactive master as hot backup, otherwise DDL operations will fail.
      • Region servers: the node recovers from the WAL; while it is recovering, that part of the data is not available.

  • Cassandra: NO IMPACT as long as the quorum w + r > n is satisfied.

  • DynamoDB: Should be the same as Cassandra.

  • MongoDB:
      • Primary: an eligible secondary calls for an election to nominate itself as the new primary. The replica set cannot process write operations until the election completes successfully. The replica set can continue to serve read queries if such queries are configured to run on secondaries while the primary is offline (see the sketch below).
      • Secondary: first copies all data from one of the members, then applies all changes by syncing the oplog.

  • ElasticSearch:
      • Master: HA masters; if there is no master, another node is elected as master.
      • Data nodes: if the failed node contains primary shards, one of the replica shards will be promoted (by the master) to primary; otherwise writing to that shard is not allowed.

  • Kafka:
      • Controller: when the controller broker is stopped or loses connectivity to ZooKeeper, its ephemeral node disappears. The other brokers in the cluster are notified through a ZooKeeper watch that the controller is gone and attempt to create the controller node in ZooKeeper themselves. The first node to create the new controller node in ZooKeeper becomes the new controller.
      • Other brokers: for the leader replicas on the failed node, one of the corresponding in-sync replicas is elected as the new leader. If no in-sync replica is available, "unclean.leader.election.enable" controls whether an out-of-sync replica may become the new leader. In summary, if we allow out-of-sync replicas to become leaders, we risk data loss and data inconsistencies; if we don't allow them to become leaders, we face lower availability.

  • RabbitMQ: ?
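To make the MongoDB point above concrete, a minimal sketch using pymongo: with a secondary-tolerant read preference, reads can keep working while a primary election is in progress, at the cost of possibly stale data. The connection string, replica set name, database, and collection names are assumptions for the example.

[source,python]
----
from pymongo import MongoClient, ReadPreference

# Assumed connection string and replica set name.
client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0"
)

db = client.get_database(
    "shop",
    read_preference=ReadPreference.SECONDARY_PREFERRED,
)

# This read can be served by a secondary while the primary is offline and an
# election is still running; it may therefore return stale data.
doc = db.orders.find_one({"status": "pending"})

# Writes, by contrast, require a primary and will fail (or block and time
# out) until the election completes and a new primary has been chosen.
db.orders.insert_one({"status": "pending", "amount": 42})
----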

Multi-leader Replication Topology

  • MySQL: Circular by default (Cluster version; see the sketch below).

  • HBase: NA

  • Cassandra: Circular

  • DynamoDB: ?

  • MongoDB: NA; a secondary (follower) chooses whether to sync the oplog from the primary or from another secondary based on ping time and the state of that secondary's replication.

  • ElasticSearch: NA; the leader (the node holding the primary shard) forwards changed documents to the nodes holding the replicas.

  • Kafka: NA

  • RabbitMQ: Y
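A toy model of a circular (ring) topology, to illustrate how a write travels around the ring and why each write must be tagged with its originating node to stop infinite forwarding. This is a generic sketch, not MySQL- or Cassandra-specific code; all names are made up.

[source,python]
----
class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.next_node = None   # successor in the ring
        self.data = {}

    def write(self, key, value, origin=None):
        """Apply a write locally, then forward it to the next node in the ring."""
        if origin == self.node_id:
            # The write has travelled the full circle; stop forwarding.
            return
        self.data[key] = value
        # Tag the write with the node where it first entered the ring.
        self.next_node.write(key, value, origin if origin is not None else self.node_id)

# Build a 3-node ring: n0 -> n1 -> n2 -> n0
n0, n1, n2 = Node(0), Node(1), Node(2)
n0.next_node, n1.next_node, n2.next_node = n1, n2, n0

n1.write("user:42", "alice")          # a client can write to any node
print(n0.data, n1.data, n2.data)      # all three nodes end up with the value
----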

Replication Logs

  • MySQL (Ma-Sl): Originally STATEMENT-BASED; switches to LOGICAL (row-based) by default if there is any nondeterminism in the statement.

  • HBase: WAL

  • Cassandra: Commit log, just like the WAL in HBase; however, the write does not wait for the write to the in-memory store to finish.

  • DynamoDB: Y

  • MongoDB: Oplog; should be STATEMENT-BASED with transformations.

  • ElasticSearch: No log; shards are copied initially, and then changed documents are forwarded to keep the primary shard and its replicas in sync.

  • Kafka: The topic is actually a partitioned log; brokers that hold follower replicas fetch messages from the brokers that hold the corresponding leader replicas using log offsets, just like clients consuming messages (see the sketch below).

  • RabbitMQ: Y
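To illustrate the "followers fetch by offset, just like consumers" point, a minimal sketch with kafka-python that reads a partition from an explicit offset; the broker address, topic, partition, and starting offset are assumptions.

[source,python]
----
from kafka import KafkaConsumer, TopicPartition

# A follower replica conceptually does the same thing this consumer does:
# ask the leader for all records after the last offset it has replicated.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

tp = TopicPartition("events", 0)   # assumed topic and partition
consumer.assign([tp])
consumer.seek(tp, 1000)            # resume from a known offset, e.g. 1000

records = consumer.poll(timeout_ms=1000)
for batch in records.values():
    for record in batch:
        print(record.offset, record.value)

consumer.close()
----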

Multi-Write Conflict Resolution

  • MySQL (Ma-Sl): NA (all writes are sent to the leader)

  • HBase: NA (writes are region-based, so there is no conflict)

  • Cassandra: LWW (last write wins; see the sketch below)

  • DynamoDB: Y

  • MongoDB: NA (writes are shard/partition based, so there is no conflict)

  • ElasticSearch: NA (writes are shard/partition based, so there is no conflict)

  • Kafka: NA (writes are shard/partition based, so there is no conflict)

  • RabbitMQ: Y
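A minimal sketch of last-write-wins (LWW) merging, the strategy listed above for Cassandra: each value carries a timestamp and the replica simply keeps the newest one. The data structures here are illustrative, not Cassandra internals.

[source,python]
----
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: float  # write timestamp supplied by the client/coordinator

def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Keep whichever write has the larger timestamp; the other is dropped."""
    return a if a.timestamp >= b.timestamp else b

replica_a = Versioned("alice@old.example.com", timestamp=100.0)
replica_b = Versioned("alice@new.example.com", timestamp=105.0)

print(lww_merge(replica_a, replica_b).value)  # alice@new.example.com
# Note the downside: the concurrently written older value is silently lost,
# which is why LWW can lose data under concurrent writes.
----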

Partition

Partitioning Strategy

  • MySQL (Ma-Sl): NA

  • HBase: Key range

  • Cassandra: Hash on the first (partition) key, range on the remaining (clustering) keys

  • DynamoDB: ?

  • MongoDB: Key range before 2.4; both hash and range keys are supported in later versions

  • ElasticSearch: Key hash

  • Kafka: Decided on the producer side on a per-topic basis; the producer can use hash-based partitioning (the default) or implement its own partition strategy. Once chosen, it cannot be changed, which means the number of partitions is fixed no matter how many nodes you add later on (see the sketch below).

  • RabbitMQ: ?
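A minimal sketch of hash-based key partitioning with a fixed partition count, the scheme described above for Kafka's default partitioner and for ElasticSearch. The hash function and partition count are illustrative choices, not the ones those systems actually use.

[source,python]
----
import zlib

NUM_PARTITIONS = 6  # fixed when the topic/index is created

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable (non-randomized) hash."""
    return zlib.crc32(key) % num_partitions

for key in [b"user:1", b"user:2", b"user:3"]:
    print(key, "-> partition", partition_for(key))

# Because the partition count is baked into the formula, changing it would
# remap almost every key; that is why rebalancing moves whole partitions
# between nodes instead of changing the number of partitions.
----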

Secondary Indexes

  • MySQL (Ma-Sl): NA

  • HBase: No secondary index by default

  • Cassandra: Local

  • DynamoDB: Global (term-partitioned)

  • MongoDB: Local

  • ElasticSearch: Local (see the sketch below)

  • Kafka: NA

  • RabbitMQ: ?
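The difference between the local and global entries above, as a toy sketch: with a local (document-partitioned) secondary index, a query on the indexed field has to be scattered to every partition and the results gathered, whereas a global (term-partitioned) index can route the query to a single partition. All names and data below are made up.

[source,python]
----
# Each partition keeps its own secondary index over only its own documents.
partitions = [
    {"docs": {1: {"color": "red"}, 2: {"color": "blue"}}, "index": {"red": [1], "blue": [2]}},
    {"docs": {3: {"color": "red"}},                       "index": {"red": [3]}},
    {"docs": {4: {"color": "blue"}},                      "index": {"blue": [4]}},
]

def query_local_index(color: str):
    """Scatter the query to every partition, then gather the results."""
    results = []
    for p in partitions:                      # every partition must be asked
        for doc_id in p["index"].get(color, []):
            results.append(p["docs"][doc_id])
    return results

print(query_local_index("red"))   # hits partitions 0 and 1
# A global (term-partitioned) index would instead keep all "red" entries on
# one partition, so the query touches a single partition, but writes must
# update index entries that may live on other partitions.
----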

Rebalancing Strategy

  • MySQL (Ma-Sl): NA

  • HBase: Dynamic partitioning

  • Cassandra: Partitioning proportionally to nodes; split partitions are moved between nodes

  • DynamoDB: ?

  • MongoDB: The number of partitions equals the number of replica sets; one partition consists of many 64MB chunks; partitions (shards) can be added later on, and the chunks are then rebalanced across them

  • ElasticSearch: Fixed number of partitions per index; entire partitions are moved between nodes (see the sketch below)

  • Kafka: Manual; the "kafka-reassign-partitions.sh" tool has to be used to rebalance partitions. On the new broker, new replicas are created first and then the old replicas are removed. When removing many partitions from a single broker, such as when that broker is being removed from the cluster, it is a best practice to shut down and restart the broker before starting the reassignment. This moves leadership for the partitions on that broker to other brokers in the cluster, which can significantly increase the performance of the reassignment and reduce the impact on the cluster, as the replication traffic is distributed across many brokers.

  • RabbitMQ: ?
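A toy sketch of the fixed-number-of-partitions scheme noted above for ElasticSearch: the shard a key maps to never changes, and only the assignment of whole shards to nodes changes when a node is added. The modulo routing and the assignment logic are illustrative only.

[source,python]
----
import zlib

NUM_SHARDS = 12  # fixed at index creation time

def shard_for(key: str) -> int:
    # The key -> shard mapping depends only on the fixed shard count,
    # so it never changes when nodes are added or removed.
    return zlib.crc32(key.encode()) % NUM_SHARDS

def assign_shards(nodes):
    """Spread whole shards round-robin over the current set of nodes."""
    return {shard: nodes[shard % len(nodes)] for shard in range(NUM_SHARDS)}

before = assign_shards(["node-a", "node-b", "node-c"])
after = assign_shards(["node-a", "node-b", "node-c", "node-d"])

moved = [s for s in range(NUM_SHARDS) if before[s] != after[s]]
print("shards that moved to a different node:", moved)
# Rebalancing copies those whole shards to their new nodes; individual
# documents are never re-hashed, because shard_for() ignores the node count.
----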

Request Routing

  • MySQL (Ma-Sl): NA

  • HBase: Routing tier (ZK); if the client has no cache, it first looks up the location of the meta table in ZooKeeper and then asks the region server that holds the meta table; the meta table can be cached in the client.

  • Cassandra: The client sends the request to any node, which forwards it if that node does not hold the data.

  • DynamoDB: ?

  • MongoDB: Routing tier (multiple mongos instances to route and aggregate, plus a config server that stores the data-location information, i.e. which partition holds which data) (see the sketch below).

  • ElasticSearch: Routing tier (nodes with the client role).

  • Kafka: Partition-aware clients: the producer knows which broker to send each partitioned message to, and the consumer knows which partitions it is responsible for consuming from.

  • RabbitMQ: ?
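A toy sketch of the routing-tier approach listed above for MongoDB (mongos plus config server): a thin router looks up which shard owns a key range and forwards the request there. The key ranges, shard names, and router logic are all made up for illustration.

[source,python]
----
import bisect

# "Config server": sorted upper bounds of key ranges and the shard owning each range.
RANGE_UPPER_BOUNDS = ["g", "p", "~"]          # keys < "g", keys < "p", the rest
RANGE_OWNERS       = ["shard-1", "shard-2", "shard-3"]

SHARDS = {"shard-1": {}, "shard-2": {}, "shard-3": {}}

def route(key: str) -> str:
    """The mongos-like router: find the shard whose range contains the key."""
    idx = bisect.bisect_right(RANGE_UPPER_BOUNDS, key)
    return RANGE_OWNERS[min(idx, len(RANGE_OWNERS) - 1)]

def insert(key: str, value):
    SHARDS[route(key)][key] = value      # forward the write to the owning shard

def find(key: str):
    return SHARDS[route(key)].get(key)   # single-shard read, no scatter needed

insert("alice", {"plan": "pro"})
insert("zoe", {"plan": "free"})
print(route("alice"), route("zoe"))      # shard-1 shard-3
print(find("alice"))
----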

CAP

Note
The CAP theorem is actually widely misunderstood; please refer to "Please stop calling databases CP or AP" and "What is the CAP theorem?" for clarification. However, it is still useful and relevant to most people for understanding the characteristics of these data systems.

  • MySQL: Master-Slave: AP, Cluster: CP (I would argue that Master-Slave is only P, because when the master goes down, data is not available for reads or writes while a new master is being elected)

  • HBase: CP

  • Cassandra: AP, eventually C

  • DynamoDB: AP, eventually C

  • MongoDB: P, not A (during the failover election), not C (because replica sync is async)

  • ElasticSearch: P, not A (during the promotion of primary shards), not C (because replica sync is async)

  • Kafka: CA (the author states CA; however, I think it is a CP system: it can tolerate node failures, but during a failure a new leader replica has to be elected, and in that period the data in that partition is not available for reads)

  • RabbitMQ: ?

TODO

  1. Add more data systems: TiDB, ZooKeeper, etcd, Consul

  2. Add "Read behavior", "Dependencies", "Consensus Algorithm", and "Distributed Transaction" rows to the table

Reference

  1. Designing Data-Intensive Applications (https://dataintensive.net/)

  2. MongoDB: The Definitive Guide, 2nd Edition (http://shop.oreilly.com/product/0636920028031.do)

  3. The MongoDB 4.0 Manual (https://docs.mongodb.com/manual/)

  4. Elasticsearch: The Definitive Guide (https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html)

  5. Elasticsearch Reference (https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html)

  6. Cassandra: The Definitive Guide (http://shop.oreilly.com/product/0636920010852.do)

  7. Kafka: The Definitive Guide (http://shop.oreilly.com/product/0636920044123.do)

  8. RabbitMQ in Action (https://www.manning.com/books/rabbitmq-in-action)
