super-flat / parti

πŸͺ© parti is cluster sharding via raft over gRPC

cluster log out of sync

zenyui opened this issue · comments

It seems like when we rapidly kill nodes and k8s brings in new ones, the newest nodes correctly inherit the old state, but the new leader's changes to state don't propagate to old nodes.

So, in theory, the log replication TO the new nodes worked somehow (or at least a snapshot was installed), but the later changes that the new leader tries to propagate back to the old node are ignored. We ARE using the "testing only" in-memory log store lol, so maybe that's why.
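
For reference, this is roughly what that "testing only" wiring looks like if parti uses hashicorp/raft's in-memory stores directly (a sketch under that assumption; the package and helper names are hypothetical, not parti's actual code):

package cluster // hypothetical package name, for illustration

import "github.com/hashicorp/raft"

// newInmemStores wires up the raft stores the way a test/demo setup typically
// does: log, stable state, and snapshots all live in process memory, so a
// replaced pod comes back with no durable raft state at all.
func newInmemStores() (raft.LogStore, raft.StableStore, raft.SnapshotStore) {
	store := raft.NewInmemStore()         // implements both LogStore and StableStore
	snaps := raft.NewInmemSnapshotStore() // snapshots held in memory only
	return store, store, snaps
}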

For example, consider the scenario below:

  1. The cluster has 3 nodes, and node parti-5b88bbfdc8-wzm72 owns partitions [4, 5, 6]
  2. We kill 2 of 3 nodes, leaving only parti-5b88bbfdc8-wzm72 alive
  3. K8s brings online 2 new nodes: parti-5b88bbfdc8-9dg2h and parti-5b88bbfdc8-k4n64
  4. The 2 new nodes correctly show the oldest node as still owning partitions [4, 5, 6] and have correctly rebalanced so that they inherit the partitions of the dead nodes
  5. The oldest node still shows the dead nodes' IDs as owning partitions [0, 1, 2, 3] and [7, 8, 9]

See the logs below:

➜  parti git:(main) βœ— kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
parti-5b88bbfdc8-9dg2h   1/1     Running   0          3m46s
parti-5b88bbfdc8-k4n64   1/1     Running   0          3m46s
parti-5b88bbfdc8-wzm72   1/1     Running   0          7m26s
➜  parti git:(main) βœ— kubectl exec -it parti-5b88bbfdc8-9dg2h -- sh -c "/app/example stats" 
{
  "nodeId": "evjzpm",
  "isLeader": true,
  "partitionOwners": {
    "0": "evjzpm",
    "1": "evjzpm",
    "2": "evjzpm",
    "3": "evjzpm",
    "4": "ejxbdd",
    "5": "ejxbdd",
    "6": "ejxbdd",
    "7": "tpmotd",
    "8": "tpmotd",
    "9": "tpmotd"
  },
  "peerPorts": {
    "ejxbdd": 50101,
    "evjzpm": 50101,
    "tpmotd": 50101
  }
}
➜  parti git:(main) βœ— kubectl exec -it parti-5b88bbfdc8-k4n64 -- sh -c "/app/example stats" 
{
  "nodeId": "tpmotd",
  "partitionOwners": {
    "0": "tpmotd",
    "1": "tpmotd",
    "2": "tpmotd",
    "3": "tpmotd",
    "4": "ejxbdd",
    "5": "ejxbdd",
    "6": "ejxbdd",
    "7": "tpmotd",
    "8": "tpmotd",
    "9": "tpmotd"
  },
  "peerPorts": {
    "ejxbdd": 50101,
    "evjzpm": 50101,
    "tpmotd": 50101
  }
}
➜  parti git:(main) βœ— kubectl exec -it parti-5b88bbfdc8-wzm72 -- sh -c "/app/example stats" 
{
  "nodeId": "ejxbdd",
  "partitionOwners": {
    "0": "hcxbpo",
    "1": "hcxbpo",
    "2": "hcxbpo",
    "3": "hcxbpo",
    "4": "ejxbdd",
    "5": "ejxbdd",
    "6": "ejxbdd",
    "7": "rpxvpg",
    "8": "rpxvpg",
    "9": "rpxvpg"
  },
  "peerPorts": {
    "ejxbdd": 50101,
    "hcxbpo": 50101,
    "rpxvpg": 50101
  }
}

The first thing to try is whether the pre-built log/snapshot stores that write to disk have this issue. If they don't, we know the culprit, and we can either use the on-disk stores or implement a proper in-memory store of our own.
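
For illustration, a minimal sketch of what switching to the pre-built on-disk stores could look like with hashicorp/raft and raft-boltdb (the data directory and helper name are hypothetical, not parti's actual wiring):

package cluster // hypothetical package name, for illustration

import (
	"os"
	"path/filepath"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

// newDiskStores replaces the in-memory stores with the pre-built durable ones:
// a BoltDB file for the log/stable store and a file-based snapshot store, so
// raft state survives process restarts.
func newDiskStores(dataDir string) (raft.LogStore, raft.StableStore, raft.SnapshotStore, error) {
	if err := os.MkdirAll(dataDir, 0o755); err != nil {
		return nil, nil, nil, err
	}
	// One BoltDB file backs both the log store and the stable store.
	bolt, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft.db"))
	if err != nil {
		return nil, nil, nil, err
	}
	// Retain the two most recent snapshots on disk.
	snaps, err := raft.NewFileSnapshotStore(dataDir, 2, os.Stderr)
	if err != nil {
		return nil, nil, nil, err
	}
	return bolt, bolt, snaps, nil
}

The returned stores would then be passed to raft.NewRaft in place of the in-memory ones.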

this seems resolved now!