super-flat / parti

πŸͺ© parti is cluster sharding via raft over gRPC

cluster log out of sync

zenyui opened this issue · comments

It seems like when we rapidly kill nodes and k8s brings in new ones, the newest nodes correctly inherit the old state, but the new leader's changes to state don't propagate to old nodes.

So, in theory, the log replication TO the new nodes worked somehow (or at least a snapshot was installed), but the later changes that the new leader tries to propagate back to the old node are ignored. We ARE using the "testing only" in-memory log store lol, so maybe that's why.
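
For reference, this is roughly what that "testing only" wiring looks like if parti uses hashicorp/raft's in-memory stores directly (a sketch under that assumption; the package and helper names are hypothetical, not parti's actual code):

package cluster // hypothetical package name, for illustration

import "github.com/hashicorp/raft"

// newInmemStores wires up the raft stores the way a test/demo setup typically
// does: log, stable state, and snapshots all live in process memory, so a
// replaced pod comes back with no durable raft state at all.
func newInmemStores() (raft.LogStore, raft.StableStore, raft.SnapshotStore) {
	store := raft.NewInmemStore()         // implements both LogStore and StableStore
	snaps := raft.NewInmemSnapshotStore() // snapshots held in memory only
	return store, store, snaps
}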

For example, consider the scenario below:

  1. The cluster has 3 nodes, and node parti-5b88bbfdc8-wzm72 owns partitions [4, 5, 6]
  2. We kill 2 of 3 nodes, leaving only parti-5b88bbfdc8-wzm72 alive
  3. K8s brings online 2 new nodes: parti-5b88bbfdc8-9dg2h and parti-5b88bbfdc8-k4n64
  4. The 2 new nodes correctly show the oldest node as still owning partitions [4, 5, 6] and have correctly rebalanced so that they inherit the partitions of the dead nodes
  5. The oldest node still shows the dead nodes' IDs as owning partitions [0, 1, 2, 3] and [7, 8, 9]

See the logs below:

➜  parti git:(main) βœ— kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
parti-5b88bbfdc8-9dg2h   1/1     Running   0          3m46s
parti-5b88bbfdc8-k4n64   1/1     Running   0          3m46s
parti-5b88bbfdc8-wzm72   1/1     Running   0          7m26s
➜  parti git:(main) βœ— kubectl exec -it parti-5b88bbfdc8-9dg2h -- sh -c "/app/example stats" 
{
  "nodeId": "evjzpm",
  "isLeader": true,
  "partitionOwners": {
    "0": "evjzpm",
    "1": "evjzpm",
    "2": "evjzpm",
    "3": "evjzpm",
    "4": "ejxbdd",
    "5": "ejxbdd",
    "6": "ejxbdd",
    "7": "tpmotd",
    "8": "tpmotd",
    "9": "tpmotd"
  },
  "peerPorts": {
    "ejxbdd": 50101,
    "evjzpm": 50101,
    "tpmotd": 50101
  }
}
➜  parti git:(main) βœ— kubectl exec -it parti-5b88bbfdc8-k4n64 -- sh -c "/app/example stats" 
{
  "nodeId": "tpmotd",
  "partitionOwners": {
    "0": "tpmotd",
    "1": "tpmotd",
    "2": "tpmotd",
    "3": "tpmotd",
    "4": "ejxbdd",
    "5": "ejxbdd",
    "6": "ejxbdd",
    "7": "tpmotd",
    "8": "tpmotd",
    "9": "tpmotd"
  },
  "peerPorts": {
    "ejxbdd": 50101,
    "evjzpm": 50101,
    "tpmotd": 50101
  }
}
➜  parti git:(main) βœ— kubectl exec -it parti-5b88bbfdc8-wzm72 -- sh -c "/app/example stats" 
{
  "nodeId": "ejxbdd",
  "partitionOwners": {
    "0": "hcxbpo",
    "1": "hcxbpo",
    "2": "hcxbpo",
    "3": "hcxbpo",
    "4": "ejxbdd",
    "5": "ejxbdd",
    "6": "ejxbdd",
    "7": "rpxvpg",
    "8": "rpxvpg",
    "9": "rpxvpg"
  },
  "peerPorts": {
    "ejxbdd": 50101,
    "hcxbpo": 50101,
    "rpxvpg": 50101
  }
}

The first thing to try is whether the pre-built log/snapshot stores that write to disk have this issue. If they don't, we know the culprit, and we can either use the on-disk stores or implement a proper in-memory store of our own.
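
For illustration, a minimal sketch of what switching to the pre-built on-disk stores could look like with hashicorp/raft and raft-boltdb (the data directory and helper name are hypothetical, not parti's actual wiring):

package cluster // hypothetical package name, for illustration

import (
	"os"
	"path/filepath"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

// newDiskStores replaces the in-memory stores with the pre-built durable ones:
// a BoltDB file for the log/stable store and a file-based snapshot store, so
// raft state survives process restarts.
func newDiskStores(dataDir string) (raft.LogStore, raft.StableStore, raft.SnapshotStore, error) {
	if err := os.MkdirAll(dataDir, 0o755); err != nil {
		return nil, nil, nil, err
	}
	// One BoltDB file backs both the log store and the stable store.
	bolt, err := raftboltdb.NewBoltStore(filepath.Join(dataDir, "raft.db"))
	if err != nil {
		return nil, nil, nil, err
	}
	// Retain the two most recent snapshots on disk.
	snaps, err := raft.NewFileSnapshotStore(dataDir, 2, os.Stderr)
	if err != nil {
		return nil, nil, nil, err
	}
	return bolt, bolt, snaps, nil
}

The returned stores would then be passed to raft.NewRaft in place of the in-memory ones.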

this seems resolved now!