Node bootstrap failed with error: "the topology coordinator rejected request to join the cluster: request canceled because some required nodes are dead"
timtimb0t opened this issue
Packages
Scylla version: 2025.1.0~rc2-20250216.6ee17795783f
with build-id 8fc682bcfdf0a8cd9bc106a5ecaa68dce1c63ef6
Kernel Version: 6.8.0-1021-aws
Issue description
During the disrupt_decommission_streaming_err nemesis, the coordinator node was chosen as the target node for decommission. The decommission process started and the coordinator node reported:
2025-02-18T15:53:18.936+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-1 !INFO | scylla[5530]: [shard 0: gms] raft_topology - coordinator is decommissioning and becomes a non-voter; giving up leadership
2025-02-18T15:53:18.936+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-1 !INFO | scylla[5530]: [shard 0: gms] raft_group0 - becoming a non-voter (my id = 04af9f5f-5f97-4eaa-960b-71703ffba331)...
2025-02-18T15:53:18.936+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-1 !INFO | scylla[5530]: [shard 0: gms] raft_group0 - became a non-voter.
2025-02-18T15:53:18.936+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-1 !INFO | scylla[5530]: [shard 0:strm] raft_group0 - losing leadership
At the same time, node2 reported that it gained the leadership:
< t:2025-02-18 15:53:19,368 f:db_log_reader.py l:125 c:sdcm.db_log_reader p:DEBUG > 2025-02-18T15:53:19.318+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0:strm] raft_group0 - gaining leadership
< t:2025-02-18 15:53:19,368 f:db_log_reader.py l:125 c:sdcm.db_log_reader p:DEBUG > 2025-02-18T15:53:19.319+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0:strm] raft_topology - start topology coordinator fiber
< t:2025-02-18 15:53:19,368 f:db_log_reader.py l:125 c:sdcm.db_log_reader p:DEBUG > 2025-02-18T15:53:19.319+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] raft_topology - updating topology state: Starting new topology coordinator a3eaf90e-c343-4846-abb5-6d712aef3519
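The leadership handoff can be traced across the collected node logs with something like this (a sketch; it assumes the db-cluster tarball from the Logs section below has been extracted into ./db-cluster-9603c5ec/):
# Sketch: trace group0 leadership changes and coordinator startup across all nodes.
$ grep -h -E 'raft_group0 - (gaining|losing) leadership|start topology coordinator fiber' \
    db-cluster-9603c5ec/*/messages.log | sort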
The nemesis successfully interrupted the decommission process by rebooting the node, and it returned to the cluster:
2025-02-18T14:13:11.818+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] raft_topology - coordinator accepted request to join, waiting for nodes [04af9f5f-5f97-4eaa-960b-71703ffba331] to be alive before responding and continuing
2025-02-18T14:13:12.428+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0:strm] group0_tombstone_gc_handler - Setting reconcile time to 1739887990 (min id=7c1e8116-ee02-11ef-4691-7c497b721f5a)
2025-02-18T14:13:12.428+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] gossip - InetAddress 04af9f5f-5f97-4eaa-960b-71703ffba331/2a05:d018:12e3:f000:e91:e111:135f:93fd is now UP, status = NORMAL
2025-02-18T14:13:12.428+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] raft_topology - nodes [04af9f5f-5f97-4eaa-960b-71703ffba331] are alive
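For context, the disruption itself boils down to starting a decommission on the target node and rebooting it mid-stream; roughly (a sketch of the idea, not the actual disrupt_decommission_streaming_err code):
# Sketch, not the actual SCT nemesis code: start a decommission and interrupt it.
$ nodetool decommission &    # the node begins streaming its ranges away
$ sleep 60                   # hypothetical delay to let streaming get underway
$ sudo reboot                # interrupt the decommission; the node rejoins afterwards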
But then the coordinator node marked the node that had returned to the cluster as down:
2025-02-18T15:53:22.568+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] gossip - InetAddress 04af9f5f-5f97-4eaa-960b-71703ffba331/2a05:d018:12e3:f000:e91:e111:135f:93fd is now DOWN, status = shutdown
2025-02-18T15:53:22.568+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !ERR | scylla[5542]: [shard 0: gms] raft_topology - send_raft_topology_cmd(stream_ranges) failed with exception (node state is decommissioning): seastar::rpc::closed_error (connection is closed)
2025-02-18T15:53:22.568+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] raft_topology - start rolling back topology change
2025-02-18T15:53:22.568+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] raft_topology - rollback 04af9f5f-5f97-4eaa-960b-71703ffba331 after decommissioning failure, moving transition state to rollback to normal and setting cleanup flag
2025-02-18T15:53:22.569+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] raft_topology - updating topology state: rollback 04af9f5f-5f97-4eaa-960b-71703ffba331 after decommissioning failure, moving transition state to rollback to normal and setting cleanup flag
2025-02-18T15:53:22.569+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] raft_topology - entered `rollback to normal` transition state
2025-02-18T15:53:22.569+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0: gms] raft_topology - executing global topology command barrier_and_drain, excluded nodes: {}
2025-02-18T15:53:26.318+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 !INFO | scylla[5542]: [shard 0:main] raft_group_registry - marking Raft server 04af9f5f-5f97-4eaa-960b-71703ffba331 as dead for raft groups
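Node1's UP/DOWN transitions as seen from node2 can be pulled out of the collected logs with something like this (a sketch; the extracted directory layout is an assumption):
# Sketch: follow node1's gossip state as seen by node2 (host id taken from the logs above).
$ grep -h 'gossip - InetAddress 04af9f5f-5f97-4eaa-960b-71703ffba331' \
    db-cluster-9603c5ec/*node-9603c5ec-2*/messages.log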
At the same time, within this nemesis, a new node was being added, but the bootstrap process failed with the error:
2025-02-18T15:59:10.184+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-15 !ERR | scylla[5579]: [shard 0:main] init - Startup failed: std::runtime_error (the topology coordinator rejected request to join the cluster: request canceled because some required nodes are dead)
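Since the topology coordinator refuses joins while required members are dead (the error above), a hedged pre-check before bootstrapping a new node would be:
# Sketch: confirm the existing members are alive before adding a node.
# Any DN row explains a "required nodes are dead" rejection from the coordinator.
$ nodetool status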
Impact
It seems that the node was lost from the cluster, with no possibility of adding a new one.
How frequently does it reproduce?
Installation details
Cluster size: 12 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
- parallel-topology-schema-changes-mu-db-node-9603c5ec-9 (3.8.90.244 | 2a05:d01c:0964:7d01:a280:6b51:4349:5d63) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-8 (35.176.254.136 | 2a05:d01c:0964:7d00:93a3:14bb:8761:8466) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-7 (18.171.207.44 | 2a05:d01c:0964:7d00:390d:3141:febc:9ca6) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-6 (34.250.253.228 | 2a05:d018:12e3:f002:9b03:a289:9188:121e) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-5 (34.242.117.26 | 2a05:d018:12e3:f002:3d33:8df7:1205:239b) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-4 (34.241.176.122 | 2a05:d018:12e3:f001:2a47:5be5:96b0:e220) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-3 (3.254.146.142 | 2a05:d018:12e3:f001:1083:0c03:af45:c941) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-2 (54.155.86.187 | 2a05:d018:12e3:f000:b330:96eb:6ad3:58f7) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-15 (54.217.10.35 | 2a05:d018:12e3:f000:fdb0:a7a2:746c:ae39) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-14 (34.249.104.146 | 2a05:d018:12e3:f002:8109:2732:9e81:efe4) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-13 (13.40.155.212 | 2a05:d01c:0964:7d02:e4d4:5718:a802:51f7) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-12 (18.169.104.21 | 2a05:d01c:0964:7d02:1581:463d:d5d1:4792) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-11 (18.169.133.149 | 2a05:d01c:0964:7d02:397f:8cbd:55e1:2103) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-10 (18.175.200.62 | 2a05:d01c:0964:7d01:0aef:1c67:a1a7:ea2a) (shards: 7)
- parallel-topology-schema-changes-mu-db-node-9603c5ec-1 (3.249.249.119 | 2a05:d018:12e3:f000:0e91:e111:135f:93fd) (shards: 7)
OS / Image: ami-089e047033a16995a ami-0c34f939e95d0c640
(aws: undefined_region)
Test: longevity-multidc-schema-topology-changes-12h-test
Test id: 9603c5ec-ad38-449a-aa85-b91ff235b5d8
Test name: scylla-2025.1/vnodes/tier1/longevity-multidc-schema-topology-changes-12h-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 9603c5ec-ad38-449a-aa85-b91ff235b5d8
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 9603c5ec-ad38-449a-aa85-b91ff235b5d8
Logs:
- parallel-topology-schema-changes-mu-db-node-9603c5ec-12 - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_140800/parallel-topology-schema-changes-mu-db-node-9603c5ec-12-9603c5ec.tar.zst
- parallel-topology-schema-changes-mu-db-node-9603c5ec-6 - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_140800/parallel-topology-schema-changes-mu-db-node-9603c5ec-6-9603c5ec.tar.zst
- parallel-topology-schema-changes-mu-db-node-9603c5ec-1 - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_140800/parallel-topology-schema-changes-mu-db-node-9603c5ec-1-9603c5ec.tar.zst
- db-cluster-9603c5ec.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_161942/db-cluster-9603c5ec.tar.zst
- sct-runner-events-9603c5ec.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_161942/sct-runner-events-9603c5ec.tar.zst
- sct-9603c5ec.log.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_161942/sct-9603c5ec.log.tar.zst
- loader-set-9603c5ec.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_161942/loader-set-9603c5ec.tar.zst
- monitor-set-9603c5ec.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_161942/monitor-set-9603c5ec.tar.zst
- ssl-conf-9603c5ec.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/20250218_161942/ssl-conf-9603c5ec.tar.zst
- builder-9603c5ec.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/9603c5ec-ad38-449a-aa85-b91ff235b5d8/upload_20250218_162133/builder-9603c5ec.log.tar.gz
some required nodes - were they?
@timtimb0t - what was the status of all other nodes at the same time?
@gleb-cloudius - can you please take a look?
node1 seems to be down. First of all, there is no system.log in the parallel-topology-schema-changes-mu-db-node-9603c5ec-1/ directory,
which AFAIK indicates that the node was down when the logs were collected (and we need it since it has more info than messages.log). Second, there are these messages in the node2 log:
Feb 18 15:56:51.784302 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 scylla[5542]: [shard 0: gms] gossip - Got shutdown message from 2a05:d018:12e3:f000:e91:e111:135f:93fd, received_generation=1739894152, local_generation=1739894152
Feb 18 15:56:51.784726 parallel-topology-schema-changes-mu-db-node-9603c5ec-2 scylla[5542]: [shard 0: gms] gossip - InetAddress 04af9f5f-5f97-4eaa-960b-71703ffba331/2a05:d018:12e3:f000:e91:e111:135f:93fd is now DOWN, status = shutdown
Cancellation happened at 15:59:08.207, so after that.
And third, the last line of the log on node1 is:
2025-02-18T15:56:51.670+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-1 !NOTICE | syslog-ng[889]: syslog-ng shutting down; version='4.3.1'
That is exactly the same time node2 got the shutdown message from it.
Some probably unrelated, but still notable, issues that I saw: after the reboot, node1 did not manage to start Scylla right away. The first attempt failed with:
2025-02-18T15:53:48.382+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-1 !ERR | scylla[836]: [shard 0:main] init - Startup failed: std::system_error (error system:99, posix_listen failed for address [2a05:d018:12e3:f000:e91:e111:135f:93fd]:9180: Cannot assign requested address)
The second is messages like:
2025-02-18T15:53:21.419+00:00 parallel-topology-schema-changes-mu-db-node-9603c5ec-1 !WARNING | scylla[5530]: [shard 0:strm] seastar - Too long queue accumulated for streaming (3072 tasks)
while streaming.
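The checks above can be repeated against the extracted log tarballs, roughly like this (a sketch; exact paths depend on how the archives listed under Logs are unpacked):
# Sketch: repeat the triage on the extracted node logs.
$ ls db-cluster-9603c5ec/*node-9603c5ec-1*/                       # no system.log => node down at collection time
$ tail -n 3 db-cluster-9603c5ec/*node-9603c5ec-1*/messages.log    # ends with the syslog-ng shutdown line
$ grep -h 'gossip - Got shutdown message' \
    db-cluster-9603c5ec/*node-9603c5ec-2*/messages.log            # node2 receiving node1's shutdown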
some required nodes - were they? @timtimb0t - what was the status of all other nodes at the same time?
All other nodes were UN
Except for 04af9f5f-5f97-4eaa-960b-71703ffba331. This one was pining for the fjords. From the sct-9603c5ec.log:
< t:2025-02-18 15:56:51,438 f:cluster.py l:1254 c:sdcm.cluster p:INFO > Node parallel-topology-schema-changes-mu-db-node-9603c5ec-1 [3.249.249.119 | 10.4.0.168 | 2a05:d018:12e3:f000:0e91:e111:135f:93fd] (dc name: eu-westscylla_node_west, rack: 1a) destroyed
Yes, the first node is the problematic default coordinator node that was banned by the new coordinator (node2).
I do not understand what you mean here. You have 14 nodes: 13 up, 1 down. You want to bootstrap node 15, which fails because all nodes should be up. There is no bug here. This is expected behaviour.
@gleb-cloudius , the sequence was as follows:
1. The coordinator node (node1) was chosen for decommission and lost leadership
2. Node2 gained the leadership
3. The decommission process on node1 was interrupted and it returned to the cluster
4. The coordinator node (node2) marked node1 as down despite it having returned to the cluster
5. Adding the new node failed because node1 was marked as down
As a result, the cluster lost one node and never added a new one.
According to all the evidence here #22983 (comment), this is not what happened. At step 4, node1 is dead. In fact, it sent a shutdown message to node2. The sct log shows it as destroyed, as can be seen here #22983 (comment).
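For completeness: if node1 is indeed permanently gone, the usual way to unblock adding new nodes is to remove the dead member from the cluster first (a hedged general sketch, not something the test did):
# Sketch: remove the dead member (host id from the logs above) from any live node,
# then retry bootstrapping the new node. Only do this if node1 will not come back.
$ nodetool removenode 04af9f5f-5f97-4eaa-960b-71703ffba331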