Raft topology error injections: failed to add a node to a cluster if another bootstrapping node is stuck

Question

enaydanov opened this issue 24 days ago · comments

This is a failure of a synthetic test implemented as part of the "Randomized Failure Injection for Raft Based Topology" test effort: #16223

The idea of the test is to have a cluster in which one node is stressed with injections and failures, while the rest of the cluster is used to keep the Raft state machine making progress.

The add_new_node cluster event fails to add a node to the cluster when used with the following error injections:

stop_after_starting_auth_service
stop_after_setting_mode_to_normal
stop_before_becoming_raft_voter
stop_after_updating_cdc_generation
stop_before_streaming
stop_after_streaming

scylla-6.log (the new node that the test is trying to add while node-5 is SIGSTOPped):

...
INFO  2024-05-13 08:08:58,689 [shard 0:strm] raft_group0 - setup_group0: joining group 0...
INFO  2024-05-13 08:08:58,690 [shard 0:strm] raft_group0 - server bb19bc99-4102-43ee-a4a0-c6cc41b6b1bc found no local group 0. Discovering...
INFO  2024-05-13 08:08:58,693 [shard 0:strm] raft_group0 - server bb19bc99-4102-43ee-a4a0-c6cc41b6b1bc found group 0 with group id e442c260-10ff-11ef-ad2b-a8475d9bb3ee, leader 40bc108d-f62e-4209-8d86-7d6485bf3028
INFO  2024-05-13 08:08:58,693 [shard 0:strm] raft_topology - join: sending the join request to 127.193.106.2
INFO  2024-05-13 08:08:58,900 [shard 0:strm] raft_topology - join: request to join placed, waiting for the response from the topology coordinator
INFO  2024-05-13 08:08:58,914 [shard 0:strm] raft_group0 - Server bb19bc99-4102-43ee-a4a0-c6cc41b6b1bc is starting group 0 with id e442c260-10ff-11ef-ad2b-a8475d9bb3ee
DEBUG 2024-05-13 08:08:58,924 [shard 0:strm] raft_topology - reload raft topology state
INFO  2024-05-13 08:08:58,939 [shard 0:strm] raft_group0 - Detected snapshot with index=0, id=d3238a33-19dc-4749-bf24-6c4348fe7c61, triggering new snapshot
WARN  2024-05-13 08:08:58,939 [shard 0:strm] raft_group0 - Could not create new snapshot, there are no entries applied
INFO  2024-05-13 08:09:00,002 [shard 0: gms] gossip - InetAddress 40bc108d-f62e-4209-8d86-7d6485bf3028/127.193.106.2 is now UP, status = NORMAL
INFO  2024-05-13 08:09:00,005 [shard 0: gms] gossip - InetAddress 07803579-905d-44ed-b762-db7c5c172b03/127.193.106.3 is now UP, status = NORMAL
INFO  2024-05-13 08:09:00,006 [shard 0: gms] gossip - InetAddress 580d775b-3e0b-4dcb-ac8f-0e8eb918ce2e/127.193.106.4 is now UP, status = NORMAL
INFO  2024-05-13 08:09:00,008 [shard 0: gms] gossip - InetAddress b04bc736-88ec-4dcc-b839-a51f9de76b57/127.193.106.1 is now UP, status = NORMAL
WARN  2024-05-13 08:09:14,998 [shard 0: gms] gossip - Fail to send EchoMessage to 127.193.106.5: seastar::rpc::timeout_error (rpc call timed out)

After the last message, node-6 just does nothing.
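
One way to confirm that a node has gone silent is to measure how long ago its last log entry was written. The sketch below is only an illustration and assumes the line layout shown in the excerpt above (level, date, time with comma-separated milliseconds); `parse_timestamp` and `seconds_since_last_entry` are hypothetical helpers, not part of the test suite.

```python
from datetime import datetime

# Assumption: timestamps look like "2024-05-13 08:09:14,998", as in the excerpt.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_timestamp(line):
    """Return the timestamp of a log line, or None if the line has none."""
    parts = line.split(maxsplit=3)
    if len(parts) < 3:
        return None
    try:
        # parts[0] is the level, parts[1] the date, parts[2] the time.
        return datetime.strptime(f"{parts[1]} {parts[2]}", TIMESTAMP_FORMAT)
    except ValueError:
        return None


def seconds_since_last_entry(lines, now):
    """Return seconds between `now` and the newest timestamped line, or None."""
    stamps = [ts for ts in map(parse_timestamp, lines) if ts is not None]
    if not stamps:
        return None
    return (now - max(stamps)).total_seconds()


if __name__ == "__main__":
    sample = [
        "...",
        "INFO  2024-05-13 08:09:00,008 [shard 0: gms] gossip - InetAddress ... is now UP, status = NORMAL",
        "WARN  2024-05-13 08:09:14,998 [shard 0: gms] gossip - Fail to send EchoMessage to 127.193.106.5",
    ]
    # A fixed "now" keeps the example deterministic.
    print(seconds_since_last_entry(sample, datetime(2024, 5, 13, 8, 10, 14, 998000)))
```

In practice `lines` would come from reading scylla-6.log and `now` from `datetime.now()`. A value that keeps growing suggests the node is stuck rather than slow.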

To reproduce these specific failures, check out the PR and change the CLUSTER_EVENTS and ERROR_INJECTIONS tuples (in test/topology_experimental_raft/cluster_events.py and test/topology_experimental_raft/error_injections.py, respectively) so that only the required combination runs.
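
As a rough illustration of narrowing the run, the sketch below pairs the add_new_node event with each of the six injections listed above. It assumes both tuples hold plain string names; the actual tuples in the PR may hold functions or other objects, and `failing_combinations` is a hypothetical helper that does not exist in the test suite.

```python
from itertools import product

# Assumption: the tuples are reduced to just the failing event and injections.
# The real definitions live in cluster_events.py and error_injections.py
# and may use a different element type.
CLUSTER_EVENTS = ("add_new_node",)

ERROR_INJECTIONS = (
    "stop_after_starting_auth_service",
    "stop_after_setting_mode_to_normal",
    "stop_before_becoming_raft_voter",
    "stop_after_updating_cdc_generation",
    "stop_before_streaming",
    "stop_after_streaming",
)


def failing_combinations(events=CLUSTER_EVENTS, injections=ERROR_INJECTIONS):
    """Return every (event, injection) pair to run, in declaration order."""
    return list(product(events, injections))


if __name__ == "__main__":
    for event, injection in failing_combinations():
        print(f"{event} + {injection}")
```

To isolate a single failure, reduce ERROR_INJECTIONS to a one-element tuple (keeping the trailing comma).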

add_node.tar.gz