scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra

Home Page: http://scylladb.com


c-s load failed during cluster rolling restart - failed to get QUORUM, not enough replicas available

juliayakovlev opened this issue · comments

Packages

Scylla version: 5.5.0~dev-20240510.28791aa2c1d3 with build-id 893c2a68becf3d3bcbbf076980b1b831b9b76e29
Kernel Version: 5.15.0-1060-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Cassandra-stress load (writes and reads) failed during the disrupt_rolling_restart_cluster nemesis: failed to get QUORUM, not enough replicas available.

2024-05-12 08:10:15.584: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=ecab70e9-3b78-4097-88c6-ad7618d462e6 during_nemesis=RollingRestartCluster: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=20651 node=Node longevity-tls-50gb-3d-master-loader-node-a6bbb535-2 [18.201.34.63 | 10.4.8.239]
java.io.IOException: Operation x10 on key(s) [4e354f50393938333930]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)
2024-05-12 08:10:15.652: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=b10af2ee-103b-42b3-aa86-9f96e247611c during_nemesis=RollingRestartCluster: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=20691 node=Node longevity-tls-50gb-3d-master-loader-node-a6bbb535-3 [3.248.184.200 | 10.4.10.124]
java.io.IOException: Operation x10 on key(s) [4c304d4e313534343730]: Error executing: (ReadFailureException): Cassandra failure during read query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)
2024-05-12 08:10:15.784: (CassandraStressLogEvent Severity.CRITICAL) period_type=one-time event_id=b10af2ee-103b-42b3-aa86-9f96e247611c during_nemesis=RollingRestartCluster: type=OperationOnKey regex=Operation x10 on key\(s\) \[ line_number=20713 node=Node longevity-tls-50gb-3d-master-loader-node-a6bbb535-3 [3.248.184.200 | 10.4.10.124]
java.io.IOException: Operation x10 on key(s) [304b4e3436304d323131]: Error executing: (WriteFailureException): Cassandra failure during write query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)

This nemesis restarts Scylla on all nodes (one by one) by running sudo systemctl stop scylla-server.service and then sudo systemctl start scylla-server.service; a minimal sketch of this loop is shown after the node list below.
Node restart order:

'longevity-tls-50gb-3d-master-db-node-a6bbb535-3', 
'longevity-tls-50gb-3d-master-db-node-a6bbb535-4', 
'longevity-tls-50gb-3d-master-db-node-a6bbb535-5', 
'longevity-tls-50gb-3d-master-db-node-a6bbb535-6', 
'longevity-tls-50gb-3d-master-db-node-a6bbb535-8', 
'longevity-tls-50gb-3d-master-db-node-a6bbb535-9'
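
A minimal sketch of that loop (the run_on_node helper, the ssh invocation, and the CQL-port polling are illustrative assumptions, not actual SCT code):

import socket
import subprocess
import time

def run_on_node(node: str, command: str) -> None:
    # Hypothetical helper: run a shell command on a remote node over SSH.
    subprocess.run(["ssh", node, command], check=True)

def wait_for_cql_port(node: str, port: int = 9042, timeout: int = 300) -> None:
    # Poll until the node accepts TCP connections on the CQL port.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((node, port), timeout=5):
                return
        except OSError:
            time.sleep(5)
    raise TimeoutError(f"{node} did not open the CQL port within {timeout}s")

def rolling_restart(nodes: list[str]) -> None:
    for node in nodes:
        run_on_node(node, "sudo systemctl stop scylla-server.service")
        run_on_node(node, "sudo systemctl start scylla-server.service")
        # The nemesis only waits for the CQL port before moving to the next
        # node -- the behaviour later discussed in this thread as insufficient.
        wait_for_cql_port(node)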

The load failures happened after longevity-tls-50gb-3d-master-db-node-a6bbb535-6 was restarted and its initialisation completed.
During Scylla start, very high foreground writes are observed on longevity-tls-50gb-3d-master-db-node-a6bbb535-6. Writes started to fail while Scylla was stopping.

Screenshot from 2024-05-13 11-55-02
where the red line is the longevity-tls-50gb-3d-master-db-node-a6bbb535-6 node.

Reactor stalls (32ms) and kernel callstacks

May 12 08:10:09.301420 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 12. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x32c21 0x21e52 0x68484 0xe059 0x14976 0x14bfd 0x14d09 0x169fb7 0x85d22 0x86121 0x6de02 0x7da2c 0x6332334 0x632ab1b 0x632ae5f 0x632b3a6 0x20fe8b1 0x290118d 0x2900905 0x5e8a29f 0x5e8b587 0x5eaf4c0 0x5e4abda 0x8c946 0x11296f

void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./seastar/src/core/reactor.cc:839
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:858
seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./seastar/src/core/reactor.cc:1482
seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./seastar/src/core/reactor.cc:1219
 (inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./seastar/src/core/reactor.cc:1239
 (inlined by) seastar::reactor::block_notifier(int) at ./build/release/seastar/./seastar/src/core/reactor.cc:1520
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
?? ??:0
seastar::tls::certificate_credentials::impl::set_x509_key(std::basic_string_view<char, std::char_traits<char> > const&, std::basic_string_view<char, std::char_traits<char> > const&, seastar::tls::x509_crt_format) at ./build/release/seastar/./seastar/src/net/tls.cc:399
operator() at ./build/release/seastar/./seastar/src/net/tls.cc:725
 (inlined by) operator()<seastar::x509_key> at ./build/release/seastar/./seastar/src/net/tls.cc:705
 (inlined by) void seastar::visit_blobs<std::multimap<seastar::basic_sstring<char, unsigned int, 15u, true>, boost::any, std::less<seastar::basic_sstring<char, unsigned int, 15u, true> >, std::allocator<std::pair<seastar::basic_sstring<char, unsigned int, 15u, true> const, boost::any> > > const, seastar::internal::variant_visitor<seastar::tls::credentials_builder::apply_to(seastar::tls::certificate_credentials&) const::$_0, seastar::tls::credentials_builder::apply_to(seastar::tls::certificate_credentials&) const::$_1, seastar::tls::credentials_builder::apply_to(seastar::tls::certificate_credentials&) const::$_2> >(std::multimap<seastar::basic_sstring<char, unsigned int, 15u, true>, boost::any, std::less<seastar::basic_sstring<char, unsigned int, 15u, true> >, std::allocator<std::pair<seastar::basic_sstring<char, unsigned int, 15u, true> const, boost::any> > > const&, seastar::internal::variant_visitor<seastar::tls::credentials_builder::apply_to(seastar::tls::certificate_credentials&) const::$_0, seastar::tls::credentials_builder::apply_to(seastar::tls::certificate_credentials&) const::$_1, seastar::tls::credentials_builder::apply_to(seastar::tls::certificate_credentials&) const::$_2>&&) at ./build/release/seastar/./seastar/src/net/tls.cc:710
 (inlined by) seastar::tls::credentials_builder::apply_to(seastar::tls::certificate_credentials&) const at ./build/release/seastar/./seastar/src/net/tls.cc:716
seastar::tls::credentials_builder::build_server_credentials() const at ./build/release/seastar/./seastar/src/net/tls.cc:768
seastar::tls::credentials_builder::build_reloadable_server_credentials(std::function<void (std::unordered_set<seastar::basic_sstring<char, unsigned int, 15u, true>, std::hash<seastar::basic_sstring<char, unsigned int, 15u, true> >, std::equal_to<seastar::basic_sstring<char, unsigned int, 15u, true> >, std::allocator<seastar::basic_sstring<char, unsigned int, 15u, true> > > const&, std::__exception_ptr::exception_ptr)>, std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) const at ./build/release/seastar/./seastar/src/net/tls.cc:1033
generic_server::server::listen(seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>) at ./generic_server.cc:151
operator() at ./transport/controller.cc:72
 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, cql_transport::cql_server&>(std::__invoke_other, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, cql_transport::cql_server&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
 (inlined by) std::__invoke_result<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, cql_transport::cql_server&>::type std::__invoke<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, cql_transport::cql_server&>(cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, cql_transport::cql_server&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:96
 (inlined by) decltype(auto) std::__apply_impl<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, std::tuple<cql_transport::cql_server&>, 0ul>(cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, std::tuple<cql_transport::cql_server&>&&, std::integer_sequence<unsigned long, 0ul>) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/tuple:2288
 (inlined by) decltype(auto) std::apply<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, std::tuple<cql_transport::cql_server&> >(cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, std::tuple<cql_transport::cql_server&>&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/tuple:2299
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::apply<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, cql_transport::cql_server&>(cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, std::tuple<cql_transport::cql_server&>&&) at ././seastar/include/seastar/core/future.hh:2003
 (inlined by) auto seastar::futurize_apply<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, cql_transport::cql_server&>(cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0&, std::tuple<cql_transport::cql_server&>&&) at ././seastar/include/seastar/core/future.hh:2078
 (inlined by) operator() at ././seastar/include/seastar/core/sharded.hh:766
 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}>(std::__invoke_other, seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
 (inlined by) std::__invoke_result<seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}>::type std::__invoke<seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}>(seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:96
 (inlined by) decltype(auto) std::__apply_impl<seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}, std::tuple<>>(seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}&&, std::tuple<>&&, std::integer_sequence<unsigned long>) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/tuple:2288
 (inlined by) decltype(auto) std::apply<seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}, std::tuple<> >(seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}::operator()(cql_transport::cql_server&)::{lambda()#1}&&, std::tuple<>&&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/tuple:2299
 (inlined by) operator() at ././seastar/include/seastar/core/sharded.hh:765
 (inlined by) seastar::future<void> std::__invoke_impl<seastar::future<void>, seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}&, cql_transport::cql_server&>(std::__invoke_other, seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}&, cql_transport::cql_server&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<seastar::future<void>, seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}&, cql_transport::cql_server&>, seastar::future<void> >::type std::__invoke_r<seastar::future<void>, seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}&, cql_transport::cql_server&>(std::enable_if&&, (seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}&)...) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:114
 (inlined by) std::_Function_handler<seastar::future<void> (cql_transport::cql_server&), seastar::sharded<cql_transport::cql_server>::invoke_on_all<cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0>(seastar::smp_submit_to_options, cql_transport::listen_on_all_shards(seastar::sharded<cql_transport::cql_server>&, seastar::socket_address, std::shared_ptr<seastar::tls::credentials_builder>, bool, bool, std::optional<seastar::file_permissions>)::$_0)::{lambda(cql_transport::cql_server&)#1}>::_M_invoke(std::_Any_data const&, cql_transport::cql_server&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:290
std::function<seastar::future<void> (cql_transport::cql_server&)>::operator()(cql_transport::cql_server&) const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
 (inlined by) operator() at ././seastar/include/seastar/core/sharded.hh:747
 (inlined by) seastar::future<void> seastar::futurize<seastar::future<void> >::invoke<seastar::sharded<cql_transport::cql_server>::invoke_on_all(seastar::smp_submit_to_options, std::function<seastar::future<void> (cql_transport::cql_server&)>)::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&>(seastar::sharded<cql_transport::cql_server>::invoke_on_all(seastar::smp_submit_to_options, std::function<seastar::future<void> (cql_transport::cql_server&)>)::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}&) at ././seastar/include/seastar/core/future.hh:2035
 (inlined by) seastar::smp_message_queue::async_work_item<seastar::sharded<cql_transport::cql_server>::invoke_on_all(seastar::smp_submit_to_options, std::function<seastar::future<void> (cql_transport::cql_server&)>)::{lambda(unsigned int)#1}::operator()(unsigned int) const::{lambda()#1}>::run_and_dispose() at ././seastar/include/seastar/core/smp.hh:249
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2690
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3152
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3320
operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:4563
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:111
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:290
std::function<void ()>::operator()() const at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
 (inlined by) seastar::posix_thread::start_routine(void*) at ./build/release/seastar/./seastar/src/core/posix.cc:90
?? ??:0
?? ??:0

kallsyms_20240512_075635_result.log

Impact

Load failed

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-a6bbb535-9 (3.255.115.235 | 10.4.8.139) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a6bbb535-8 (34.244.47.138 | 10.4.9.0) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a6bbb535-7 (34.242.230.228 | 10.4.9.166) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a6bbb535-6 (52.51.27.26 | 10.4.8.53) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a6bbb535-5 (52.16.209.2 | 10.4.8.183) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a6bbb535-4 (18.201.155.139 | 10.4.10.52) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a6bbb535-3 (3.249.199.72 | 10.4.11.206) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a6bbb535-2 (34.255.217.93 | 10.4.11.113) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-a6bbb535-1 (3.254.116.202 | 10.4.8.49) (shards: 14)

OS / Image: ami-0b7480423a402aa95 (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: a6bbb535-3cf6-4f8b-b742-40ef856170ea
Test name: scylla-master/tier1/longevity-50gb-3days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor a6bbb535-3cf6-4f8b-b742-40ef856170ea
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs a6bbb535-3cf6-4f8b-b742-40ef856170ea

Logs:

Jenkins job URL
Argus

That reactor stall is not new (see #13758 (comment) and https://github.com/scylladb/scylla-enterprise/issues/3963#issue-2161024203; I remember more, but can't find them right now!).

Reactor stalls (32ms) and kernel callstacks

@juliayakovlev - where's the kernel stack?

@juliayakovlev - what encryption was configured here, btw? Client <-> server? server <-> server? both?

Reactor stalls (32ms) and kernel callstacks

@juliayakovlev - where's the kernel stack?

If you mean the original file, it's in the node logs (https://cloudius-jenkins-test.s3.amazonaws.com/a6bbb535-3cf6-4f8b-b742-40ef856170ea/20240512_082401/db-cluster-a6bbb535.tar.gz).
If you are looking for the decoded file, I attached it in the issue description.

@juliayakovlev - what encryption was configured here, btw? Client <-> server? server <-> server? both?

both

Reactor stalls (32ms) and kernel callstacks

@juliayakovlev - where's the kernel stack?

If you mean the original file, it's in the node logs (https://cloudius-jenkins-test.s3.amazonaws.com/a6bbb535-3cf6-4f8b-b742-40ef856170ea/20240512_082401/db-cluster-a6bbb535.tar.gz). If you are looking for the decoded file, I attached it in the issue description.

I couldn't find a single kernel stack in the logs. All empty?

Reactor stalls (32ms) and kernel callstacks

@juliayakovlev - where's the kernel stack?

If you mean the original file, it's in the node logs (https://cloudius-jenkins-test.s3.amazonaws.com/a6bbb535-3cf6-4f8b-b742-40ef856170ea/20240512_082401/db-cluster-a6bbb535.tar.gz). If you are looking for the decoded file, I attached it in the issue description.

I couldn't find a single kernel stack in the logs. All empty?

The file is named "kallsyms_20240512_075635" in the longevity-tls-50gb-3d-master-db-node-a6bbb535-6 folder.

In the node log I see only:

May 12 08:10:09.293433 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 2. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f1b1
May 12 08:10:09.293433 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.297418 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 1. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f18b
May 12 08:10:09.297418 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.301420 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 12. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x32c21 0x21e52 0x68484 0xe059 0x14976 0x14bfd 0x14d09 0x169fb7 0x85d22 0x86121 0x6de02 0x7da2c 0x6332334 0x632ab1b 0x632ae5f 0x632b3a6 0x20fe8b1 0x290118d 0x2900905 0x5e8a29f 0x5e8b587 0x5eaf4c0 0x5e4abda 0x8c946 0x11296f
May 12 08:10:09.301420 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.301959 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 10. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f16f
May 12 08:10:09.301959 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.302602 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 8. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f1ac
May 12 08:10:09.302602 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.302994 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 11. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f19e
May 12 08:10:09.302994 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.303774 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 7. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f1ac
May 12 08:10:09.303774 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.304526 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 4. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f19e
May 12 08:10:09.304526 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.304678 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 6. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f1ac
May 12 08:10:09.304678 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.305453 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 13. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f1c2
May 12 08:10:09.305453 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.305529 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 32 ms on shard 9. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f19e
May 12 08:10:09.305529 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:
May 12 08:10:09.305529 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: Reactor stalled for 33 ms on shard 5. Backtrace: 0x5e785fa 0x5e77a05 0x5e78dbf 0x3dbaf 0x6f1c2
May 12 08:10:09.305529 longevity-tls-50gb-3d-master-db-node-a6bbb535-6 scylla[1465]: kernel callstack:

Not sure what it means

this is exactly what I mean - I don't see any kernel stack.

A run from last week (5.5.0~dev-20240501.af5674211dd4):
https://argus.scylladb.com/test/98050732-dfe3-464c-a66a-f235bad30829/runs?additionalRuns[]=16ad5b7e-ab08-4d63-bfb3-ca368a4433f5

It passed this nemesis successfully.

@juliayakovlev, let's give it another run, to see if it's reproducible.

java.io.IOException: Operation x10 on key(s) [4c304d4e313534343730]: Error executing: (ReadFailureException): Cassandra failure during read query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)

I hate this, I hate this.

This isn't the first (or 100th) time we are debugging why queries failed for unclear reasons.

But Scylla knows very well which replicas were available, which were queried, which failed, and what reasons they presented. Why can't we just make it tell us?

java.io.IOException: Operation x10 on key(s) [4c304d4e313534343730]: Error executing: (ReadFailureException): Cassandra failure during read query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)

I hate this, I hate this.

This isn't the first (or 100th) time we are debugging why queries failed for unclear reasons.

But Scylla knows very well which replicas were available, which were queried, which failed, and what reasons they presented. Why can't we just make it tell us?

Do you expect a log on the coordinator for every drop?

java.io.IOException: Operation x10 on key(s) [4c304d4e313534343730]: Error executing: (ReadFailureException): Cassandra failure during read query at consistency QUORUM (2 responses were required but only 1 replica responded, 1 failed)

I hate this, I hate this.
This isn't the first (or 100th) time we are debugging why queries failed for unclear reasons.
But Scylla knows very well which replicas were available, which were queried, which failed, and what reasons they presented. Why can't we just make it tell us?

Do you expect a log on the coordinator for every drop?

Log? No, I would like more information to be added to the error returned to the client, more than just the number of replicas which failed.

Also, if cassandra-stress retries each operation 10 times, it should print all 10 errors, not just the last one.

Also, it should report the coordinator. The client knows which coordinator it picked.

All we know from these errors is:
"A write failed 10 times. The last time it failed, 1 replica succeeded and 1 replica responded with an error."

Wouldn't it be better if the error reports narrowed down the problem more? With this, we don't even know if the restarted node was a coordinator or a replica, let alone why the replica failed, or what's up with the third, uncontacted replica.

The current protocol does not provide more information - https://github.com/apache/cassandra/blob/6bae4f76fb043b4c3a3886178b5650b280e9a50b/doc/native_protocol_v4.spec#L1076
We can extend it of course. And we can probably extend client-side error messages. CC @roydahan

We can extend it of course. And we can probably extend client-side error messages. CC @roydahan

One day we can improve the tools to provide more information; anyway, that is a side track to this issue.
We have the key, so we should be able to know what the replicas are, and we know which replica is down.
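
As an illustration, the replicas owning a failed key can be looked up client-side with the Python driver's token map. A sketch, where the contact point and the keyspace name "keyspace1" (the cassandra-stress default) are assumptions, and the key is the one from the ReadFailureException above:

from cassandra.cluster import Cluster
from cassandra.metadata import Murmur3Token

cluster = Cluster(["10.4.8.239"])  # assumed contact point
cluster.connect()

# Key from the error above; c-s uses a blob partition key, so the
# routing key bytes are the hex string decoded as-is.
key = bytes.fromhex("4c304d4e313534343730")
token = Murmur3Token.from_key(key)

# Look up which hosts own this token for the (assumed) c-s keyspace.
replicas = cluster.metadata.token_map.get_replicas("keyspace1", token)
for host in replicas:
    print(host.address, "up" if host.is_up else "down")

cluster.shutdown()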

Reproducer with the rolling restart cluster nemesis only.
The issue was reproduced during the first nemesis run.

Screenshot from 2024-05-16 13-33-26
Screenshot from 2024-05-16 13-30-59

Packages

Scylla version: 5.5.0~dev-20240510.28791aa2c1d3 with build-id 893c2a68becf3d3bcbbf076980b1b831b9b76e29

Kernel Version: 5.15.0-1060-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-6 (3.254.157.122 | 10.4.3.161) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-5 (3.248.195.69 | 10.4.1.248) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-4 (54.75.96.131 | 10.4.1.55) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-3 (52.209.38.69 | 10.4.0.23) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-2 (3.250.42.65 | 10.4.0.85) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-1 (54.247.198.242 | 10.4.1.187) (shards: 14)

OS / Image: ami-0b7480423a402aa95 (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 0804442d-781a-4233-8168-7dd3e8896011
Test name: scylla-master/reproducers/longevity-50gb-3days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 0804442d-781a-4233-8168-7dd3e8896011
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 0804442d-781a-4233-8168-7dd3e8896011

Logs:

Jenkins job URL
Argus

@juliayakovlev - anything relevant in the replica logs at the time of failure?

We can extend it of course. And we can probably extend client-side error messages. CC @roydahan

One day we can improve the tools to provide more information; anyway, that is a side track to this issue. We have the key, so we should be able to know what the replicas are, and we know which replica is down.

@roydahan - please open a tracking issue for this. Sounds like an easy additional log in the stress tool (c-s?) that could help us.

@kbr-scylla suspects it's a duplicate of #15899; let's fix #15899 and re-test this

@juliayakovlev - anything relevant in the replica logs at the time of failure?

I did not find anything new

#18647 (comment)

Reproducer with the rolling restart cluster nemesis only.

@juliayakovlev could you please also check if it reproduces on 5.4?

#18647 (comment)

Reproducer with the rolling restart cluster nemesis only.

@juliayakovlev could you please also check if it reproduces on 5.4?

@juliayakovlev wrote 2 days ago:

Issue was not reproduced with Scylla version 5.4.6
https://argus.scylladb.com/test/a1c2befc-bd68-457a-ba19-913607256e6f/runs?additionalRuns[]=e0f3aa44-fb22-40a4-b406-91e16ada6c1b

Sorry I missed it.

In this case this is a regression and it is not a duplicate of #15899 (which according to the report, happened way back in 5.1)!

I think it's a major issue -- availability disruption during rolling restart.

Giving it P1 priority and release blocker status.

Actually, I already have a suspicion about what the cause could be: the removal of wait-for-gossip-to-settle on node restart before "completing initialization" :( (cc @kostja @gleb-cloudius) 65cfb9b

We should retest with that final wait-for-gossip-to-settle restored (65cfb9b removed two waits -- I believe we only need the second one for preserving availability)

If so, we should consider:

  • restoring it
  • or implementing another mechanism (perhaps a faster one) for an availability-preserving rolling restart. When performing a rolling restart, first node A then node B, we need to make sure that node A is seen as UP and NORMAL by all nodes before we shut down B. There are perhaps smarter ways to do it than "wait for gossip to settle"; a minimal sketch of such a check is shown after this list.
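
A minimal sketch of such a check, assuming nodetool status is run on every node over SSH and its output is parsed for the UN state (the ssh invocation and helper names are illustrative, not SCT or Scylla APIs):

import subprocess
import time

def node_sees_as_un(queried_node: str, target_ip: str) -> bool:
    # Return True if `queried_node` reports `target_ip` as UN in nodetool status.
    out = subprocess.run(
        ["ssh", queried_node, "nodetool", "status"],
        check=True, capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        fields = line.split()
        # Status lines look like: "UN  10.4.8.53  123.45 GB  ..."
        if len(fields) >= 2 and fields[1] == target_ip:
            return fields[0] == "UN"
    return False

def wait_until_all_see_un(nodes: list[str], target_ip: str, timeout: int = 300) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(node_sees_as_un(n, target_ip) for n in nodes):
            return
        time.sleep(5)
    raise TimeoutError(f"not all nodes see {target_ip} as UN after {timeout}s")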

Modified original post (this is a regression)

Reproducer with the rolling restart cluster nemesis only. The issue was reproduced during the first nemesis run.

Screenshot from 2024-05-16 13-33-26 Screenshot from 2024-05-16 13-30-59

I think that the servers that were considered restarted and had joined the cluster do not
have the capacity of the other servers; see the amount of background writes, there are big
differences between the servers.

The gap grows and grows, until eventually there isn't enough capacity and we reach a timeout.
We need to figure out what influences server performance after they are rebooted. Things like
the cache affect them. This shouldn't affect writes, but a write might require a read into the cache too.

It's probably not a regression, just an issue we may have in general; it needs research.

Packages

Scylla version: 5.5.0~dev-20240510.28791aa2c1d3 with build-id 893c2a68becf3d3bcbbf076980b1b831b9b76e29

Kernel Version: 5.15.0-1060-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-6 (3.254.157.122 | 10.4.3.161) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-5 (3.248.195.69 | 10.4.1.248) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-4 (54.75.96.131 | 10.4.1.55) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-3 (52.209.38.69 | 10.4.0.23) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-2 (3.250.42.65 | 10.4.0.85) (shards: 14)
  • longevity-tls-50gb-3d-repr-iss-db-node-0804442d-1 (54.247.198.242 | 10.4.1.187) (shards: 14)

OS / Image: ami-0b7480423a402aa95 (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: 0804442d-781a-4233-8168-7dd3e8896011
Test name: scylla-master/reproducers/longevity-50gb-3days-test
Test config file(s):

Logs and commands

Packages

Scylla version: 6.1.0~dev-20240523.9adf74ae6c7a with build-id 0e61ad9ecb33913aa59e185d2453859c9ed0fd1a

Kernel Version: 5.15.0-1062-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-9 (34.245.159.154 | 10.4.10.101) (shards: 14)
  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-8 (34.240.42.206 | 10.4.11.134) (shards: 14)
  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-7 (34.255.116.175 | 10.4.10.227) (shards: 14)
  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-6 (3.250.46.23 | 10.4.9.1) (shards: 14)
  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-5 (18.200.252.30 | 10.4.8.20) (shards: 14)
  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-4 (34.253.186.166 | 10.4.11.10) (shards: 14)
  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-3 (54.78.205.8 | 10.4.8.206) (shards: 14)
  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-2 (3.249.117.229 | 10.4.10.104) (shards: 14)
  • longevity-tls-50gb-3d-6-0-db-node-d6d9eca1-1 (3.255.179.210 | 10.4.10.0) (shards: 14)

OS / Image: ami-0927fb8b03edc430c (aws: undefined_region)

Test: longevity-50gb-3days-test
Test id: d6d9eca1-5327-4f35-9588-2b36c644401f
Test name: scylla-6.0/tier1/longevity-50gb-3days-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor d6d9eca1-5327-4f35-9588-2b36c644401f
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d6d9eca1-5327-4f35-9588-2b36c644401f

Logs:

Jenkins job URL
Argus

@juliayakovlev @roydahan how does this rolling restart nemesis decide that it can restart the next node?

@juliayakovlev @roydahan how does this rolling restart nemesis decide that it can restart the next node?

once a node is listening on CQL, it moves to the next one

Since a couple of days ago we verify the CQL port much more often (it was every 60 seconds, now every 5), so the issue could be emphasized.

@juliayakovlev @roydahan how does this rolling restart nemesis decide that it can restart the next node?

once a node is listening on CQL, it moves to the next one

There was a long thread about it not being enough and the need for additional checks (scylladb/scylla-ccm#564 implemented this on CCM, and I believe there was a similar issue for dtest?). Specifically, ensure all OTHER nodes see that node as alive and owning its share of the ring?

@juliayakovlev @roydahan how does this rolling restart nemesis decide that it can restart the next node?

once a node is listening on CQL, it moves to the next one

There was a long thread about it not being enough and the need for additional checks (scylladb/scylla-ccm#564 implemented this on CCM, and I believe there was a similar issue for dtest?). Specifically, ensure all OTHER nodes see that node as alive and owning its share of the ring?

Yes, just checking the CQL port is not enough. But it worked. We still need to figure out what changed.

Yes, just checking the CQL port is not enough. But it worked. We still need to figure out what changed.

Before 65cfb9b, the CQL port being open meant that gossip had settled. After this commit, it no longer does.

Yes, just checking the CQL port is not enough. But it worked. We still need to figure out what changed.

Before 65cfb9b, the CQL port being open meant that gossip had settled. After this commit, it no longer does.

I know :) But we have not confirmed it yet. Also, why does gossip settling guarantee that all nodes see all other nodes as alive? Maybe it is just because it takes time, and it does not guarantee it in reality.

Also, why does gossip settling guarantee that all nodes see all other nodes as alive? Maybe it is just because it takes time, and it does not guarantee it in reality.

That's my guess too -- there was no guarantee, but since wait-for-gossip-to-settle always took at least a few seconds, in practice the observable result was that all nodes saw this one as UP before continuing rolling restart on the next node.

A strong sense of déjà vu here, around this question.

But what's the next step? How can a user do a safe rolling restart with this version?

A strong sense of déjà vu here, around this question.

But what's the next step? How can a user do a safe rolling restart with this version?

The proper procedure for rolling restart was always to wait for the CQL port and wait for all nodes to see the restarted node as UP.

The proper procedure for rolling restart was always to wait for the CQL port and wait for all nodes to see the restarted node as UP.

BTW our docs are vague about it

https://enterprise.docs.scylladb.com/stable/operating-scylla/procedures/config-change/rolling-restart.html

Step 5 says

Verify the node is up and has returned to the Scylla cluster using nodetool status.

but it doesn't say that nodetool status should show UN for this node on every node (so we need to execute nodetool status on every node)

And we have to admit that it's pretty inconvenient to have to connect to every node and execute status there. It's just bad UX.

The proper procedure for rolling restart was always to wait for the CQL port and wait for all nodes to see the restarted node as UP.

BTW our docs are vague about it

https://enterprise.docs.scylladb.com/stable/operating-scylla/procedures/config-change/rolling-restart.html

Step 5 says

Verify the node is up and has returned to the Scylla cluster using nodetool status.

but it doesn't say that nodetool status should show UN for this node on every node (so we need to execute nodetool status on every node)

And we have to admit that it's pretty inconvenient to have to connect to every node and execute status there. It's just bad UX.

The node may do it itself before opening the CQL port, like it does with the shutdown notification, but this is not what "waiting for the gossiper to settle" was doing, so this is a different feature request.

The proper procedure for rolling restart was always to wait for the CQL port and wait for all nodes to see the restarted node as UP.

BTW our docs are vague about it

https://enterprise.docs.scylladb.com/stable/operating-scylla/procedures/config-change/rolling-restart.html

Step 5 says

Verify the node is up and has returned to the Scylla cluster using nodetool status.

but it doesn't say that nodetool status should show UN for this node on every node (so we need to execute nodetool status on every node)

And we have to admit that it's pretty inconvenient to have to connect to every node and execute status there. It's just bad UX.

Anyway, SCT in this case doesn't even do it on a single node.
@gleb-cloudius Do we drain nodes before stopping in the rolling restart procedure on prod clusters?

@gleb-cloudius Do we drain nodes before stopping in the rolling restart procedure on prod clusters?

We can ask our Field engineers. @tarzanek could you help answering this?

But I suspect the manual drain is redundant -- graceful shutdown should already drain automatically before stopping the process.

@gleb-cloudius Do we drain nodes before stopping in the rolling restart procedure on prod clusters?

We can ask our Field engineers. @tarzanek could you help answering this?

It's in Siren's code. But this is an OSS issue, so I won't paste the link. Generally, we do. And we have a timeout between drain and restart too, btw.

I wrote a test.py test for rolling restart:

#
# Copyright (C) 2023-present ScyllaDB
#
# SPDX-License-Identifier: AGPL-3.0-or-later
#
from test.pylib.manager_client import ManagerClient
from test.pylib.internal_types import ServerInfo
from test.pylib.util import unique_name

from cassandra.cluster import Session, ConsistencyLevel
from cassandra.query import SimpleStatement

import asyncio
import time
import pytest
import logging

logger = logging.getLogger(__name__)

@pytest.mark.asyncio
async def test_rolling_restart(request, manager: ManagerClient):
    """Test cluster rolling restart"""

    [await manager.server_add() for _ in range(6)]

    servers = await manager.running_servers()

    finish_writes = await start_writes(manager.cql)

    for s in servers:
        logger.info(f"Restarting node {s}")
        await manager.server_restart(s.server_id, wait_others=0)
#        await manager.servers_see_each_other(servers)
    
    await finish_writes()

async def start_writes(cql: Session, concurrency: int = 3):
    logger.info(f"Starting to asynchronously write, concurrency = {concurrency}")

    stop_event = asyncio.Event()

    ks_name = unique_name()
    await cql.run_async(f"CREATE KEYSPACE {ks_name} WITH replication = {{'class': 'NetworkTopologyStrategy', 'replication_factor': 3}}")
    await cql.run_async(f"USE {ks_name}")
    await cql.run_async(f"CREATE TABLE tbl (pk int PRIMARY KEY, v int)")

    # In the test we only care about whether operations report success or not
    # and whether they trigger errors in the nodes' logs. Inserting the same
    # value repeatedly is enough for our purposes.
    stmt = SimpleStatement("INSERT INTO tbl (pk, v) VALUES (0, 0)", consistency_level=ConsistencyLevel.QUORUM)

    async def do_writes(worker_id: int):
        write_count = 0
        error_count = 0
        while not stop_event.is_set():
            start_time = time.time()
            try:
                await cql.run_async(stmt)
                write_count += 1
            except Exception as e:
                error_count += 1
                logger.error(f"Write started {time.time() - start_time}s ago failed: {e}")
                pass
        logger.info(f"Worker #{worker_id} did {write_count} successful writes")
        if error_count != 0:
            raise Exception(f"Thread {worker_id} encountered {error_count} errors while writing")

    tasks = [asyncio.create_task(do_writes(worker_id)) for worker_id in range(concurrency)]

    async def finish():
        logger.info("Stopping write workers")
        stop_event.set()
        await asyncio.gather(*tasks)

    return finish

With await manager.servers_see_each_other(servers) commented out, it fails with or without waiting for the gossiper to settle. With the line uncommented, the test works.

So, what’s the verdict?
Do you want to introduce the workaround to SCT?
(We have such a thing in other places).

If so, we should later open an RFE or a bug for the bad UX of having to verify that all nodes see each other as UN.
I assume that in most cases, if one just checks for UN on the node that is about to be restarted and this node sees all others as UN, it will work, but that is probably not safe enough.

So, what’s the verdict? Do you want to introduce the workaround to SCT? (We have such a thing in other places).

Yes.

If so, we should later open an RFE or a bug for the bad UX of having to verify that all nodes see each other as UN. I assume that in most cases, if one just checks for UN on the node that is about to be restarted and this node sees all others as UN, it will work, but that is probably not safe enough.

#8275 already exists. Did not make progress on it though.

I ran some experiments yesterday, inserting networking delays into the loopback. With a 10 ms delay and with waiting for the gossiper to settle, the test above passes, while with the same delay but without waiting for the gossiper, it fails. The conclusion is that waiting for the gossiper to settle was, by accident, preventing SCT from failing even though it did not follow the correct rolling restart procedure. SCT should be fixed.
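
For reference, a sketch of how such a loopback delay can be injected with tc netem (requires root; the wrapper functions are illustrative):

import subprocess

def set_loopback_delay(delay_ms: int) -> None:
    # Add an artificial delay to all loopback traffic.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", "lo", "root", "netem",
         "delay", f"{delay_ms}ms"],
        check=True,
    )

def clear_loopback_delay() -> None:
    # Remove the netem qdisc from the loopback device.
    subprocess.run(["tc", "qdisc", "del", "dev", "lo", "root"], check=True)

# Example: the 10 ms delay used in the experiment above.
# set_loopback_delay(10)
# ... run the rolling restart test ...
# clear_loopback_delay()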

Removing the regression label: the problem also happens in 5.4.
Removing the release blocker label: the cause is in SCT.