mit-dci / opencbdc-tx

A transaction processor for a hypothetical, general-purpose, central bank digital currency

Impossible to trigger error "Failed to start coordinator client"

mszulcz-mitre opened this issue

Affected Branch

trunk

Basic Diagnostics

  • I've pulled the latest changes on the affected branch and the issue is still present.

  • The issue is reproducible in Docker.

Description

Consider the following branch in sentinel_2pc::controller::init (src/uhs/twophase/sentinel_2pc/controller.cpp, lines 38-41):

        if(!m_coordinator_client.init()) {
            m_logger->error("Failed to start coordinator client");
            return false;
        }

The first line checks whether the coordinator client fails to start. If it does fail, the next lines should log the error "Failed to start coordinator client" and return false from the method. Surprisingly, it is impossible to execute this branch, because m_coordinator_client.init always returns true.

The culprit seems to be rpc::tcp_client::init (src/util/rpc/tcp_client.hpp, lines 54-68), which is in the call stack of m_coordinator_client.init. According to its description, rpc::tcp_client::init “connects to … server endpoints”. Its body indicates that if the connection fails, it should return false:

    [[nodiscard]] auto init() -> bool {
        if(!m_net.cluster_connect(m_server_endpoints, false)) {
            return false;
        }
        …
    }

However, it can never return false. The reason is that cluster_connect is called with its second argument set to false, and with that argument cluster_connect always returns true. Consequently its caller rpc::tcp_client::init always returns true, and ultimately m_coordinator_client.init always returns true. Here's a summary of the call stack, from the outermost call down to cluster_connect (a standalone sketch of the cluster_connect behavior follows the list):

  • m_coordinator_client.init (always true)
    • coordinator::rpc::init (always true)
      • rpc::tcp_client::init (always true)
        • m_net.cluster_connect(m_server_endpoints, false) (always true)
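
To make this concrete, here is a minimal, self-contained sketch of the behavior described above. It is not the real cbdc::network implementation; the names endpoint_t and try_connect and the parameter name error_fatal are assumptions used purely for illustration.

    #include <string>
    #include <utility>
    #include <vector>

    // Standalone sketch, NOT the real cbdc::network code: it illustrates why a
    // cluster connect that treats per-endpoint failures as non-fatal can only
    // return true. try_connect stands in for the real connection attempt.
    using endpoint_t = std::pair<std::string, unsigned short>;

    static auto try_connect(const endpoint_t& /* ep */) -> bool {
        return false; // pretend every connection attempt fails
    }

    static auto cluster_connect(const std::vector<endpoint_t>& eps,
                                bool error_fatal) -> bool {
        for(const auto& ep : eps) {
            if(!try_connect(ep) && error_fatal) {
                return false; // only reachable when error_fatal == true
            }
            // with error_fatal == false the failure is tolerated and the loop continues
        }
        return true;
    }

    auto main() -> int {
        const std::vector<endpoint_t> eps{{"abcdefg", 5555}};
        // Mirrors the call in rpc::tcp_client::init: returns true (exit code 0)
        // even though every individual connection attempt failed.
        return cluster_connect(eps, /*error_fatal=*/false) ? 0 : 1;
    }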

Because sentinel_2pc::controller::init never raises an error if the coordinator client fails to start, an infinite loop can be triggered. The loop is in the body of sentinel_2pc::controller::send_compact_tx (src/uhs/twophase/sentinel_2pc/controller.cpp, lines 194-203):

    // TODO: add a "retry" error response to offload sentinels from this
    //       infinite retry responsibility.
    while(!m_coordinator_client.execute_transaction(ctx, cb)) {
        // TODO: the network currently doesn't provide a callback for
        //       reconnection events so we have to sleep here to
        //       prevent a needless spin. Instead, add such a callback
        //       or queue to the network to remove this sleep.
        static constexpr auto retry_delay = std::chrono::milliseconds(100);
        std::this_thread::sleep_for(retry_delay);
    };

When the coordinator client fails to start but no error is raised, every call to m_coordinator_client.execute_transaction returns false and the loop executes indefinitely. The author apparently recognized this possibility, but there is currently no code to log an error or debug message and exit the loop gracefully.
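
For illustration only — this is not the fix proposed in this issue, nor code from the pull request — one possible shape of the graceful exit the TODO alludes to is a bounded retry that logs and stops spinning; the retry bound and log message below are hypothetical:

    // Hypothetical sketch only, not proposed in this issue: bound the retries
    // and log instead of spinning forever. max_retries is an invented constant.
    static constexpr auto retry_delay = std::chrono::milliseconds(100);
    static constexpr size_t max_retries = 100;
    size_t attempts{0};
    while(!m_coordinator_client.execute_transaction(ctx, cb)) {
        if(++attempts >= max_retries) {
            m_logger->error("Failed to send compact transaction to coordinator");
            break; // stop retrying instead of looping indefinitely
        }
        std::this_thread::sleep_for(retry_delay);
    }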

Triggering the infinite loop is easy: just provide a junk coordinator IP address (e.g. “abcdefg”) for the unit tests in controller_test.cpp. Specifically, replace lines 24-25 in SetUp in controller_test.cpp with the following:

        const auto coordinator_ep
            = std::make_pair("abcdefghijklmnopqrstuvwxyz", m_coordinator_port);

How should this be addressed? It seems reasonable to simply replace the call to m_net.cluster_connect(m_server_endpoints, false) with m_net.cluster_connect(m_server_endpoints, true), which would allow it to return false if the cluster connection fails. sentinel_2pc::controller::init could then fail when the coordinator client fails to start, which would preclude triggering the infinite loop.
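
Concretely, the edit in rpc::tcp_client::init would be the one-line change sketched below (this mirrors the snippet quoted earlier; it is a sketch of the proposal, not a quote from the pull request):

    // Sketch of the proposed change in rpc::tcp_client::init
    // (src/util/rpc/tcp_client.hpp): pass true as the second argument so a
    // failed cluster connection propagates as a false return value.
    if(!m_net.cluster_connect(m_server_endpoints, true)) {
        return false;
    }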

Code of Conduct

  • I agree to follow this project's Code of Conduct

It seems reasonable to just replace [false] with [true]…

I agree; at least on the surface, that seems eminently reasonable. It's been a little while since I poked at the rpc layer, so, out of interest, have you taken a shot at changing that? Did you run into any immediate issues?

Unfortunately, changing false to true causes a unit test to fail. The good news is that, as written, I think it should fail. Here’s the test up to the failing assertion (tests/unit/rpc/tcp_test.cpp):

    TEST(tcp_rpc_test, send_fail_test) {
        using request = int64_t;
        using response = int64_t;

        auto client = cbdc::rpc::tcp_client<request, response>(
            {{cbdc::network::localhost, 55555},
             {cbdc::network::localhost, 55556}});
        ASSERT_TRUE(client.init());

Before I changed false to true, client.init() was guaranteed to return true, so the assertion always passed. After the change, it became possible for it to return false, and the assertion failed. In this test, I think it actually should fail, because the test doesn't instantiate a server for the client to connect to. Without a server, client initialization fails. That connection failure was always happening before the bug fix, but it was masked because cluster_connect in rpc::tcp_client::init was forced to always return true.
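
One plausible way to update that assertion for the new behavior — a guess at the shape of the change, not necessarily what the pull request actually does — is to assert that initialization fails when no server is listening:

    // Hypothetical adjustment: with no server listening on either endpoint,
    // init() should now report failure instead of silently succeeding.
    auto client = cbdc::rpc::tcp_client<request, response>(
        {{cbdc::network::localhost, 55555},
         {cbdc::network::localhost, 55556}});
    ASSERT_FALSE(client.init());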

I made the proposed change in Pull Request #135. In addition to the bug fix itself, the PR modifies the unit test that failed to account for the new behavior expected after the bug fix. Finally, the PR adds a new unit test to trigger the error that was previously impossible.

Right now the PR is a draft. It's passed all the checks, so if you agree with the changes we're discussing, I'll click "Ready for review".