smfrpc / smf

Fastest RPC in the west

Home Page:http://smfrpc.github.io/smf/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Requests after rpc_client reconnect fail with gate_closed_exception

nacl opened this issue · comments

Describe the bug
After a connection times out and rpc_client::reconnect() is called (as suggested by the logged exception), all future requests immediately fail, with the future containing a seastar::gate_closed_exception.

This seems to be caused by rpc_client's gate being closed when rpc_client::stop() is called. Unfortunately, as of writing, seastar::gates cannot be reopened once closed, so I fixed this locally by wrapping the rpc_client's dispatch_gate_ in a std::unique_ptr, and resetting it whenever connect() is invoked. There may be a better way to solve this problem.

To Reproduce
Steps to reproduce the behavior:

  1. Start an RPC server.
  2. Create an RPC client with options setting a connection timeout to 1s.
  3. Run the RPC client, and have the local shard sleep for 2s. After 1s, SMF will log something like this:
[shard 0] smf - rpc_client.cc:242] Handling exception: seastar::timed_out_error (timedout)

This line number may not match what's in the below-mentioned git hash.

If you were to make another request at this point, the error message suggests that I should reconnect().

  1. Invoke the RPC client's reconnect() method, and in its continuation, issue an RPC to the server. The RPC will fail. If the exception is logged, it will be a seastar::gate_closed_exception.

Expected behavior
The RPC in step 4 should reach the server and (server willing) succeed..

Environment Info

OS: Ubuntu 16.04
Git hash: c7e8c58
Seastar git hash: f0c0855
Compiler: gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~16.04~ppa1)

Excellent report!!!

Will put a fix.

@nacl - not sure how you didn't run into the separate bugs when this fixed was put in place.

See #328 for what I think is a proper fix. Basically this would trigger a case when the server didn't properly gate the cleanup future. So there was a race to call the ~rpc_server_connection().

/cc @noahdesu

@nacl this is the output too


➜  debug git:(bugfix/client_reconnect) ./bin/smf_rpc_reconnect_with_timeout_integration_test -c 1 -m 2G 

WARNING: debug mode. Not for benchmarking or production
WARN  2019-03-05 16:01:19,579 seastar - Seastar compiled with default allocator, heap profiler not supported
WARN  2019-03-05 16:01:19,597 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
INFO  2019-03-05 16:01:19,603 [shard 0] smf - rpc_server.cc:73] Starging server:rpc_server{args.ip=127.0.0.1, args.flags=1, args.rpc_port=29973, args.http_port=54447, rpc_routes=rpc_handle_router{rpc_service{name=SmfStorage,handles=rpc_service_method_handle{name=Get}}}, limits=rpc_connection_limits{max_mem:2.000 GB, max_body_parsing_duration: 60000ms, res_avail:2.000 GB (2147483648)}, incoming_filters=0, outgoing_filters=0}
INFO  2019-03-05 16:01:19,603 [shard 0] smf - rpc_server.cc:94] Starting rpc server
INFO  2019-03-05 16:01:19,604 [shard 0] smf - main.cc:65] Iteration: 0
INFO  2019-03-05 16:01:19,624 [shard 0] smf - rpc_client.cc:246] Handling exception: seastar::timed_out_error (timedout)
INFO  2019-03-05 16:01:19,624 [shard 0] smf - D rpc_client.cc:181] Disabling connection to: 127.0.0.1:29973
INFO  2019-03-05 16:01:19,624 [shard 0] smf - rpc_client.cc:194] Setting exceptional state for session: 1
INFO  2019-03-05 16:01:19,625 [shard 0] smf - main.cc:65] Iteration: 1
INFO  2019-03-05 16:01:19,627 [shard 0] smf - main.cc:77] Success. Reconnected and method returned. Server Iteration: 1
INFO  2019-03-05 16:01:19,627 [shard 0] smf - main.cc:65] Iteration: 2
INFO  2019-03-05 16:01:19,628 [shard 0] smf - main.cc:77] Success. Reconnected and method returned. Server Iteration: 1
INFO  2019-03-05 16:01:19,628 [shard 0] smf - D rpc_client.cc:181] Disabling connection to: 127.0.0.1:29973
INFO  2019-03-05 16:01:19,629 [shard 0] smf - rpc_server.cc:135] Stopped seastar::accept() calls
INFO  2019-03-05 16:01:19,629 [shard 0] smf - rpc_server.cc:123] Shutting down server with expected exit codes
INFO  2019-03-05 16:01:19,707 [shard 0] smf - D rpc_server.cc:261] Cannot send respond. client connection '1' is invalid. Skipping reply from server
ERROR 2019-03-05 16:01:19,708 [shard 0] smf - rpc_server.cc:285] There was an error with the connection: Error parsing connection header
INFO  2019-03-05 16:01:19,708 [shard 0] smf - rpc_server.cc:288] Closing connection for client: 127.0.0.1:51248

Meant to reply to this earlier:

@nacl - not sure how you didn't run into the separate bugs when this fixed was put in place.

See #328 for what I think is a proper fix. Basically this would trigger a case when the server didn't properly gate the cleanup future. So there was a race to call the ~rpc_server_connection().

Yeah, I unfortunately didn't really have time to dig too deeply into this.

Thanks for investigating and fixing this promptly. The change has been working well so far.