x/ref/runtime/internal/naming/namespace.TestAuthorizationDuringResolve is flaky and exposes an RPC implementation race condition.
cosnicolaou opened this issue · comments
TestAuthorizationDuringResolve is flaky, and it turns out the flakiness is due to an underlying race condition in the error and context cancelation handling code. This is related to issue #40, and the reader should look at that issue as well. To understand the problem it is necessary to understand some of the details of the RPC protocol. In particular, when a client connects to a server, it waits for the server to present its blessings/discharges; the client then determines whether it wishes to proceed with presenting its own blessings to the server. This test exposes a race condition in the implementation of the error paths in both the client and the server. It does so by explicitly triggering the case where the client decides that it doesn't trust the server (all_test.go:720).
The protocol of this 'auth handshake' is as follows:
1. The server and client exchange setup messages (`flow/messages.Setup`).
2. The server sends data messages (`messages.Data`) to the client with the blessings, followed by a `messages.Auth` to indicate that the blessings have been encoded. This is in flow/conn/conn.go, `NewAccepted`, which calls `acceptHandshake`.
3. `acceptHandshake` sends the blessings and `messages.Auth` and then calls `readRemoteAuth` to read the blessings sent by the client.
4. The client reads the blessings sent by the server, using `readRemoteAuth` but called via `NewDialed` and `dialHandshake`.
5. The client decides whether it wishes to continue based on the blessings received. Note that the server is already expecting to receive blessings from the client (see 3) and is waiting in its invocation of `readRemoteAuth`.
6. If the client decides to abort the interaction, it sends a `message.Teardown` to the server, or rather it should send a `message.Teardown`. However, there is a code path in `dialHandshake` where it may fail to do so because of a race between dialHandshake completing and the client giving up due to context cancelation (line 248 in `flow/conn/conn.go`). That code path currently does not capture any error returned by dialHandshake. The fix (PR #133) is simply to assign err to ferr to ensure that the connection is torn down in the event that the client decides not to proceed. Without the teardown the server will remain waiting for a blessings message in `readRemoteAuth`.
7. `readRemoteAuth` currently does not have any specific handling for `message.TearDown`, which means that the teardown is handled by the invocation of `handleMessage`. That returns a nil error, indicating that the message was handled, and consequently `readRemoteAuth` calls `getMsg`, which will fail with an EOF. This results in a confusing and unclear error message. PR #133 addresses this by explicitly handling the Teardown message.
8. Note that `readRemoteAuth` also has explicit handling for receiving requests to create new flows (`message.OpenFlow`). That is less likely to be encountered following this change; however, as called in NewDialed (line 270), the possibility of a deadlock exists with two successive calls to the same server.
Fixed by #133