vanadium / core

Slimmed down version of Vanadium that is focused on its RPC and security system.

x/ref/runtime/internal/naming/namespace.TestAuthorizationDuringResolve is flaky and exposes an RPC implementation race condition.

cosnicolaou opened this issue

TestAuthorizationDuringResolve is flaky, and the flakiness turns out to be caused by an underlying race condition in the error and context cancelation handling code. This is related to issue #40, which the reader should also look at. To understand the problem it is necessary to understand some details of the RPC protocol. In particular, when a client connects to a server, it waits for the server to present its blessings/discharges and then decides whether to proceed with presenting its own blessings to the server. This test exposes a race condition in the implementation of the error paths in both the client and the server. It does so by explicitly triggering the case where the client decides that it doesn't trust the server (all_test.go:720).

The protocol of this 'auth handshake' is as follows:

  • server and client exchange setup messages (flow/messages.Setup).
  • server sends data messages (messages.Data) to the client with the blessings, followed by a messages.Auth to
    indicate that the blessings have been encoded. This happens in flow/conn/conn.go, in NewAccepted, which calls acceptHandshake.
  • acceptHandshake sends the blessings and messages.Auth and then calls readRemoteAuth to read the blessings sent by the client.
  • the client reads the blessings sent by the server, also using readRemoteAuth, but called via NewDialed and dialHandshake.
  • the client decides if it wishes to continue based on the blessings received. Note that the server is already expecting to
    receive blessings from the client (see the third step above) and is waiting in its invocation of readRemoteAuth.
  • if the client decides to abort the interaction, it sends a message.Teardown to the server, or rather it should send one. However, there is a code path in dialHandshake where it may fail to do so due to a race between dialHandshake completing and the client giving up due to context cancelation (line 248 in flow/conn/conn.go). That code path currently does not capture any error returned by dialHandshake. The fix (PR #133) is simply to assign err to ferr to ensure that the connection is torn down in the event that the client decides not to proceed. Without the teardown the server will remain waiting for a blessings message in readRemoteAuth.
  • readRemoteAuth currently does not have any specific handling for message.TearDown, which means that the teardown is handled by the call to handleMessage. That call returns a nil error, indicating that the message was handled, and consequently readRemoteAuth calls getMsg again, which fails with an EOF. This results in a confusing and unclear error message. PR #133 addresses this by explicitly handling the Teardown message.
  • note that readRemoteAuth also has explicit handling for requests to create new flows (message.OpenFlow). That case is less likely to be encountered following this change; however, as called in NewDialed (line 270), the possibility of a deadlock exists with two successive calls to the same server.

Fixed by #133