andoco / grpc-distributed-exercise

Solution to two variants of a network messaging problem, with support for reconnection logic.

See the client and server docs for instructions on each.

Design Notes

Variant 1

gRPC was chosen to allow use of its streaming RPCs. This provides convenient guarantees of message ordering within the stream, but does not guarantee delivery of messages to the other party. Ensuring delivery would require receiving an acknowledgement via another stream, but this has not been implemented.

The protocol buffers definition uses two RPCs, Begin and Resume. Two separate calls were chosen to be more explicit about the intent of the call, rather than using a single Begin call with an optional seed field that could be left empty accidentally.

Once a call is received, the server starts streaming the value sequence from an initial random seed, or from a seed provided by the client.
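
As a rough sketch of that streaming step (names are hypothetical — `valueStream` stands in for the generated gRPC server stream, and the doubling sequence is inferred from the resume formula noted below):

```go
package server

// valueStream stands in for the Send side of the generated gRPC server
// stream; only Send is needed here, and int64 stands in for the value
// message type. These names are assumptions, not taken from the repo.
type valueStream interface {
	Send(v int64) error
}

// streamFrom sends count values starting at seed, doubling each time
// (the doubling sequence is implied by the resume formula in the note
// below). Begin would call this with a fresh random seed; Resume with
// the seed supplied by the client.
func streamFrom(stream valueStream, seed int64, count int) error {
	v := seed
	for i := 0; i < count; i++ {
		if err := stream.Send(v); err != nil {
			return err // stop on any send error; the client may reconnect
		}
		v *= 2
	}
	return nil
}
```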

The client keeps track of the total number of values received, and keeps a running sum for printout at the end of the stream. This avoids the need to buffer all the received values.
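
A minimal sketch of that receive loop, assuming a hypothetical `valueStream` interface in place of the generated gRPC client stream:

```go
package client

import (
	"errors"
	"fmt"
	"io"
)

// valueStream stands in for the generated gRPC client stream; only Recv
// is needed here, and int64 stands in for the value message type.
type valueStream interface {
	Recv() (int64, error)
}

// progress is the only state the client keeps; the received values
// themselves are never buffered.
type progress struct {
	first int64 // first value received, used when computing a resume point
	total int   // number of values received so far
	sum   int64 // running sum, printed at the end of the stream
}

// consume reads values until the stream ends or fails, updating p in
// place so the caller can reconnect and resume after a partial read.
func consume(stream valueStream, p *progress) error {
	for {
		v, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			fmt.Printf("received %d values, sum %d\n", p.total, p.sum)
			return nil
		}
		if err != nil {
			return err // caller reconnects and calls Resume
		}
		if p.total == 0 {
			p.first = v
		}
		p.total++
		p.sum += v
	}
}
```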

If the client encounters any error when receiving a value, it will attempt to reconnect using an exponential backoff, and then call the Resume RPC. This could be improved by only reconnecting for certain errors.
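
A sketch of the retry helper, assuming a hypothetical `withBackoff` wrapper around the reconnect-and-Resume call:

```go
package client

import "time"

// withBackoff retries fn with exponential backoff between attempts. In
// the client, fn would re-dial the server and call the Resume RPC. A
// fuller version might inspect the gRPC status code and only retry
// transient errors such as Unavailable, as noted above.
func withBackoff(fn func() error, maxAttempts int) error {
	const base = 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(base << uint(attempt)) // 100ms, 200ms, 400ms, ...
	}
	return err
}
```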

NOTE: There is a bug in the seed handling in this implementation. The stream should resume from firstValue * 2^(totalReceived-1). This could be fixed by including the first value and the total number of values received in the Resume call, and letting the server resume from the next calculated value.
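
Concretely, that fix might compute the resume point on the client roughly like this (illustrative names; the exact resume semantics are assumed from the note above):

```go
package client

// resumeSeed derives the seed for the Resume call from the first value
// and the number of values already received, per the note above: the
// last value delivered was firstValue * 2^(totalReceived-1).
func resumeSeed(firstValue int64, totalReceived int) int64 {
	if totalReceived < 1 {
		return firstValue // nothing received yet; Begin would be used instead
	}
	return firstValue << uint(totalReceived-1) // firstValue * 2^(totalReceived-1)
}
```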

Variant 2

gRPC was used again for the same reasons as in variant 1. Given that we're now generating a checksum, it could be used to verify that all messages were delivered.

Two RPCs for Begin and Resume are again used to be more explicit.

Session State

The server stores session state against each client id in a map to allow resuming. This includes the maximum number of values to send to the client, and the initial seed value used by the PRNG.

Simple mutex read/write locking is used for session state access, and a background routine handles cleaning up old session state for inactive clients based on their LastActive timestamp. It may be expensive to update this LastActive value on every number sent to the client, so we could potentially restrict the rate at which this happens.
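
A rough sketch of the session store described in the last two paragraphs, with assumed field names and the simple map-wide lock in the cleanup routine:

```go
package session

import (
	"sync"
	"time"
)

// session is the per-client state needed to resume a stream. Field
// names are assumptions; the repo's actual fields may differ.
type session struct {
	MaxValues  int       // maximum number of values to send to this client
	Seed       int64     // initial seed value used by the PRNG
	Sent       int       // how many values have been streamed so far
	LastActive time.Time // used to expire sessions for inactive clients
}

// store maps client id -> session, guarded by a read/write mutex.
type store struct {
	mu       sync.RWMutex
	sessions map[string]*session
}

func newStore() *store {
	return &store{sessions: make(map[string]*session)}
}

// Get returns a copy of the session for a client, if one exists.
func (s *store) Get(clientID string) (session, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	sess, ok := s.sessions[clientID]
	if !ok {
		return session{}, false
	}
	return *sess, true
}

// Put stores (or replaces) a client's session and marks it active.
func (s *store) Put(clientID string, sess session) {
	sess.LastActive = time.Now()
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[clientID] = &sess
}

// cleanup periodically removes sessions inactive for longer than ttl.
// As noted above, this locks the whole map while pruning.
func (s *store) cleanup(ttl, every time.Duration) {
	for range time.Tick(every) {
		cutoff := time.Now().Add(-ttl)
		s.mu.Lock()
		for id, sess := range s.sessions {
			if sess.LastActive.Before(cutoff) {
				delete(s.sessions, id)
			}
		}
		s.mu.Unlock()
	}
}
```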

The cleanup routine is currently not very efficient as it involves locking the entire map while pruning the old session state. If we had many active clients we'd want to avoid this potential delay, perhaps using a queue of clients to check for expiry.

PRNG

With the Go standard library math/rand package, we need to manually fast-forward the generator to the next random number when resuming; if we didn't, the resumed stream would repeat values already sent. If we had access to the underlying PRNG state we could persist it and avoid the fast-forwarding. This could be possible using a custom rand.Source.
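
For example, a resumed PRNG might be fast-forwarded like this (assuming each value was drawn with `Int63`; the repo's actual draw method may differ):

```go
package prng

import "math/rand"

// resumeRand recreates the PRNG from the original seed and discards the
// values that were already sent, so the resumed stream continues from
// the next unseen value rather than repeating from the start.
func resumeRand(seed int64, alreadySent int) *rand.Rand {
	r := rand.New(rand.NewSource(seed))
	for i := 0; i < alreadySent; i++ {
		r.Int63() // fast-forward; assumes each value was produced by Int63
	}
	return r
}
```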

Checksum

The checksum could have been generated by buffering all transmitted values (including their order index) and calculating the checksum over the appended bytes. This could be memory intensive though, particularly on the server with many concurrent clients.

A more efficient solution would be to agree on a buffer size between the client and server, and update a running checksum whenever the buffer fills up or when the end of the stream is reached.

Another possibility would be a custom checksum function that can be updated incrementally on every value, so that no buffering is necessary.
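
One way to sketch that incremental approach with the standard library is to feed each value (and its order index) into a `hash.Hash` as it is sent or received, so neither side buffers the stream; the choice of SHA-256 here is an assumption, not the repo's actual checksum:

```go
package checksum

import (
	"crypto/sha256"
	"encoding/binary"
	"hash"
)

// Running computes a checksum incrementally: each value is fed into the
// hash as it is sent or received, so the stream never needs buffering.
type Running struct {
	h hash.Hash
}

func New() *Running {
	return &Running{h: sha256.New()}
}

// Add feeds the next value into the checksum. Including the order index
// makes the result sensitive to reordering as well as to missing values.
func (r *Running) Add(index uint64, value int64) {
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[0:8], index)
	binary.BigEndian.PutUint64(buf[8:16], uint64(value))
	r.h.Write(buf[:])
}

// Sum returns the final checksum once the end of the stream is reached.
func (r *Running) Sum() []byte {
	return r.h.Sum(nil)
}
```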
