MatrixAI / js-rpc

Stream-based JSON RPC for JavaScript/TypeScript Applications

Home Page: https://polykey.com

Allow server RPC handler to respond after timeout from the client (so that client can receive a richer error message)

tegefaulkes opened this issue

Specification

Currently this is technically supported: the server handler can already just ignore the timeout. The problem is that the client-side timeout hard-closes the stream. This leads to a situation where a handler that expects to time out before responding, but still intends to respond with a message, is not allowed to complete by the client side.

For example, we have a nodesPing handler that has two expected cases.

  1. Ping succeeds and returns with success message.
  2. Ping fails and returns with failure message.

In case 2 the client is ending the RPC call before the response message is received, leading to an error message instead of the expected RPC message indicating ping failure.
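To make case 2 concrete, here is a minimal sketch of such a handler, assuming a hypothetical pingNode helper and a ctx shape that exposes the advisory timeout as an abort signal (not the actual js-rpc handler API):

  // Hypothetical helper that rejects once the signal aborts.
  declare function pingNode(
    nodeId: string,
    opts: { signal: AbortSignal },
  ): Promise<void>;

  type PingResult = { success: boolean; message: string };

  // Sketch of a nodesPing-style handler: instead of erroring on timeout,
  // it responds with a "ping failed" message.
  async function nodesPing(
    input: { nodeId: string },
    ctx: { signal: AbortSignal },
  ): Promise<PingResult> {
    try {
      await pingNode(input.nodeId, { signal: ctx.signal });
      return { success: true, message: 'Ping succeeded' };
    } catch (e) {
      if (ctx.signal.aborted) {
        // Case 2: timed out, but the client should receive this message
        // rather than a hard stream closure.
        return { success: false, message: 'Ping failed: timed out' };
      }
      throw e;
    }
  }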

This needs to be addressed in the RPC client. We need to add a grace time between the server and client timeout. The middleware that updates the handler timeout needs to account for this as well.

Additional context

Tasks

  1. Determine the best way to allow time for the server handler to respond after a timeout.
  2. Implement this option.

I don't like the idea of a grace time to resolve this. Can you think of a more deterministic protocol to solve this problem? I thought we had originally considered this already: when the client times out, that's a client error, nothing to do with the server. If the server times out, that's a server error returned back to the client. If they have the same timeout then whichever fires first will be the origin of the response. So I disagree with the additional non-deterministic grace time in principle.

The problem is that the client will always go first because it doesn't account for the network delay. As it stands the server can't gracefully handle a timeout at all if the client always closes the stream at the same time.

It's not really a grace time I'm suggesting; we just need the client side to have a longer timeout to allow the server to respond. It would be as simple as setting a longer default timeout for the client, but the timeout middleware overrides that.

Longer or shorter, it's all still non-deterministic. The server should just do a no-op if the client times out.

Or we allow the user to pass in a handler for client timeout, and this just logs a warning message.

The client is allowed to timeout earlier. The server handler is supposed to gracefully terminate without error.

That's all fine. The main problem is we don't have ANY way for an RPC call to respond gracefully with a message after it times out. It defeats the point of the server handler timeout being advisory if the client just closes the stream before the server can respond.

I'd like for the handler to be able to end gracefully after it times out. The only other option is that it always errors. But we can't even provide a useful error since it's the client cancelling the stream.

The server side should have the main agency over the stream, the client timeout should only ever be the last option fail-safe for ending the stream.

I don't understand this. Can you draw a diagram pointing to where the complication is?

Here is a diagram explaining the problem as it is.

[image: diagram of the current behaviour]

And here is what we expected to happen
[image: diagram of the expected behaviour]

Note the following details.

  1. The client call and server handler have separate timeouts.
  2. The server handler timeout is reset using the timeout time of the client caller via the timeout middleware.
  3. The server handler timeout is advisory, delivered as a signal that can be ignored; it is up to the handler to decide what to do. In this case we treat the timeout as a failure to ping and respond with a message.
  4. The client timeout cancels the RPC stream directly, which ends up cancelling the transport stream.

This is a good explanation thanks. Let's do more of this in the future.

My initial thought is that this is due to two separate semantics being mixed up in the client timeout parameter.

  1. As a client I no longer care about the result from the server after 10,000ms.
  2. As a client I am telling the server that it only has 10,000ms to do its work and therefore the server should tell me what failed to complete in that 10,000ms.

And because of this mixed up meaning, we have a UI/UX issue where the client side reports a generic client rpc timeout error, but not a ping failure error from the server side.

This can be difficult to solve. Because the solution depends on what we want at the top level.

If I think semantic 1. is the more important semantic, then the current behaviour as it stands is correct, barring the server erroring out, which it should not do.

If I think semantic 2. is the more important semantic, then we would want to follow your diagram.

Is there a third way that might combine these together?

I prefer semantic 2, where the timeout handling is done by the server side. It has more of the information needed to decide how to handle it. That said, we do need a way for the client to handle bad actors, where after a certain amount of time we give up on the RPC call and just cancel it.

Combining the two would just mean having one time for how long we want the server to take, and a second time after which we give up on the whole call. The difference between these two would be considered a grace time.

In my mind implementing both semantics just means having a slightly longer client fail-safe timeout.

It seems that your proposed solution is to give the client a grace timeout of say 1000ms that allows the server to respond with an error. If the server doesn't respond with that error, then the client should time out with a generic timeout error?

Pretty much, but less a grace timeout and more two separate options: a call timeout and a kill timeout. Depends how we want to go about it.

I think just adding 1000 to the internal client timeout is sufficient. That is, if the client timeout is 10000, the real client timeout is 11000, but 10000 is communicated to the server via metadata.
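Roughly like this (illustrative names only, not the real js-rpc internals): the caller-requested timeout is what gets advertised to the server in the metadata, while the client's own fail-safe timer runs for that value plus the grace time.

  const graceTime = 1000;

  // The requested timeout is advertised to the server unchanged;
  // the client itself only gives up graceTime later.
  function clientTimeouts(requestedTimeout: number) {
    return {
      advertisedTimeout: requestedTimeout, // sent to the server via metadata
      failSafeTimeout: requestedTimeout + graceTime, // when the client cancels the stream
    };
  }

  // clientTimeouts(10000) => { advertisedTimeout: 10000, failSafeTimeout: 11000 }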

Sounds good.

Looking at making the change, it's not quite as simple as just adding 1000 to the client timeout separately from what's sent to the server.

  1. The timer can be the default or specified using the ctx when making the call. It could be a number or an existing timer when provided to the ctx.
  2. The timeout middleware gets the timeout value from the timer's delay. We can't know if it was set by the default or passed in.

So there's no really good place to add the grace time to the timer; the only real option is to subtract the value in the timeout middleware. But subtraction leaves us open to some invalid values. How do we want to handle this?

The main problem is that we could end up with a negative timeout being sent to the server. Minimally we should clamp this to 0, but that means any specified timeout that is less than the grace time could end up with some weird implicit behaviour. For example, if a call is made with a timeout less than the grace time then the handler will time out before it can do anything.
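To illustrate the failure mode (a sketch of the problem only, not a proposed fix):

  const graceTime = 1000;
  const callerDelay = 500; // caller asked for a 500ms timeout
  // Subtracting in the middleware and clamping to 0 means the handler's
  // timeout fires immediately, before it can do any work.
  const handlerTimeout = Math.max(0, callerDelay - graceTime); // 0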

I'm not sure how we want to go about this.

Create a new timer from the old timer?

Wrap the method calls?

Just don't mutate the original timer.

Subtraction is not correct here. This is a UI/UX problem. Don't create new UI/UX problems when solving 1 UI/UX problem.

On reflection, it should be simple enough to have the timeout trigger a second-stage grace timer before cancelling the stream.
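Something along these lines (hypothetical names, sketching the idea rather than the actual RPCClient code): when the first timer fires we signal the timeout, and only after a second grace timer do we actually cancel the stream.

  // Sketch: the first-stage timeout signals the timeout, the second-stage
  // grace timer is the fail-safe that actually cancels the transport stream.
  function twoStageTimeout(
    onTimeout: () => void, // e.g. abortController.abort(timeoutError)
    cancelStream: () => void, // hard-closes the underlying stream
    timeout: number,
    graceTime: number,
  ): () => void {
    let stage2: ReturnType<typeof setTimeout> | undefined;
    const stage1 = setTimeout(() => {
      onTimeout();
      stage2 = setTimeout(cancelStream, graceTime);
    }, timeout);
    // The returned cleanup runs when the call finishes normally.
    return () => {
      clearTimeout(stage1);
      if (stage2 != null) clearTimeout(stage2);
    };
  }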

I've stumbled onto this really weird addition to the RPCClient and RPCServer.

  public registerOnTimeoutCallback(callback: () => void) {
    this.onTimeoutCallback = callback;
  }

    // and in the duplex caller.
    void timer.then(
      async () => {
        abortController.abort(timeoutError);
        // added here
        if (this.onTimeoutCallback) {
          this.onTimeoutCallback();
        }
      },
      () => {}, // Ignore cancellation error
    );

Looks like it was added by @addievo for testing use. We tend to steer away from adding features like this just for testing, so I don't think this should have been included.

This should be removed; there are better, less invasive ways to tell if the RPC has timed out during testing.
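For example, a test can observe the timeout through the call's own behaviour instead of a callback hook (a sketch only; the client shape, method name, and ctx option are placeholders, not the real test suite):

  import { test, expect } from '@jest/globals';

  // Placeholder client shape; a real test would use the RPCClient under test.
  declare const rpcClient: {
    methods: {
      somePingMethod(input: object, ctx?: { timer?: number }): Promise<unknown>;
    };
  };

  test('call rejects once the client timeout fires', async () => {
    // A real test would match the specific timeout error class; here we
    // only assert that the call itself rejects.
    await expect(
      rpcClient.methods.somePingMethod({}, { timer: 100 }),
    ).rejects.toThrow();
  });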

When the client gets a remote RPC error, we currently show the remote RPC error AND also the cause being from the agent. This provides context that the error is in fact from the remote agent; however, it's still a bit verbose. We should figure out how best to present errors... and I'm wondering whether it makes sense to invert it, so we show the deepest error and then go up the cause chain, rather than going down the cause chain.

This was from Sunday, and I noticed that when the agent hit an error, and of course we get a remote RPC error, then... it was kind of repetitive to keep saying remote error, because any error from the agent is remote. So I imagine rather than:

RemoteError
  ActualError

It should be

ActualError
  RemoteError

That way the idea is that we flip the cause chain and say: this is the "origin" of the error, and it was later wrapped as whatever it is.

So what is the opposite of cause? Like effect? This might make more sense:

ActualError
  then: RemoteError

ActualError
  by: RemoteError

ActualError
   followed: RemoteError
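As a rough sketch of what flipping the chain could look like (illustrative only, not an existing js-rpc utility): walk down to the deepest cause first, then print each wrapper after it under a "then:" label.

  function formatErrorChain(error: Error): string {
    // Collect the chain from outermost (e.g. RemoteError) to innermost.
    const chain: Error[] = [];
    let current: unknown = error;
    while (current instanceof Error) {
      chain.push(current);
      current = current.cause;
    }
    // Print deepest-first, so the origin leads and the wrappers follow.
    return chain
      .reverse()
      .map(
        (e, i) =>
          `${'  '.repeat(i)}${i === 0 ? '' : 'then: '}${e.name}: ${e.message}`,
      )
      .join('\n');
  }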

I'm leaning towards "then".

That seems like a reasonable idea. My first thought was "parent" for the chain, but that might be a bit too programmer-language for readability.