Dropping the last reference to a Server should not cancel outstanding calls

zenhack opened this issue a year ago · comments

Right now, the server package invokes method calls with a context that is canceled when either the server's context is canceled (i.e. the last reference was dropped) or the context passed in the original call is canceled.

I think this is not the semantics we want; it means you can't write code like:

fut, rel := cap.Foo(ctx, ...)
defer rel()
cap.Release() // Done calling methods on cap, drop the reference now.
res, err := fut.Struct() // may spuriously return an error, since we released cap

Given that the caller already has a way to cancel the method call (cancel the context that was passed in), I think we should change this so that that is the only way the call will be cancelled. Note that in the case where the call is coming in from a connection, the call's context will still be bound to the lifetime of the connection.

I think I'm hitting this in tempest in a case where I have pipelined streaming calls on a cap, and then I .Release() it after sending the last call, inadvertently causing in-flight calls to be cancelled.

@lthibault, thoughts?

Louis Thibault · Answer 1 · Sat Jan 14 2023 04:01:14 GMT+0800 (China Standard Time)

I actually have some rather important code that depends on the existing semantics. Specifically, long-running method calls (forked via call.Go()) need a mechanism for detecting that the server is being shut down, so that they can return. Otherwise, the Server.Shutdown() method will block indefinitely.

Here's a potential solution though. We could take a page out of net/http's book, and add a .Context() method to the RPC's call parameter. The idea is that the ctx explicitly passed into Foo(ctx, call) corresponds to the server context, and call.Context() corresponds to the call's context. In your example, I think you would only want fut.Struct() to return the call context's error, as it would indicate a business-logic error in the server's method. The caller is already aware, as you've noted, of any cancelation signal to the server context.

In many ways, I think this would give us the best of both worlds. We would no longer have to fork a goroutine in handleCalls to bind the call and server contexts, and it can be useful to distinguish between "call is aborted" and "server is shutting down".

Ian Denhardt · Answer 2 · Sat Jan 14 2023 05:20:18 GMT+0800 (China Standard Time)

I have a visceral reaction to just exposing two separate contexts to app code. I'll think on it.

Louis Thibault · Answer 3 · Sat Jan 14 2023 05:49:36 GMT+0800 (China Standard Time)

Yeah, I had the same feeling at first. If I may anticipate your concerns, the thing that initially bothered me was the issue of deciding which context to pass to any subroutines in the RPC method. For example:

func (s someServer) Foo(ctx context.Context, call Fooer_foo) error {
    // ...

    err := s.doTheThing(ctx, arg1, arg2)  // ctx or call.Context()?

    // ...
}

But after some thought, I came to the conclusion that the solution is to just pass the call parameter to the subroutine along with ctx, so that the function can select against both.

If that is not viable for whatever reason, the contexts can be bound at the call site:

func (s someServer) Foo(ctx context.Context, call Fooer_foo) error {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()

    cherr := make(chan error, 1)
    go func() {
        cherr <- s.doTheThing(ctx, arg1, arg2)
    }()

    select {
    case <-call.Context().Done():
        // Report the call context's error; ctx itself may not be canceled yet.
        return call.Context().Err()
    case err := <-cherr:
        return err
    }
}

I think this is preferable to having the Server implicitly spawn a goroutine for each method call, most of which will be short-lived and/or be amenable to selecting against ctx and call.Context().

Ian Denhardt · Answer 4 · Sat Jan 14 2023 06:50:46 GMT+0800 (China Standard Time)

For posterity: we discussed this, and as it turns out @lthibault's code can just cancel the context used for the call (and it sounds like it has actually already changed in that direction?). So we've decided to move forward with my original proposal.

Ian Denhardt · Answer 5 · Sat Jan 14 2023 13:42:46 GMT+0800 (China Standard Time)

So while trying to implement this, I noticed a few places in the test suite actually rely on the property that if you release the last reference, Shutdown() will be called before Release() returns.

This seems like a very dubious thing to promise, since it only holds for local Servers, not other implementations of ClientHook; I think we should just rework the tests not to assume this, though I don't quite know how much will come out when I pull on that thread.

I think I'll fix the other bugs before coming back to this one.

Ian Denhardt · Answer 6 · Fri Jan 20 2023 12:59:04 GMT+0800 (China Standard Time)

Adding this to the 3.0 milestone, since it would be a breaking change.

Ian Denhardt · Answer 7 · Mon Apr 10 2023 09:48:31 GMT+0800 (China Standard Time)

So related to this, I realized that .Bootstrap() works similarly, in that if you do:

ctx, cancel := context.WithCancel(context.Background())
client := conn.Bootstrap(ctx)
fut, rel := Foo(client).Bar(context.Background(), nil)
cancel()

The call to Bar() may fail if it is pipelined on the answer to the bootstrap message, rather than sent after the bootstrap returns. I think we should adjust the behavior of rpc.bootstrapClient so that it is consistent with what every other implementation of ClientHook does here now (and update the docs for ClientHook.Shutdown(), which prescribe the semantics that bootstrapClient implements).

Louis Thibault · Answer 8 · Mon Apr 10 2023 23:30:39 GMT+0800 (China Standard Time)

Oh, I think this might explain the bug I mentioned a few days ago, wherein pipelined calls occasionally return "unimplemented".

Should we (re)open an issue for this?

Ian Denhardt · Answer 9 · Tue Apr 11 2023 04:07:13 GMT+0800 (China Standard Time)

Yeah, let's re-open.