grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC

Home Page: https://grpc.io

Is it ok to call grpc.UnaryInvoker multiple times inside an interceptor?

s-matyukevich opened this issue · comments

We have a legacy retry interceptor that does something like this (simplified for clarity):

func UnaryClientRetryInterceptor() grpc.UnaryClientInterceptor {
	return func(
		ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption,
	) (err error) {
		err = invoker(ctx, method, req, reply, cc, opts...)
		if err != nil {
			// Retry once on any error.
			err = invoker(ctx, method, req, reply, cc, opts...)
		}
		return err
	}
}

I know we could use gRPC's native retries instead, but the code above seems to work fine. My question is whether that is just a coincidence and we are abusing the gRPC interceptor interface here, or whether interceptors are allowed to be called this way and we can assume this code will keep working in future gRPC releases.

To give you more context: recently there was an attempt to modify this interceptor to issue the second request in parallel, to implement hedging. It still works, but we discovered some data races in our tests, which led us to the question described in this issue. Now we are considering whether we should instead contribute a hedging implementation to grpc-go according to https://github.com/grpc/proposal/blob/master/A6-client-retries.md (it looks like it is still missing in grpc-go, is that correct?). One additional feature that this gRFC is missing is the ability to dynamically set hedgingDelay based on p99 latency. I think we can solve this with a custom resolver and interceptor combination that measures latency, calculates the p99 percentile, and updates the service config. Does this sound like the right approach?

I know we could use gRPC's native retries instead, but the code above seems to work fine. My question is whether that is just a coincidence and we are abusing the gRPC interceptor interface here, or whether interceptors are allowed to be called this way and we can assume this code will keep working in future gRPC releases.

It didn't take me long to find open source code that calls invoker more than once: https://github.com/grpc-ecosystem/go-grpc-middleware/blob/main/interceptors/retry/retry.go. Changing this would likely break a lot of users.

(it looks like it is still missing in grpc-go, is that correct?) One additional feature that this gRFC is missing is the ability to dynamically set hedgingDelay based on p99 latency. I think we can solve this with a custom resolver and interceptor combination that measures latency, calculates the p99 percentile, and updates the service config. Does this sound like the right approach?

Implementing hedging based on A6 in Go would certainly be useful. It'd be nice to hear about how that has played out for Java. Regarding the custom resolver to tune the delay based on measured latency, the feature itself sounds useful, but having an interceptor (I think that'd even need to be a stats handler to measure attempts latency precisely) that feeds back to a resolver feels unnecessarily complicated to me.

So, there are two questions:

  • for the overall gRPC project: whether there would be interest in dynamically tuned hedging as an extension to A6. Perhaps @ejona86 has input to give on this one.
  • for grpc-go specifically: whether or not it is OK to call the invoker concurrently in an interceptor (it would also be interesting to know whether that works in other languages). IIUC it can currently result in data races.

In Java, you can call the next Channel (the equivalent of the invoker) multiple times concurrently. We originally didn't allow calling it multiple times, but changed that a few months later (this was in 2015) to allow multiple calls. It didn't matter to grpc-java's architecture whether the multiple calls were concurrent or sequential.

I'd say dynamically tuned hedging inside gRPC would probably be seen as premature. Java is still the only language to have hedging at all, I believe, and it has taken some effort to shake out the bugs. It seems not many people have used it, either. There are some things that would need to be worked out, like the histogram bins, the minimum number of RPCs before it takes effect, and dangers when the service's latency suddenly increases.

It might be a bit easier to have a name resolver and load balancer coordinate (rather than an interceptor), since the NR can already pass objects to the LB. There's some complexity there, but honestly it feels pretty tame to me; it is mostly boilerplate and maybe working in unfamiliar APIs. It seems it'd be pretty reliable and not cause random bugs. That seems like a good way to prototype, and I don't think much of it would be throwaway.

for grpc-go specifically: whether or not it is OK to call the invoker concurrently in an interceptor (it would also be interesting to know whether that works in other languages). IIUC it can currently result in data races.

Where are you thinking the races would occur? That isn't something I'd expect.

We can add tests for both of these cases (concurrent & successive calls to the invoker) if you are interested in making sure we guarantee this.

There are some things that would need to be worked out

Another big issue I see with our current hedging design is that there is no way to ask the LB policy to avoid using backends that are already processing the same RPC, or alternatively having the channel skip the hedge attempt if the same subchannel was chosen.

@dfawley Here is the file with the full stacktrace that describes the race
datarace.txt

The race happens inside HeaderCallOption which is used by another interceptor. We can provide a simple reproduction for this if it is useful.

Ah, that makes sense... yes, some of our call options for unary RPCs use pointers to communicate results back to the call site. In this case, I don't think we can support concurrent calls into the invoker with those options. You'd need to detect them and substitute your own, with your own synchronization around it.

E.g.

// Each hedging attempt:
// First copy "opts" from the input CallOptions, then:
var myHeader metadata.MD
for i, o := range opts {
	if _, ok := o.(grpc.HeaderCallOption); ok {
		opts[i] = grpc.Header(&myHeader)
	}
}
// Call the invoker; on success, copy `myHeader` into the pointer
// held by the user's original HeaderCallOption.

This is not simple, unfortunately, since there are also TrailerCallOption, PeerCallOption, OnFinishCallOption, etc., and you'd never be able to be fully future-proof.
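The "substitute per attempt, copy the winner back once" shape of that suggestion can be shown in a self-contained sketch. All types here (md, headerOption, attempt, hedgedCall) are stand-ins invented for illustration, not the real grpc-go API:

```go
package main

import (
	"fmt"
	"sync"
)

// md is a stand-in for metadata.MD, and headerOption for
// grpc.HeaderCallOption: it carries a pointer the RPC writes into.
type md map[string]string

type headerOption struct{ addr *md }

// attempt stands in for one RPC attempt that fills in headers.
func attempt(id int, out *md) error {
	*out = md{"attempt": fmt.Sprint(id)}
	return nil
}

// hedgedCall runs n attempts concurrently, giving each its own
// private header destination, then copies the first winner's
// headers into the caller's option exactly once. This avoids the
// data race of multiple attempts writing through the same pointer.
func hedgedCall(n int, user headerOption) error {
	var (
		once sync.Once
		wg   sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			var private md // per-attempt substitute destination
			if err := attempt(id, &private); err == nil {
				once.Do(func() { *user.addr = private })
			}
		}(i)
	}
	wg.Wait()
	return nil
}

func main() {
	var h md
	_ = hedgedCall(3, headerOption{addr: &h})
	fmt.Println(len(h)) // prints 1: one attempt's headers copied back
}
```

As noted above, the hard part in the real API is enumerating every pointer-carrying CallOption, which is why this approach can never be fully future-proof.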

Thanks for the answers! One last question: if we decide to contribute a hedging implementation to grpc-go according to the existing gRFC (without the dynamic hedging delay), do you foresee any blockers or technical issues? Specifically, I am thinking about this comment:

Another big issue I see with our current hedging design is that there is no way to ask the LB policy to avoid using backends that are already processing the same RPC, or alternatively having the channel skip the hedge attempt if the same subchannel was chosen.

Maybe we can simply ignore this and rely on the fact that the probability of choosing the same backend is low? Also, I think the same problem applies to normal retries as well.

Specifically, I am thinking about this comment

This is regarding the design itself, and yes, it applies to basic retry as well. That issue would not be a blocker to doing the implementation according to the design.

You should probably wait for #7356 to be done before embarking on any implementation work for this feature, as it would probably impact this to some extent.