How to deal with Inconsistent execution timeout?

Question

yangxikun opened this issue 5 years ago

My backend service has many interfaces. Most of them have a timeout threshold of 2s, but some should have a threshold of 5s.
The execution timeout config (ExecutionConfig.Timeout) is static.

Can the execution timeout be dynamic? Or could Circuit.checkErrTimeout check whether the error returned from runFunc func(context.Context) error is a timeout error, so that it can be collected by Circuit.CmdMetricCollector.ErrTimeout? Or is there any other good idea? Thanks~

Jack Lindamood · Answer 1 · Mon Sep 09 2019 01:11:42 GMT+0800 (China Standard Time)

If you want a different timeout per circuit, you can do that with https://github.com/cep21/circuit#configuration-factories. Notice how myFactory is a function that takes a circuit name. You can return any config you want for that circuit.

myFactory := func(circuitName string) circuit.Config

If you want the same circuit to have different timeouts, that's a bit strange. Usually people use a different circuit for different configurations.

Let me know if this helps.

roketyyang · Answer 2 · Mon Sep 09 2019 10:30:10 GMT+0800 (China Standard Time)

If I use different circuits for different timeouts, the error metric cannot be shared, so the ClosedToOpenFactory and OpenToClosedFactory config becomes more complicated.
How about adding an ErrTimeout interface in errors.go, like BadRequest:

type ErrTimeout interface {
	ErrTimeout() bool
}

and check the error returned from runFunc func(context.Context) error in checkErrTimeout, the same way checkErrBadRequest does.

Jack Lindamood · Answer 3 · Mon Sep 09 2019 13:56:47 GMT+0800 (China Standard Time)

My backend service has many interfaces

So it's the same backend RPC service, but you want to talk to it with many different interfaces. Does each interface do a different thing?

I've always used different circuits for that. What if your backend service is only partially broken, where one of its functions fails but the others work fine? You wouldn't want to fail all calls to the service, just the calls that are failing.

If I use different circuits for different timeouts, the error metric cannot be shared

What system are you using to render metrics: prometheus, signalfx, datadog, etc? Most have a way to aggregate metrics.

and check the error returned from runFunc func(context.Context) error in checkErrTimeout

You're proposing that runFunc can return a custom error that the circuit considers the same as a timeout?

roketyyang · Answer 4 · Mon Sep 09 2019 19:16:00 GMT+0800 (China Standard Time)

Different interfaces have different business logic, and some of it takes more time. In my scenario, if one interface fails, the others will almost certainly fail.
By "the error metric cannot be shared" I mean the circuit's internal error metric, which decides whether to transition to the open state. I use Prometheus + Grafana to render metrics.
Yes, like BadRequest.

Jack Lindamood · Answer 5 · Thu Sep 12 2019 02:24:40 GMT+0800 (China Standard Time)

I think the request is simple enough, but I hope I can convince you it's a bad idea :)

At the last place I worked, we had lots of site issues because a circuit was around too many things. People would ask "why did the whole site not work when only this tiny part was broken?" The answer was that the person put a circuit around too many things.

In my scenario, if one interface fails, the others will almost certainly fail.

Are you 100% sure, or just 99.9% sure? And are you 100% sure it will forever be like that, even 2 years from now? What if the backend service changes a year from now? What if it's reimplemented with a different datastore or a different way?

Here are two examples where we had a circuit around multiple endpoints and thought the same thing as you.

DynamoDB: We had a circuit around a DynamoDB table. That seems like it fits your definition, right? Except one of the endpoints used a global secondary index and another did not. So a situation happened where we overloaded a global secondary index and caused all traffic to throttle for that DynamoDB table, when we only needed the queries that used the index to throttle.
MySQL: There was a circuit around a MySQL database. Not just a database, but the same table in the database. We also thought this was fine, except one of the endpoints did its queries a bit differently than the other, and they were optimized by MySQL to use a different index. Even though the circuit was for the same table, we wish we had two circuits: one for each type of query. That way, only the really slow queries would start to get throttled.

I strongly encourage you to use more, not fewer, circuits. Yes, it takes a few more queries for the circuits to close, but that number is O(1). Besides future proofing, you also get individual metrics around each of these endpoints (from the circuit), which makes them easier to monitor.

Jack Lindamood · Answer 6 · Thu Sep 12 2019 02:27:59 GMT+0800 (China Standard Time)

I mean the circuit's internal error metric, which decides whether to transition to the open state

If you really wanted to do that, one way is to implement type ClosedToOpen interface such that it is a singleton. Then, inside type GeneralConfig struct, the function ClosedToOpenFactory func() ClosedToOpen could return the same object.

If you want more granular control, you could look at the name of the circuit inside type CommandPropertiesConstructor func(circuitName string) Config and, for the circuits you want to share metrics, return configs whose ClosedToOpenFactory all return the same ClosedToOpen.
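
The singleton idea can be sketched as follows. This is only an illustration under simplifying assumptions: the ClosedToOpen interface here is a cut-down local stand-in with two simplified methods (the library's real interface has more), and sharedOpener, newSharedFactory, and the threshold value are invented names. What it shows is the key point: a factory that returns the same object makes every circuit contribute to, and consult, one error count.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ClosedToOpen is a cut-down stand-in for the library interface of the
// same name, declared locally so this sketch compiles on its own.
type ClosedToOpen interface {
	ErrTimeout()      // record one timed-out call
	ShouldOpen() bool // report whether the circuit should open
}

// sharedOpener counts timeouts from every circuit that holds it.
type sharedOpener struct {
	timeouts  int64
	threshold int64
}

func (s *sharedOpener) ErrTimeout()      { atomic.AddInt64(&s.timeouts, 1) }
func (s *sharedOpener) ShouldOpen() bool { return atomic.LoadInt64(&s.timeouts) >= s.threshold }

// newSharedFactory returns a factory of shape func() ClosedToOpen that
// hands out the same opener on every call.
func newSharedFactory(threshold int64) func() ClosedToOpen {
	shared := &sharedOpener{threshold: threshold}
	return func() ClosedToOpen { return shared }
}

func main() {
	factory := newSharedFactory(3)
	fast := factory() // would back the 2s circuit
	slow := factory() // would back the 5s circuit

	fast.ErrTimeout()
	fast.ErrTimeout()
	slow.ErrTimeout()

	// Both views report the combined count of three timeouts.
	fmt.Println(fast.ShouldOpen(), slow.ShouldOpen())
}
```

The same pattern would apply to the OpenToClosed side if both circuits should also recover together.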

roketyyang · Answer 7 · Thu Sep 12 2019 21:48:31 GMT+0800 (China Standard Time)

Thank you for your advice. I will consider using two circuits with singleton ClosedToOpen and OpenToClosed implementations.