ably / ably-asset-tracking-android

Android client SDKs for the Ably Asset Tracking service.

Handle presence.enter() retries in a special way

KacperKluka opened this issue · comments

It turns out that the general definition of fatal errors does not work well for the presence.enter() operation. That's because when the connection is suspended the operation can return a 4xx error that should not be treated as fatal. Therefore, we should apply special error-handling logic when working with ably.connect() (which tries to enter presence) and other presence.enter() operations. This should be applied at least in the AddTrackableWorker and the RetryEnterPresenceWorker workers.

@paddybyers mentioned that there are 2 rare errors that should be treated as fatal:

  • authentication token revoked
  • presence enter limit exceeded

In those cases, we should probably treat the error as fatal and not retry the operation.
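A minimal Kotlin sketch of what that check could look like in those workers. ErrorInfo is ably-java's io.ably.lib.types.ErrorInfo; the concrete error codes for the two fatal cases aren't given in this thread, so they are left as parameters rather than guessed:

```kotlin
import io.ably.lib.types.ErrorInfo

/**
 * Decides whether a failed presence.enter() should be treated as fatal.
 * The two fatal causes are the ones listed above; their concrete Ably error
 * codes are not stated in this thread, so they are passed in as parameters.
 * Everything else - including 4xx errors returned while the connection is
 * suspended - is treated as retriable by the workers.
 */
fun isFatalPresenceEnterError(
    error: ErrorInfo,
    tokenRevokedCode: Int,
    presenceLimitExceededCode: Int
): Boolean =
    error.code == tokenRevokedCode || error.code == presenceLimitExceededCode
```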

@jaley @paddybyers please correct me or add more details if needed 🙇

@KacperKluka Given that in #973 we're saying that we won't attempt to perform Ably operations (which presumably includes presence.enter()) whilst the connection is suspended, how might we expect the scenario described here to be triggered?

I guess the connection can change while the operation is running, right? Then the error that we receive might be one of the fatal or non-fatal ones and we should know whether to retry the enter() or not 🤔

Good point.

So, if I've understood correctly, we're saying that the presence enter operation should be retried whenever it results in a 4xx or 5xx error, unless it's one of the following:

  • authentication token revoked
  • presence enter limit exceeded

@paddybyers would you be able to confirm please?

No, sorry, that's not what I said.

5xx errors are always retriable.
4xx errors are non-retriable, and should be considered fatal, except for operations (publish/enter) that fail as a result of being suspended. These need to be retried, but only after being connected again.
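Expressed as a rough Kotlin sketch (the type names and the failedWhileSuspended flag are illustrative assumptions, not AAT's actual types):

```kotlin
// Possible classification of an operation failure, per the rule above.
sealed class RetryDecision {
    object RetryNow : RetryDecision()            // 5xx: always retriable
    object RetryWhenConnected : RetryDecision()  // 4xx caused by being suspended
    object Fatal : RetryDecision()               // any other 4xx
}

fun classify(statusCode: Int, failedWhileSuspended: Boolean): RetryDecision = when {
    statusCode in 500..599 -> RetryDecision.RetryNow
    statusCode in 400..499 && failedWhileSuspended -> RetryDecision.RetryWhenConnected
    statusCode in 400..499 -> RetryDecision.Fatal
    // Errors outside the 4xx/5xx ranges are not covered by this rule; they fall
    // back to whatever the existing general error handling does.
    else -> RetryDecision.RetryNow
}
```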

What I said this morning is that there aren't very many 4xx fatal errors that we would realistically expect to encounter in practice. Possible causes of a real non-retriable error are:

  • the credentials have been revoked;
  • presence enter() fails because the presence limit for the channel has been exceeded.

The other thing to bear in mind is what "non-retriable" means. If a publish attempt is rejected as a result of a rate limit, for example - this will be a 429 status code - this means you shouldn't retry that operation. But it doesn't mean that the trackable itself is broken. So you need to understand that an indication of failure implicitly has a scope - it means that a specific operation failed, not that all operations will fail.

There is a final exception to the 4xx rule, which relates to token expiry - an error with a 401 status code and a code in the range 40140..40160. The ably-java library will attempt to resolve it by going through a cycle of token renewal, and will retry. If the error propagates back to AAT, it means that it failed even after the token renewal, so AAT shouldn't itself retry.
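A small sketch of that check, assuming ably-java's ErrorInfo with its statusCode and code fields; if it matches, AAT would treat the failure as fatal instead of retrying:

```kotlin
import io.ably.lib.types.ErrorInfo

// If a token-related error (401 status, code 40140..40160) reaches AAT,
// ably-java has already renewed the token and retried, so AAT should not
// retry the operation again itself.
fun isTokenErrorAfterRenewal(error: ErrorInfo): Boolean =
    error.statusCode == 401 && error.code in 40140..40160
```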

The other thing to bear in mind is what "non-retriable" means. If a publish attempt is rejected as a result of a rate limit, for example - this will be a 429 status code - this means you shouldn't retry that operation. But it doesn't mean that the trackable itself is broken. So you need to understand that an indication of failure implicitly has a scope - it means that a specific operation failed, not that all operations will fail.

I'm struggling to understand this. In the case where a publish is rejected due to a rate limit, then one of these two things is true:

  1. We can continue trying to publish location updates at any moment – in which case, wouldn't it be fine to retry the failed publish?
  2. There is a period of time in which all location update publishes are going to fail – in which case, does that not mean that the publisher is “broken”?
  1. We can continue trying to publish location updates at any moment – in which case, wouldn't it be fine to retry the failed publish?

If you're hitting a rate limit, then you don't want to be retrying individual location updates - all that will do is increase the rate of publication. Because every location update will be superseded after a few seconds, it's fine to discard the update that failed, and send the next one (with the added skipped locations).
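To illustrate the idea (the type and property names below are loose assumptions modelled on AAT's location updates, not its actual API):

```kotlin
// Illustrative sketch: discard a rate-limited publish and fold its location
// into the skipped locations of the next update, rather than retrying it.
data class Location(val lat: Double, val lng: Double)
data class LocationUpdate(val location: Location, val skippedLocations: List<Location>)

class LocationPublishBuffer {
    private val skipped = mutableListOf<Location>()

    // Called when a publish is rejected (e.g. with a 429): keep the location so
    // it can be attached to the next update instead of being retried now.
    fun onPublishRejected(update: LocationUpdate) {
        skipped += update.skippedLocations
        skipped += update.location
    }

    // Called when the next location fix arrives: send it together with every
    // location skipped since the last successful publish.
    fun nextUpdate(location: Location): LocationUpdate =
        LocationUpdate(location, skipped.toList()).also { skipped.clear() }
}
```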

2. There is a period of time in which all location update publishes are going to fail – in which case, does that not mean that the publisher is “broken”?

When a limit is enforced, then some or all attempted operations will fail. The publisher has to have a sensible policy for how and what it retries, but the problem isn't with the publisher - it's with the limits on the account. I don't think the publisher is broken in this case.

There is a final exception to the 4xx rule, which relates to token expiry - an error with a 401 status code and a code in the range 40140..40160. The ably-java library will attempt to resolve it by going through a cycle of token renewal, and will retry. If the error propagates back to AAT, it means that it failed even after the token renewal, so AAT shouldn't itself retry.

Sorry, I'm also confused by this one. You're saying that the (401 status code + code in range 40140..40160) error should be considered non-retriable, right? Why do you consider this an exception to the 4xx rule, which says that "4xx errors are non-retriable, and should be considered fatal except for operations (publish/enter) that fail as a result of being suspended"?

Why do you consider this an exception to the 4xx rule

Sorry for not being clearer.

From the AAT POV it's not an exception.

From the POV of the ably library spec it is an exception, because there is a retry following a token renewal cycle (for example in https://sdk.ably.com/builds/ably/specification/main/features/#RTN14b)

Ah, got it, thanks! So it sounds like the only work that needs doing for this issue is to make sure that we retry (upon the channel becoming attached again) a publish / presence operation if it fails due to the channel being suspended, is that right?
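Something along these lines, perhaps - a rough sketch assuming ably-java's Channel / ChannelEvent / CompletionListener APIs, and leaving open how the suspension-caused failure is detected in the first place:

```kotlin
import io.ably.lib.realtime.Channel
import io.ably.lib.realtime.ChannelEvent
import io.ably.lib.realtime.CompletionListener
import io.ably.lib.types.ErrorInfo

// When a presence enter fails because the channel was suspended, wait for the
// channel to become attached again and then retry the enter once.
fun retryEnterWhenAttached(channel: Channel, presenceData: Any?) {
    channel.once(ChannelEvent.attached) {
        channel.presence.enter(presenceData, object : CompletionListener {
            override fun onSuccess() { /* presence entered after re-attach */ }
            override fun onError(reason: ErrorInfo) { /* classify again: fatal vs retriable */ }
        })
    }
}
```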

How would you suggest that we detect this error case? Is it by looking for this error?

@davyskiba is adding handling for a presence enter error in #981. As for publish, there's an ongoing conversation here about how we want to handle it: #981 (comment).