matthiasmullie / scrapbook

PHP cache library, with adapters for e.g. Memcached, Redis, Couchbase, APC(u), SQL and additional capabilities (e.g. transactions, stampede protection) built on top.

Home Page: https://www.scrapbook.cash


Stampede not working for phpredis > 4.0.0

ValiNiculae opened this issue

Hi!

There seems to be an issue with the StampedeProtector since phpredis 4.0.0.
The exists method now returns an int instead of TRUE/FALSE: https://github.com/phpredis/phpredis#exists
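
A minimal sketch of a version-agnostic check (the keyExists() helper is hypothetical, not Scrapbook's actual fix; the cast works because we only ever check a single key):

```php
<?php

// Hypothetical helper: phpredis' exists() returns a bool before
// 4.0.0 and an int (the number of existing keys) from 4.0.0 on.
// When checking a single key, casting to bool behaves identically
// on both versions.
function keyExists(\Redis $client, string $key): bool
{
    return (bool) $client->exists($key);
}
```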

Another issue I found is with the TTL of the stampede protection. We set it in milliseconds in the constructor, but when we do $success[$key] = $this->cache->add($this->stampedeKey($key), '', $this->sla); (in StampedeProtector::protect()), it is treated as seconds.
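
To illustrate the mismatch, a stripped-down sketch (only add()'s signature mimics Scrapbook's KeyValueStore; the rest is illustrative):

```php
<?php

// Stripped-down sketch of the mismatch; only add()'s signature
// mimics Scrapbook's KeyValueStore, everything else is illustrative.
interface KeyValueStore
{
    // $expire is a relative time in *seconds* (or a unix timestamp).
    public function add(string $key, mixed $value, int $expire = 0): bool;
}

function protect(KeyValueStore $cache, string $key, int $sla): bool
{
    // $sla is documented in milliseconds (e.g. 1000 = 1s), but it is
    // passed straight into add(), which treats it as seconds: a
    // 1000ms SLA becomes a 1000-second lock.
    return $cache->add('stampede.' . $key, '', $sla);
}
```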

I will open a pull request to fix/discuss these issues.

Thanks!

Thanks for spotting this, and apologies for not having seen this & responding earlier.

There was, however, a fix submitted a while ago for the changed exists return value: 31f4578 (done in a way that remains compatible with older phpredis versions)

And another nice catch with the StampedeProtector TTL! It's a bit of an annoying one to fix, because there's no way to "fix it right".

Right now, anyone passing in a milliseconds value gets locks that stick around for far too long (the milliseconds amount is interpreted as seconds). For the most part, that's probably not a huge deal: once another process has completed & stored the value, the overlong lock becomes irrelevant. I.e., most users never even notice.
If we were to change the constructor to seconds, though, their existing milliseconds input would have a much more noticeable effect: their code would now sleep for a really long time before polling again (e.g. 100 seconds rather than 0.1s), and they could end up with too many concurrent processes. Basically: we can't move towards seconds without expecting users to update their calling code, so this would effectively be a breaking change.

OTOH, sticking with milliseconds is simply wrong.

I'll merge the fix now that I'm about to roll out a new major release.

After thinking things over some more last night, I think we should stick with milliseconds.
2 things are affected by that time:

  • it determines how long a protective lock is kept (to signal to other processes that the value is already being worked on)
  • it determines how long these other processes "sleep" while waiting for that value to become available (see the sketch below)
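
In code, the polling side could look something like this (a hypothetical sketch, not Scrapbook's actual implementation):

```php
<?php

// Hypothetical polling loop: wait up to $slaMs milliseconds for
// another process to store the value, checking every $stepMs
// milliseconds. A milliseconds SLA is what makes these short
// sleep intervals meaningful.
function waitFor(callable $valueIsAvailable, int $slaMs, int $stepMs = 25): bool
{
    for ($waited = 0; $waited < $slaMs; $waited += $stepMs) {
        usleep($stepMs * 1000); // usleep() takes microseconds

        if ($valueIsAvailable()) {
            return true; // another process has stored the value
        }
    }

    return false; // SLA exceeded; the caller should compute the value itself
}
```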

Seconds have no benefit over milliseconds, but milliseconds do have one over seconds: they make it possible to set a shorter lock/protective time. The "ideal" lock time is essentially determined by 2 variable factors:

  • how fast another process is able to compute & cache the new value (too short of a lock and the stampede goes through)
  • how many other processes can remain open simultaneously until the value becomes available (too long = too many processes & they fail) - while those processes are mostly just idling (which is still a step up compared to no stampede protection - at least they're not all doing intensive work), there may end up being too many of them, depending on available infrastructure (e.g. Apache2 defaults to only allowing 150 concurrent connections)

When a stampede happens, it is very likely that there's a high number of incoming requests. And the average response time for a relatively responsive application is usually not over a couple hundred milliseconds. Ergo, being able to set a sub-second TTL may be important to achieve a good balance between both factors in certain applications. Usually, a 1-or-more-seconds TTL will be just fine, but that will remain possible with milliseconds support as well.

But the catch: we can't acquire a sub-second lock.
The only thing we can do is basically round the TTL up to the nearest second. And... that's just fine!
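
Concretely, the rounding could look like this (a sketch; the point is the ceil(), everything around it is assumed):

```php
<?php

// Sketch: the backends only accept whole-second expiry times, so
// round the milliseconds SLA up rather than truncating it to 0.
function lockExpire(int $slaMs): int
{
    return (int) ceil($slaMs / 1000);
}

// e.g. lockExpire(500) === 1, lockExpire(1000) === 1, lockExpire(1500) === 2
```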

Imagine a sub-second lock TTL (e.g. 500ms).

  • 0ms: Request 1 comes in, data is not available, creates a lock for 1s (the 500ms rounded up).
  • 25ms: Request 2 comes in, data is not available, but there's a lock. Waits a bit.
  • 75ms: Request 2 polls the cache; still not available. Waits a bit.
  • 125ms: Request 2 polls the cache; still not available. Waits a bit.
  • 150ms: Request 3 comes in, data is not available, but there's a lock. Waits a bit.
  • 175ms: Request 2 polls the cache; still not available. Waits a bit.
  • 200ms: Request 3 polls the cache; still not available. Waits a bit.
  • 210ms: Request 1 completes & stores data in cache (lock remains). Wraps up remaining work & completes.
  • 225ms: Request 2 polls the cache; data is now available. Wraps up remaining work & completes.
  • 250ms: Request 3 polls the cache; data is now available. Wraps up remaining work & completes.
  • 400ms: Request 4 comes in, data is available. Wraps up remaining work & completes. Lock is still around, but didn't matter in this case.
  • 1000ms: Lock disappears.

The only case where "the lock sticking around for longer than it was supposed to" becomes a problem is when request 1 (the one that created the lock & was supposed to store the new data) didn't complete its job (e.g. crashed).
If that happens, all new requests coming in still assume that some other process is working on it (because the lock is there), when in fact that's not the case. It would be better for the lock to be removed, so that another process can pick up the work.
That doesn't change with moving from milliseconds to seconds, though; then, too, those other processes would be stuck waiting out the remainder of the second.

(Of course, it's worse in the current incorrect implementation - the lock is held for much, much longer; request 1 not being able to fulfill its job has longer-lasting impact, and that needs to be fixed)

I'm going to stick with a milliseconds SLA, but will fix the lock time so that it's held for the "correct" time (the milliseconds rounded up to whole seconds).

Does that make sense?