Rate limiting for S3 compatible block storage

Question

jakubgs opened this issue 2 months ago

Describe the bug
We are using DigitalOcean Spaces, which is an S3-compatible storage solution, for storing metrics. This service limits the number of GET requests one can make to 800 per second. In situations where the cache is full we have seen errors like these:

ts=2024-03-18T21:42:10.696053983Z caller=bucket_client.go:135 level=error
msg="bucket operation fail after retries" err="503 Slow Down"
operation="GetRange fake/01HRSSQ403WA1RD7WX20X7E9KX/index (off: 113583688, length: 6568)"

This means the limit of 800 requests per second has been reached.

Expected behavior
According to DigitalOcean support the correct behavior would be something like this:

you can pause for 0.5s-1s after sending 200-300 requests which will surely help with this particular limit

The question is, would it make more sense to rate-limit requests being made to block storage rather than hit the limit and have to back off from making requests for longer? Or is hitting the backoff the correct and simpler way to handle this?

Ben Ye · Answer 1 · Mon Mar 25 2024 18:13:18 GMT+0800 (China Standard Time)

Thanks for reporting the issue.

The question is, would it make more sense to rate-limit requests being made to block storage rather than hit the limit and have to back off from making requests for longer? Or is hitting the backoff the correct and simpler way to handle this?

I understand that the current behavior is not ideal. However, this is not an easy problem to solve since Cortex has multiple microservices and multiple replicas sending requests to the object storage at the same time. Thus, it is pretty hard to do rate limiting on the client side, since what you actually need is a global rate limiter across all your Cortex pods.
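One rough workaround, short of a true global limiter, is to give each replica a fixed share of the provider's budget. The sketch below is only an illustration of that arithmetic: `perReplicaLimit`, the replica count of 4, and the 25% headroom are hypothetical, not Cortex settings.

```go
package main

import "fmt"

// perReplicaLimit splits a provider-wide request budget evenly across
// replicas, keeping a fraction (headroom) in reserve.
// This is a hypothetical helper, not a Cortex API.
func perReplicaLimit(globalPerSec, replicas int, headroom float64) int {
	if replicas <= 0 {
		return 0
	}
	usable := float64(globalPerSec) * (1 - headroom)
	return int(usable) / replicas
}

func main() {
	// 800 req/s from the issue; 4 replicas and 25% headroom are assumptions.
	fmt.Println(perReplicaLimit(800, 4, 0.25)) // prints 150
}
```

A static split like this is safe but wasteful: an idle replica's share cannot be used by a busy one, which is why a shared, global limiter is what the problem really calls for.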

From the error log provided, did you hit the rate limit from Store Gateway or other components? For most of the components I believe backoff and retry should be fine since they are not that latency sensitive.

Friedrich Gonzalez · Answer 2 · Tue Mar 26 2024 05:41:57 GMT+0800 (China Standard Time)

@jakubgs if you are using the mixin for cortex, there is a dashboard for object storage that shows which component is making the requests.

Something like this:

Along with the error/rate dashboards, etc.

If you have them, I would like to see them to understand which components and which operations are getting errors.

Jakub · Answer 3 · Tue Mar 26 2024 16:48:47 GMT+0800 (China Standard Time)

From the error log provided, did you hit the rate limit from Store Gateway or other components? For most of the components I believe backoff and retry should be fine since they are not that latency sensitive.

That's correct, the log is from a host running 3 services as one node: querier, compactor, store-gateway.

@jakubgs if you are using the mixin for cortex, there is a dashboard for object storage that shows which component is making the requests.

Sorry, I don't know what "mixin" is in this context.

Friedrich Gonzalez · Answer 4 · Tue Mar 26 2024 23:04:48 GMT+0800 (China Standard Time)

Sorry, I don't know what "mixin" is in this context.

The Cortex mixin contains dashboards and alerts. You can find the latest release at https://github.com/cortexproject/cortex-jsonnet/releases

Jakub · Answer 5 · Tue Mar 26 2024 23:45:56 GMT+0800 (China Standard Time)

Oh, no, I have my own dashboard. What is the metric name?

Friedrich Gonzalez · Answer 6 · Wed Mar 27 2024 00:07:33 GMT+0800 (China Standard Time)

https://github.com/cortexproject/cortex-jsonnet/blob/main/cortex-mixin/dashboards/object-store.libsonnet

^ look in there

Jakub · Answer 7 · Wed Mar 27 2024 16:19:37 GMT+0800 (China Standard Time)

There's not much happening honestly:

And yet I see the errors in the query node logs (querier, compactor, store-gateway):

jakubgs@query-01.do-ams3.metrics.hq:~ % j --since '1 day ago' -ocat -u cortex --grep '503 Slow Down' | wc -l
859

Jakub · Answer 8 · Wed Mar 27 2024 16:20:20 GMT+0800 (China Standard Time)

I can see an interesting spike last month:

But that's not really relevant since I still see errors today.

Jakub · Answer 9 · Wed Mar 27 2024 16:25:56 GMT+0800 (China Standard Time)

Actually, there are 859 errors in the last hour:

jakubgs@query-01.do-ams3.metrics.hq:~ % j -ocat -u cortex --since '1 hour ago' --grep '503 Slow Down' | wc -l
859

But the graph shows a low number of requests:

Seems wrong.

Friedrich Gonzalez · Answer 10 · Thu Mar 28 2024 00:20:29 GMT+0800 (China Standard Time)

Thanks for sharing. It looks like you don't have that many requests, to be honest. The ones that are concerning are the querier and store-gateway errors.

To reduce queries to block storage, make sure you have:

- Bucket index enabled.
- Enough caching configured. Cortex can use 4 types of caches; you want all 4 enabled.