Notes for design: Rate Limiting
jamesmunns opened this issue
As part of the current milestone, we'd like to add basic rate limiting options.
https://blog.nginx.org/blog/rate-limiting-nginx is a good basic read on how NGINX does this.
There are a couple of concepts we need to define:
- The "key" - or what we use to match a request to the rate limiting rule
- Could be source IP, request URI, or other items
- TODO: This is probably another area where we want to define a common "format-like" syntax (also needed for load balancing with hash algorithms), so you can specify things like `$src_ip/$uri` or something
- TODO: We probably need to define how to handle a request that could potentially match two keys, but PROBABLY the answer is "rules are applied in the order they are applicable, first match wins" - note that nginx applies all matching rules and takes the most restrictive result!
- The "rule" - or what policies we follow to decide what to do with each request. The three outcomes of a rule are:
- Forward immediately - allow the request to continue immediately
- Delay - don't serve the request immediately, but wait for some amount of time (more on this later) before forwarding
- Reject - Immediately respond with a 503 or similar "too much" error response
- The "Rate", which actually has multiple components, if we are implementing a leaky bucket style of rate limiting (what NGINX does)
- The "active in-flight" count - how many outstanding forwarded requests can we have at one time?
- The "delayed in-flight" count - how many requests will we hold on to at one time before rejecting?
- The "delay to active promotion rate" - how often do we "pop" a delayed request to the "active" queue?
- NOTE: These are SLIGHTLY different than the `rate`, `burst`, and `delay` terms from nginx! (A sketch of these three components as a struct follows this list.)
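As a rough sketch, the three components above might be captured in a config struct like this (all names here are placeholders, not settled configuration):

```rust
use std::time::Duration;

/// Hypothetical per-rule rate configuration; field names are
/// placeholders for the three components described above.
struct RateConfig {
    /// How many outstanding forwarded requests can we have at one time?
    max_active_inflight: usize,
    /// How many requests will we hold (delayed) before rejecting?
    max_delayed_inflight: usize,
    /// How often do we "pop" a delayed request to the "active" queue,
    /// e.g. one promotion every 100ms
    promotion_interval: Duration,
}
```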
Unlike NGINX, I don't think we currently need to consider the "zone" or "scope" of the rules - I intend for rate limiting to be per-service, which means that the "zone" is essentially the tuple of (service, key)
I have checked with @eaufavor, and the correct way to handle delayed requests in pingora is to have the task yield (for a timer, or activation queue, etc).
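In practice that yield could be as simple as awaiting a timer (a minimal sketch, assuming the tokio runtime that pingora uses):

```rust
use std::time::Duration;

/// Sketch: a delayed request just awaits a timer before being forwarded.
/// Awaiting yields the task, so other requests keep making progress.
async fn apply_delay(delay: Duration) {
    tokio::time::sleep(delay).await;
}
```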
Maybe you are also interested in looking into Introduction to Traffic Shaping Using HAProxy, which has similar features to nginx.
Traffic shaping has been available since HAProxy 2.7: https://www.haproxy.com/blog/announcing-haproxy-2-7
@git001 thanks for the pointer, I'll check it out!
An additional detail that was surfaced during discussion was that there are two main "orientations" for rate limiting:
- We want to limit the downstream connections, in order to limit individual peers from making excess requests
- We want to limit the upstream connections, to prevent individual proxied servers from being overwhelmed
This could maybe be implemented in a way that is transparent: if we use the proposed formatting key for the matching, each orientation can be matched against independently. For example, if we have three "rules":
- `src` -> The source IP of the downstream request
- `uri` -> The request URI path
- `dst` -> The selected upstream address
In this case, I think that all connections would need to obtain a token from ALL THREE to proceed.
So let's say that downstream `10.0.0.1` wants to request `/api/whatever`, and upstream `192.0.0.1` is selected as the upstream.
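Concretely, the keys consulted for that request might look like this (a sketch; the exact key encoding is undecided):

```rust
// Hypothetical rendered keys for the example request above
let keys = [
    Key::Format("src:10.0.0.1".to_string()),
    Key::Format("uri:/api/whatever".to_string()),
    Key::Format("dst:192.0.0.1".to_string()),
];
```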
For each of these keys, we'd need to:
- Make sure that the key exists in the global table of rate limiters
- If the key does not exist, create it
- Get a handle for each of the rate limiters
- Attempt to enqueue ourselves in the list for each:
- If any of them are "overfilled", then immediately bail
- Otherwise begin awaiting on all rate limiters
- Once all rate limiting is complete, then issue the request
My initial thought for this is to use something like the `leaky_bucket` crate, and place the rate limiting table in something like:
```rust
use std::collections::BTreeMap;
use std::net::Ipv4Addr;
use std::sync::{atomic::AtomicU64, Arc, RwLock};

use leaky_bucket::RateLimiter;

// Things that can be rate limiting keys
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum Key {
    Ipv4(Ipv4Addr),
    // A rendered "format-like" key, e.g. from "$src_ip/$uri"
    Format(String),
    // ...
}

struct LimiterPayload {
    // Some kind of record of last use? For culling?
    last_used: AtomicU64,
    // The actual rate limiter
    limiter: RateLimiter,
}

struct Context {
    limiter_map: RwLock<BTreeMap<Key, Arc<LimiterPayload>>>,
}
```
The behavior would have to be something like:
```rust
let mut limiters = vec![];
for rule in self.rules.iter() {
    let key = generate_key(rule, &request_context);
    // This probably could optimistically get a read lock, then upgrade to
    // a write lock if the key does not exist (see the sketch below)
    let limiter = service_context.get_or_create(key);
    limiters.push(limiter);
}

let mut pending_tokens = vec![];
for limiter in limiters {
    match limiter.get_token() {
        Ok(None) => {} // immediately allowed
        Ok(Some(fut)) => {
            // Not immediately allowed
            pending_tokens.push(fut);
        }
        Err(_e) => {
            // Too full: immediately return a 503 or similar error
            return Err(503);
        }
    }
}

// futures::future::join_all, FuturesUnordered, or something
futures::future::join_all(pending_tokens).await;
Ok(...)
```
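For reference, that `get_or_create` step could be a read-then-write pattern along these lines (a sketch: `std::sync::RwLock` can't actually be upgraded in place, so the write path re-checks the map; `new_limiter` is a hypothetical constructor):

```rust
impl Context {
    fn get_or_create(&self, key: Key) -> Arc<LimiterPayload> {
        // Fast path: the limiter already exists, take only the read lock
        if let Some(existing) = self.limiter_map.read().unwrap().get(&key) {
            return existing.clone();
        }
        // Slow path: take the write lock; entry() re-checks the key so
        // that two racing creators don't clobber each other
        let mut map = self.limiter_map.write().unwrap();
        map.entry(key)
            .or_insert_with(|| Arc::new(new_limiter()))
            .clone()
    }
}
```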
But this is immediately problematic:
- Currently, the `leaky_bucket` crate has no concept of "immediate fail" - it always enqueues all requests. This could be okay, but it's not what I wrote above
- We do need some way to de-allocate unused limiters, either on an age basis, or on a "least recently used" basis in case we end up with a lot in a very short time (think scraping, or something that hits a ton of unique endpoints) - a culling sketch follows below
- I'm unsure how we should handle the "temporal differences" in sub-matches
- If we get a token for one match now, but the next rule doesn't resolve for 30s, does that make sense?
- If we get a token for one match now, but then the next is "immediate fail", we don't get credit back for the first.
It also feels like a lot of locking and allocation. This is probably unavoidable, but it makes me a little nervous about it being a DoS vector.
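As a sketch of the culling idea, using the `last_used` field from above (the time source and threshold are placeholders):

```rust
use std::sync::atomic::Ordering;

/// Sketch: periodically drop limiters that haven't been used recently.
/// `now_unix` would come from whatever clock updates `last_used`.
fn cull_idle(ctx: &Context, now_unix: u64, max_idle_secs: u64) {
    let mut map = ctx.limiter_map.write().unwrap();
    map.retain(|_key, payload| {
        let last = payload.last_used.load(Ordering::Relaxed);
        now_unix.saturating_sub(last) <= max_idle_secs
    });
}
```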
Closing this as completed in #67