memorysafety / river

This repository is the home of the River reverse proxy application, based on the pingora library from Cloudflare.

Home Page: https://www.memorysafety.org/initiative/reverse-proxy/

Repository from GitHub: https://github.com/memorysafety/river

Notes for design: Rate Limiting

jamesmunns opened this issue

As part of the current milestone, we'd like to add basic rate limiting options.

https://blog.nginx.org/blog/rate-limiting-nginx is a good basic read on how NGINX does this.

There are a couple of concepts we need to define:

  • The "key" - or what we use to match a request to the rate limiting rule
    • Could be source IP, request URI, or other items
    • TODO: This is probably another area where we want to define a common "format-like" syntax, also needed for load balancing with hash algorithms, so you can specify things like $src_ip/$uri or something
    • TODO: We probably need to define how to handle when a request could potentially match two keys, but PROBABLY the answer is "rules are applied in the order they are applicable, first match wins" - nginx applies all and takes the most restrictive result!
  • The "rule" - or what policies we follow to decide what to do with each request. The three outcomes of a rule are:
    • Forward immediately - allow the request to continue immediately
    • Delay - don't serve the request immediately, but wait for some amount of time (more on this later) before forwarding
    • Reject - Immediately respond with a 503 or similar "too much" error response
  • The "Rate", which actually has multiple components, if we are implementing a leaky bucket style of rate limiting (what NGINX does)
    • The "active in-flight" count - how many outstanding forwarded requests can we have at one time?
    • The "delayed in-flight" count - how many requests will we hold on to at one time before rejecting?
    • The "delay to active promotion rate" - how often do we "pop" a delayed request to the "active" queue?
    • NOTE: These are SLIGHTLY different than the rate, burst, and delay terms from nginx!

Unlike NGINX, I don't think we currently need to consider the "zone" or "scope" of the rules - I intend for rate limiting to be per-service, which means that the "zone" is essentially the tuple of (service, key).
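
To make those concepts a bit more concrete, here is a rough sketch of the kinds of types they imply. All names here are illustrative only, not a committed design or config format:

use std::time::Duration;

/// What we match a request against to find its rate limiting bucket (the "key")
enum KeySource {
    SourceIp,
    RequestUri,
    /// A format-like syntax, e.g. "$src_ip/$uri"
    Format(String),
}

/// The three possible outcomes of applying a rule to a request
enum Outcome {
    /// Allow the request to continue immediately
    Forward,
    /// Hold the request for some amount of time before forwarding
    Delay,
    /// Immediately respond with a 503 or similar "too much" error
    Reject,
}

/// The "rate", for a leaky bucket style of limiter
struct Rate {
    /// How many outstanding forwarded requests can be in flight at one time
    max_active_in_flight: usize,
    /// How many delayed requests we will hold before rejecting
    max_delayed_in_flight: usize,
    /// How often a delayed request is "popped" to the active queue
    promotion_interval: Duration,
}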

I have checked with @eaufavor, and the correct way to handle delayed requests in pingora is to have the task yield (for a timer, or activation queue, etc).
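
A minimal sketch of what that yielding looks like, assuming the tokio runtime that pingora is built on (the function name and duration here are illustrative only):

// Illustrative: a delayed request is handled by having its task yield on a
// timer, rather than blocking the worker thread.
async fn delay_then_forward(delay: std::time::Duration) {
    // Yield to the runtime until the delay elapses...
    tokio::time::sleep(delay).await;
    // ...then continue forwarding the request upstream as usual.
}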

Maybe you are also interested in looking into Introduction to Traffic Shaping Using HAProxy, which has similar features to nginx.

Traffic shaping has been available since HAProxy 2.7: https://www.haproxy.com/blog/announcing-haproxy-2-7

@git001 thanks for the pointer, I'll check it out!

An additional detail that was surfaced during discussion was that there are two main "orientations" for rate limiting:

  • We want to limit the downstream connections, in order to prevent individual peers from making excess requests
  • We want to limit the upstream connections, to prevent individual proxied servers from being overwhelmed

This could maybe be implemented in a way that is transparent: if we use the proposed formatting key for matching, each rule could be matched against independently. For example, if we have three "rules":

  • src -> The source IP of the downstream requestor
  • uri -> The request URI path
  • dst -> The selected upstream path

In this case, I think that all connections would need to obtain a token from ALL THREE to proceed.

So let's say that downstream 10.0.0.1 wants to request /api/whatever, and 192.0.0.1 is selected as the upstream.
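
For that example, the three independent limiter keys the request would have to clear might look roughly like this (the key formatting here is illustrative only):

// Illustrative only: the three keys this example request must acquire tokens for
let src_key = "src:10.0.0.1";      // downstream source IP
let uri_key = "uri:/api/whatever"; // request URI path
let dst_key = "dst:192.0.0.1";     // selected upstream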

For each of these keys, we'd need to:

  • Make sure that the key exists in the global table of rate limiters
    • If the key does not exist, create it
  • Get a handle for each of the rate limiters
  • Attempt to enqueue ourselves in the list for each:
    • If any of them are "overfilled", then immediately bail
    • Otherwise begin awaiting on all rate limiters
  • Once all rate limiting is complete, then issue the request

My initial thought for this is to use something like the leaky_bucket crate, and place the rate limiting table in something like:

use std::collections::BTreeMap;
use std::net::Ipv4Addr;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, RwLock};

use leaky_bucket::RateLimiter;

// Things that can be rate limiting keys (Ord so they can be BTreeMap keys)
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum Key {
    Ipv4(Ipv4Addr),
    Format(String),
    // ...
}

struct LimiterPayload {
    // Some kind of record of last use? For culling?
    last_used: AtomicU64,
    // The actual rate limiter
    limiter: RateLimiter,
}

struct Context {
    limiter_map: RwLock<BTreeMap<Key, Arc<LimiterPayload>>>,
}
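
A rough sketch of what get_or_create could look like over that map, using the "optimistic read lock, then upgrade to a write lock" approach (new_payload is a hypothetical constructor for the per-key limiter, not an existing API):

impl Context {
    fn get_or_create(&self, key: Key) -> Arc<LimiterPayload> {
        // Fast path: the limiter already exists, only a read lock is needed
        if let Some(existing) = self.limiter_map.read().unwrap().get(&key) {
            return Arc::clone(existing);
        }
        // Slow path: take the write lock. Another task may have inserted the key
        // between the two locks, so entry() re-checks and only inserts if missing.
        let mut map = self.limiter_map.write().unwrap();
        Arc::clone(map.entry(key).or_insert_with(|| Arc::new(new_payload())))
    }
}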

The behavior would have to be something like:

let mut limiters = vec![];
for rule in self.rules.iter() {
    // Each rule generates its own key from the request context
    let key = rule.generate_key(&request_context);
    // This probably could optimistically get a read lock, then upgrade to a write lock
    // if the key does not exist
    let limiter = service_context.get_or_create(key);
    limiters.push(limiter);
}

let mut pending_tokens = vec![];
for limiter in limiters {
    match limiter.get_token() {
        Ok(None) => {}, // immediately allowed
        Ok(Some(fut)) => {
            // Not immediately allowed, hold the future and await it below
            pending_tokens.push(fut);
        }
        Err(_e) => {
            // Too full, immediately reject with a 503 or similar
            return Err(503);
        }
    }
}

// FuturesUnordered, join_all, or something similar
futures::future::join_all(pending_tokens).await;

Ok(...)

But this is immediately problematic:

  • Currently, the leaky_bucket crate has no concept of "immediate fail" - it always enqueues all requests. This could be okay, but it's not what I wrote above
  • We do need some way to de-allocate unused limiters, either at some age basis, or on a "least recently used" basis in case we end up with a lot in a very short time (think scraping or something that hits a ton of unique endpoints)
  • I'm unsure how we should handle the "temporal differences" in sub-matches
    • If we get a token for one match now, but the next rule doesn't resolve for 30s, does that make sense?
    • If we get a token for one match now, but then the next is "immediate fail", we don't get credit back for the first.

It also feels like a lot of locking and allocation. This is probably unavoidable, but it makes me a little nervous about being a DoS vector.
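
On the de-allocation point above, one option for the age-based variant is a periodic sweep keyed off of last_used. A rough sketch, where now_unix_secs is a hypothetical "seconds since the epoch" helper and the idle threshold is an assumption:

use std::sync::atomic::Ordering;

// Illustrative sweep: drop any limiter that has not been used for max_idle_secs.
fn cull_idle(ctx: &Context, max_idle_secs: u64) {
    let now = now_unix_secs();
    let mut map = ctx.limiter_map.write().unwrap();
    map.retain(|_key, payload| {
        let last = payload.last_used.load(Ordering::Relaxed);
        now.saturating_sub(last) <= max_idle_secs
    });
}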

Closing this as completed in #67