
kansas

kansas is a custom reverse proxy used internally as part of Zulip. It sits between the nginx load-balancer and the Tornado-based real-time event system, forwarding connections from clients to their appropriate back-end shards. It is not necessary unless there is more than one Tornado shard.

Constraints

  • The database is authoritative for where a realm's queues are.
  • Durability
    • A failure of the service should not result in dropping all queues on the floor.
    • Memcached is not durable, so it is not enough to satisfy the previous criterion.
    • The default codepath should not require a database access.
  • We may want a concept of a "default" assignment which uses consistent hashing, so that we do not have to make an explicit choice about which backend to use (see the sketch after this list).
    • However, not all realms are the same size; it is useful to be able to pin some large realms to be on different shards from each other.
    • This also allows us to slowly shift load off of a shard before shutting it down, in order to not shock-load the shards by suddenly moving a large number of queues.
  • Given that we want to be able to move queues between shards:
    • We generally want all queues for a realm on the same shard, since that makes for a better common experience in case of shard outage, and minimizes messages on the RabbitMQ (or eventually PostgreSQL) bus.
    • While sharding is configured at the realm level, it happens at the queue level -- which means that moving a realm requires locking and serializing all of its queues, or relaxing the constraint that a realm is only ever on one shard.
    • Enabling realms to move piecewise, and not atomically, also allows us to spread the load of queue serialization across the backend Tornado instances, at the cost of one extra message on the RabbitMQ bus per shard that the realm is on. This is not a high cost.
    • The durable storage in the database knows nothing about queue-ids, however, which means that moves have to happen at the granularity of users, at the very least.
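
As a rough illustration of the default-assignment idea, here is a minimal sketch of a consistent-hash ring over shard ports, with explicit per-realm pins (from the database) overriding it. The names and types are illustrative, not taken from the actual kansas code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

struct ShardMap {
    ring: BTreeMap<u64, u16>, // hash point -> shard port
    pins: HashMap<String, u16>, // realm -> explicitly pinned shard, from the database
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

impl ShardMap {
    fn new(shards: &[u16]) -> Self {
        let mut ring = BTreeMap::new();
        for &port in shards {
            // Several virtual nodes per shard smooth the distribution.
            for vnode in 0..64u32 {
                ring.insert(hash_of(&(port, vnode)), port);
            }
        }
        ShardMap { ring, pins: HashMap::new() }
    }

    /// An explicit pin wins; otherwise walk the ring clockwise from the
    /// realm's hash point, wrapping around at the end.
    fn shard_for(&self, realm: &str) -> u16 {
        if let Some(&port) = self.pins.get(realm) {
            return port;
        }
        let h = hash_of(&realm);
        self.ring
            .range(h..)
            .next()
            .or_else(|| self.ring.iter().next())
            .map(|(_, &port)| port)
            .expect("ring is non-empty")
    }
}
```

Pinning a large realm is then just an entry in `pins`; removing a shard's ring entries while keeping its pins lets load drift away gradually rather than all at once.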

Queue moves

Queue moves with page reloads

  • Commit change to the database
  • Respond 400 with BAD_EVENT_QUEUE_ID to anything which comes in which would have gone to the old shard
    • This invalidation can be done either at the kansas level (sketched below), or by enqueuing something to the old Tornado shard which tells it to drop the relevant queues.
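
A minimal sketch of what the kansas-level variant could look like. The `Invalidator` type is illustrative, and the response body merely mimics the shape of Tornado's BAD_EVENT_QUEUE_ID error rather than being taken from the actual code:

```rust
use std::collections::HashSet;

struct Invalidator {
    // Queue-ids whose realm has been committed to a new shard.
    moved_queues: HashSet<String>,
}

impl Invalidator {
    /// Returns Some((status, body)) if the request must be rejected,
    /// forcing the client to reload and re-register its queue.
    fn check(&self, queue_id: &str) -> Option<(u16, &'static str)> {
        if self.moved_queues.contains(queue_id) {
            Some((400, r#"{"result":"error","code":"BAD_EVENT_QUEUE_ID"}"#))
        } else {
            None
        }
    }
}
```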

Transparent queue moves

  • Create queues in new shard
  • Commit change to database
  • Ensure nothing is writing to old shards using stale database data (..somehow?)
  • Ensure old shard has consumed all of the data
  • Pause requests from clients
  • Serialize queues in old shard
  • Send to new shard
  • Merge the serialized data with anything which has already been sent to the queue (see the sketch below)
  • Resume requests, but to the new shard
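
The merge step is the subtle part. Here is a minimal sketch of one way to do it, assuming event ids are monotonically increasing per queue so duplicates can be dropped by id; the `Event` type is illustrative:

```rust
use std::collections::VecDeque;

struct Event {
    id: u64,
    payload: String,
}

/// Combine the history serialized from the old shard with events the
/// new shard has already accepted, deduplicating by event id.
fn merge_queue(from_old_shard: Vec<Event>, queue_on_new: &mut VecDeque<Event>) {
    let newest_old = from_old_shard.last().map(|e| e.id);
    // Keep only events the old shard had not already seen...
    let fresh: Vec<Event> = queue_on_new
        .drain(..)
        .filter(|e| Some(e.id) > newest_old)
        .collect();
    // ...and put the serialized history in front of them.
    queue_on_new.extend(from_old_shard);
    queue_on_new.extend(fresh);
}
```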

Restarts of Kansas without connection drops

  • Open a UNIX socket to transfer state (the fd-passing steps are sketched after this list)
  • Fork and exec the new binary
  • Old process writes the listen socket (port 9799) into the UNIX socket
  • New process reads the socket out and calls listen() but not accept()
  • New process sends a signal (e.g. USR1) to the old one, indicating that it has started and is ready
  • Old process closes the listen socket
  • Old process serializes its internal state to the UNIX socket
  • New process reads the state
  • New process starts the accept() calls on the listen socket
  • Old process finishes any in-flight requests
  • Old process exits gracefully
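
A minimal sketch of the fd handoff, using the nix crate's SCM_RIGHTS support (API as of roughly nix 0.26); the uds crate mentioned in the TODOs offers similar functionality. Error handling is pared down to the essentials:

```rust
use std::io::{IoSlice, IoSliceMut};
use std::net::TcpListener;
use std::os::unix::io::{AsRawFd, FromRawFd, RawFd};
use std::os::unix::net::UnixStream;

use nix::cmsg_space;
use nix::sys::socket::{
    recvmsg, sendmsg, ControlMessage, ControlMessageOwned, MsgFlags, UnixAddr,
};

/// Old process: send the listener's fd across the state-transfer socket.
fn send_listener(chan: &UnixStream, listener: &TcpListener) -> nix::Result<()> {
    let fds = [listener.as_raw_fd()];
    let cmsg = [ControlMessage::ScmRights(&fds)];
    // SCM_RIGHTS must ride along with at least one byte of real data.
    let iov = [IoSlice::new(b"L")];
    sendmsg::<UnixAddr>(chan.as_raw_fd(), &iov, &cmsg, MsgFlags::empty(), None)?;
    Ok(())
}

/// New process: receive the fd and rebuild the listener without re-binding.
fn recv_listener(chan: &UnixStream) -> nix::Result<TcpListener> {
    let mut buf = [0u8; 1];
    let mut iov = [IoSliceMut::new(&mut buf)];
    let mut cmsg_buf = cmsg_space!([RawFd; 1]);
    let msg = recvmsg::<UnixAddr>(
        chan.as_raw_fd(),
        &mut iov,
        Some(&mut cmsg_buf),
        MsgFlags::empty(),
    )?;
    for cmsg in msg.cmsgs() {
        if let ControlMessageOwned::ScmRights(fds) = cmsg {
            // SAFETY: we now own the freshly-received descriptor.
            return Ok(unsafe { TcpListener::from_raw_fd(fds[0]) });
        }
    }
    Err(nix::errno::Errno::EBADMSG)
}
```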

Supervisor won't like this, because it wants to be the parent PID of what it monitors. We can work around that by making the first process supervisor runs be the one in charge of mediating all of the above actions, so that it never exits (unless its one child exits) and is the parent of the actual kansas processes which do request handling. A minimal mediator is sketched below.

Additionally, supervisorctl stop kansas will always do a full stop/start. As of Supervisor 3.2.0, however, you can use supervisorctl signal SIGUSR1 or the like.
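
A minimal sketch of such a mediator, assuming the signal_hook crate and a hypothetical binary path; a real mediator would also need to notice an unexpected child exit and exit itself so supervisor can react:

```rust
use std::process::{Child, Command};

use signal_hook::{consts::SIGUSR1, iterator::Signals};

fn main() -> std::io::Result<()> {
    // Hypothetical path to the real request-handling binary.
    let kansas = "/usr/local/bin/kansas";
    let mut child: Child = Command::new(kansas).spawn()?;
    let mut signals = Signals::new([SIGUSR1])?;
    // On each SIGUSR1, start a replacement child; the two children then
    // perform the socket and state handoff described above, and the old
    // one exits on its own once its in-flight requests drain.
    for _ in signals.forever() {
        let replacement = Command::new(kansas).spawn()?;
        let _ = child.wait()?; // reap the old process after it exits gracefully
        child = replacement;
    }
    Ok(())
}
```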

Restarts of Tornado without connection drops

  • On connection close mid-request from Tornado, or on a new request arriving while Tornado is dead:
  • Respond 502 to anything that's not a GET (e.g. it is not safe to manufacture a response to a DELETE)
  • Add the socket to a waiting queue for that shard (see the sketch after this list)
  • Wait until the backend comes back, or 50s, whichever is earlier
  • Respond with an empty response, or perhaps some synthetic "waiting" response if we hit the 50s limit
  • Rate-limit before popping the next one off, to spread the request load
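
A minimal sketch of the parking logic, assuming a tokio-based proxy with one Notify per shard that is signaled when the backend becomes reachable again; the string return values stand in for real responses:

```rust
use std::sync::Arc;
use std::time::Duration;

use tokio::sync::Notify;
use tokio::time::{sleep, timeout};

/// Park a GET longpoll request until its shard returns, or 50s passes.
async fn park_get_request(backend_up: Arc<Notify>) -> &'static str {
    match timeout(Duration::from_secs(50), backend_up.notified()).await {
        // Backend returned: pause briefly before retrying, so the herd
        // of parked clients does not shock-load the fresh backend.
        Ok(()) => {
            sleep(Duration::from_millis(10)).await;
            "retry against backend"
        }
        // 50s elapsed: hand the client an empty event response so it
        // simply polls again on its own schedule.
        Err(_elapsed) => "respond with empty events",
    }
}
```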

TODO

  • Load-test
  • Garbage-collection of queues after 10 minutes, to match Tornado
  • Persist the listen socket, using https://github.com/tormol/uds
  • Persist the internal state
    • send the in-memory store using serde over a socket (sketched after this list)
    • write into a local sqlite database
    • store in redis
    • just talk to postgres directly
  • Paper over Tornado restarts
  • Request tracing?
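
For the serde-over-a-socket option, a minimal sketch assuming bincode as the wire format; the QueueState shape is purely illustrative:

```rust
use std::collections::HashMap;
use std::os::unix::net::UnixStream;

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct QueueState {
    // Hypothetical shape: queue-id -> assigned shard port.
    assignments: HashMap<String, u16>,
}

/// Old process: stream the whole in-memory store down the socket.
fn send_state(sock: &mut UnixStream, state: &QueueState) -> bincode::Result<()> {
    bincode::serialize_into(sock, state)
}

/// New process: rebuild the store from the socket.
fn recv_state(sock: &mut UnixStream) -> bincode::Result<QueueState> {
    bincode::deserialize_from(sock)
}
```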
