neondatabase / autoscaling

Postgres vertical autoscaling in k8s

Bug: scheduler's critically delayed event processing has positive feedback loop

sharnoff opened this issue

Environment

Prod (occurred twice recently)

Steps to reproduce

Hard to reproduce locally; it requires odd circumstances under heavy load. AFAICT we've only seen it occur at startup, probably because that's when the most activity is happening.

The general idea of the triggering behavior is this:

  1. For some reason, event processing in the scheduler gets delayed (whether because of weird behavior at startup, or some other failure mode)
  2. Some VM pod start events are delayed long enough that the VMs are deleted before the events are handled
  3. While handling those events:
    1. We don't see the VM in the local VM store (because it no longer exists)
    2. Thinking the store is out of date, we Relist(); this can take ~2s
    3. After relisting, the VM is still not in the store, so we return an error
  4. Because relisting takes so long, more events get delayed, so we handle more pod start events after their VMs were deleted, causing even more delays

Note that this is also possible because we only handle a single event at a time, so spending ~2s on one event holds up the entire queue (sketched below).

(Originally, that was to avoid having to be careful about out-of-order start/stop events; there are ways around this, though.)
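Here's a minimal Go sketch of the pattern described above. All names (vmStore, handlePodStart, the map-backed cache) are illustrative assumptions, not the scheduler's actual API; the point is just the serial handler with a blocking relist fallback:

```go
package main

import (
	"fmt"
	"time"
)

// vmStore is a stand-in for the scheduler's local cache of VMs.
// All names here are illustrative assumptions, not the real API.
type vmStore struct {
	vms map[string]struct{}
}

// get reports whether the VM backing the pod is in the local cache.
func (s *vmStore) get(pod string) bool {
	_, ok := s.vms[pod]
	return ok
}

// relist rebuilds the cache from the API server; under load this can take ~2s.
func (s *vmStore) relist() {
	time.Sleep(2 * time.Second) // stand-in for the expensive API call
}

// handlePodStart runs once per pod-start event, one event at a time.
func handlePodStart(s *vmStore, pod string) error {
	if s.get(pod) {
		return nil // normal path: the VM is known
	}
	// VM not in the store: assume the cache is stale and relist. If the
	// VM was already deleted, this blocks the whole queue for ~2s and
	// still fails afterwards, delaying every event behind it.
	s.relist()
	if !s.get(pod) {
		return fmt.Errorf("pod %s: VM not found even after relist", pod)
	}
	return nil
}

func main() {
	s := &vmStore{vms: map[string]struct{}{}}
	// Two start events for already-deleted VMs: each costs ~2s, serially.
	for _, pod := range []string{"vm-pod-a", "vm-pod-b"} {
		start := time.Now()
		err := handlePodStart(s, pod)
		fmt.Printf("%s: err=%v, took %s\n", pod, err, time.Since(start).Round(time.Second))
	}
}
```

Running this, each event for an already-deleted VM costs the full ~2s before the next event is even looked at, which is the queueing behavior behind the feedback loop.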

Other logs, links

Tasks

  1. c/autoscaling/neonvm, c/autoscaling/scheduler, t/feature (assignee: Omrigan)

Skipping duplicate Relist()s is somewhat complex, but is provably a solution here. Processing events in parallel allows us to make the problem small enough that we never reach the critical threshold of a positive feedback loop.
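As a rough illustration of the first option, one simple policy is to skip a relist when the store was already rebuilt moments ago, since a back-to-back relist can't make the store any fresher. This is a hedged sketch of that idea only; the names and the time-based policy are assumptions, not necessarily what #863 implements:

```go
package main

import (
	"fmt"
	"time"
)

// vmStore with a time-based guard against duplicate relists. The policy
// and names here are assumptions for illustration, not the actual fix.
type vmStore struct {
	lastRelist time.Time
	minGap     time.Duration // never relist more often than this
}

// maybeRelist relists only if enough time has passed since the previous
// relist; a burst of misses for deleted VMs then costs ~2s once, not once
// per event.
func (s *vmStore) maybeRelist() bool {
	if time.Since(s.lastRelist) < s.minGap {
		return false // recent relist: the store is already as fresh as it gets
	}
	time.Sleep(2 * time.Second) // stand-in for the expensive relist
	s.lastRelist = time.Now()
	return true
}

func main() {
	s := &vmStore{minGap: 5 * time.Second}
	// Three back-to-back events hit the missing-VM path; only the first
	// actually relists.
	for i := 0; i < 3; i++ {
		start := time.Now()
		did := s.maybeRelist()
		fmt.Printf("event %d: relisted=%v, took %s\n", i, did, time.Since(start).Round(time.Second))
	}
}
```

With minGap set above the typical burst length, a run of stale pod-start events pays the ~2s cost once instead of once per event, keeping the queue delay bounded.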

Partial recurrence here, I think: https://neondb.slack.com/archives/C03F5SM1N02/p1710952126841459

(delayed event handling, but not critically so)

Assigning @Omrigan and removing myself to reflect that remaining work will be via #863 (and #865), rather than #853.

Done via #863