Bug: scheduler's critically delayed event processing has positive feedback loop
sharnoff opened this issue
Environment
Prod (occurred twice recently)
Steps to reproduce
Hard to reproduce locally - requires odd circumstances under a lot of load. AFAICT we've only seen it occur at startup — probably because that's when there's the most stuff going on.
The general idea of the triggering behavior is this:
- For some reason, event processing in the scheduler gets delayed (whether that's because of weird behavior at startup, or some other failure mode)
- Some VM pod start events are delayed enough that the VMs are deleted before the events are handled
- While handling those events:
  - We don't see the VM in the local VM store (because it no longer exists)
  - Thinking the store is out of date, we `Relist()`, which can take ~2s
  - After relisting, the VM is still not in the store, so we return an error
- Because relisting takes so long, more events get delayed, so we handle more pod start events after their VMs were deleted, which causes even more delays
Note that this is also because we only handle a single event at a time, so spending 2s handling one event holds up the entire queue.
(Originally, that design was chosen because otherwise we'd have to be careful to avoid out-of-order start/stop events; there are ways around this, though.)
Other logs, links
Tasks
- Skipping duplicate `Relist()`s is somewhat complex, but is probably a solution here.
- Processing events in parallel allows us to make the problem small enough that we never reach the critical threshold of a positive feedback loop.
Partial recurrence here, I think: https://neondb.slack.com/archives/C03F5SM1N02/p1710952126841459
(delayed event handling, but not critically so)
Done via #863