Bug: scheduler's critically delayed event processing has positive feedback loop
sharnoff opened this issue
Environment
Prod (occurred twice recently)
Steps to reproduce
Hard to reproduce locally - requires odd circumstances under a lot of load. AFAICT we've only seen it occur at startup — probably because that's when there's the most stuff going on.
The general idea of the triggering behavior is this:
- For some reason, event processing in the scheduler gets delayed (whether that's because of weird behavior at startup, or some other failure mode)
- Some VM pod start events are delayed enough that the VMs are deleted before the events are handled
- While handling those events:
  - We don't see the VM in the local VM store (because it no longer exists)
  - Thinking the store is out of date, we `Relist()`, which can take ~2s
  - After relisting, the VM is still not in the store, so we return an error
- Because relisting takes so long, more events get delayed, so we handle more pod start events after their VMs were deleted, which causes even more delays
Note that this is also because we only handle a single event at a time, so spending 2s handling one event holds up the entire queue.
(Originally, that design was chosen because otherwise we'd have to be careful to avoid out-of-order start/stop events; there are ways around this, though.)
Other logs, links
Tasks
- Skipping duplicate `Relist()`s is somewhat complex, but is probably a solution here.
- Processing events in parallel allows us to make the problem small enough that we never reach the critical threshold of a positive feedback loop.
Partial recurrence here, I think: https://neondb.slack.com/archives/C03F5SM1N02/p1710952126841459
(delayed event handling, but not critically so)
Done via #863