neondatabase / autoscaling

Postgres vertical autoscaling in k8s

neonvm: separate maps / counts for failure vs conflict

Omrigan opened this issue

This might help debug cases where a lot of VMs are failing to reconcile, although it is unclear whether repeated conflicts for the same VM are a likely failure scenario.

Originally posted by @sharnoff in #920 (comment)

To add onto this: I think this would particularly help with making our alerting more sensitive. Having 10 minutes of >1 VM failing to reconcile may be expected, as there's always something affected by conflicts; but having 10 minutes of >1 VM truly failing may not be.

Alternatively, something I'd discussed as part of #757 is that we may be better off having metrics like "number of VMs failing reconcile for N seconds" or something similar. That's probably much easier to build higher-quality alerting on than the binary "is it stuck" gauge we currently have.