neonvm: separate maps / counts for failure vs conflict
Omrigan opened this issue · comments
This might help debug cases where a lot of VMs are failing to reconcile, although it is unclear whether repeated conflicts for the same VM are a likely failure scenario.
Originally posted by @sharnoff in #920 (comment)
To add onto this: in particular, I think this would help make our alerting more sensitive. Having 10 minutes of >1 VM failing to reconcile may be expected, as there's always something affected by conflicts; but having 10 minutes of >1 VM truly failing may not be.
Alternatively, something I'd discussed as part of #757 is that we may be better off having metrics like "number of VMs failing reconcile for N seconds" or similar. That's probably much easier to build higher-quality alerting on than the binary "is it stuck" gauge we currently have.