The 'alertmanager_max_alerts_count' is not functioning properly
damnever opened this issue · comments
Describe the bug
Due to a race condition in Alertmanager, the alertmanager_max_alerts_count
limit does not function properly: the cortex_alertmanager_alerts_limiter_current_alerts
metric may keep increasing until alerts start being limited. To address this, I have sent a patch here: prometheus/alertmanager#3648
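For illustration only, here is a minimal sketch of how a separately maintained counter could drift upward. It assumes that two callers both report existing=false for the same fingerprint before either has stored the alert; the actual interleaving is the one described in the linked patch. The type counterLimiter and the function simulateDrift are hypothetical names, not Cortex or Alertmanager code, and the interleaving is replayed sequentially so the result is deterministic.

```go
package main

import "fmt"

// counterLimiter is a hypothetical, simplified stand-in for a limiter that
// tracks the number of alerts in a separate counter field.
type counterLimiter struct {
	sizes map[string]int // keyed by fingerprint (a plain string here for brevity)
	count int
}

// postStore bumps the counter whenever the caller says the alert is new,
// without checking whether the fingerprint is already tracked.
func (l *counterLimiter) postStore(fp string, size int, existing bool) {
	if !existing {
		l.count++
	}
	l.sizes[fp] = size
}

// postDelete forgets the alert and decrements the counter once.
func (l *counterLimiter) postDelete(fp string) {
	delete(l.sizes, fp)
	l.count--
}

// simulateDrift replays one assumed interleaving: two writers of the same
// alert both observed "not existing" before either of them stored it.
func simulateDrift() (count, stored int) {
	l := &counterLimiter{sizes: map[string]int{}}
	l.postStore("fp-1", 100, false) // writer A
	l.postStore("fp-1", 100, false) // writer B, acting on a stale existence check
	l.postDelete("fp-1")            // the alert is later removed exactly once
	return l.count, len(l.sizes)
}

func main() {
	count, stored := simulateDrift()
	// The counter stays at 1 although no alert is stored any more.
	fmt.Printf("count=%d stored=%d\n", count, stored)
}
```

Under these assumptions each such double store leaks one unit into the counter, which would match a current-alerts metric that only grows.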
To Reproduce
See prometheus/alertmanager#3648
Expected behavior
Additional Context
I have noticed that Alertmanager has now removed support for the v1 API. Perhaps Cortex should also consider deprecating its support for the v1 API.
The temporary fix is as follows:
--- a/pkg/alertmanager/alertmanager.go
+++ b/pkg/alertmanager/alertmanager.go
@@ -615,7 +615,6 @@ type alertsLimiter struct {
     mx sync.Mutex
     sizes map[model.Fingerprint]int
-    count int
     totalSize int
 }
@@ -664,7 +663,8 @@ func (a *alertsLimiter) PreStore(alert *types.Alert, existing bool) error {
     a.mx.Lock()
     defer a.mx.Unlock()
-    if !existing && countLimit > 0 && (a.count+1) > countLimit {
+    _, existing = a.sizes[fp]
+    if !existing && countLimit > 0 && len(a.sizes)+1 > countLimit {
         a.failureCounter.Inc()
         return fmt.Errorf(errTooManyAlerts, countLimit)
     }
@@ -692,11 +692,7 @@ func (a *alertsLimiter) PostStore(alert *types.Alert, existing bool) {
     a.mx.Lock()
     defer a.mx.Unlock()
-    if existing {
-        a.totalSize -= a.sizes[fp]
-    } else {
-        a.count++
-    }
+    a.totalSize -= a.sizes[fp]
     a.sizes[fp] = newSize
     a.totalSize += newSize
 }
@@ -713,14 +709,13 @@ func (a *alertsLimiter) PostDelete(alert *types.Alert) {
     a.totalSize -= a.sizes[fp]
     delete(a.sizes, fp)
-    a.count--
 }

 func (a *alertsLimiter) currentStats() (count, totalSize int) {
     a.mx.Lock()
     defer a.mx.Unlock()
-    return a.count, a.totalSize
+    return len(a.sizes), a.totalSize
 }
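The idea behind the diff above can be sketched in isolation: derive the alert count from the size map itself, so that storing the same fingerprint any number of times can never count it twice. The following is a simplified, hypothetical model (mapLimiter, storeConcurrently and the string fingerprints are illustrative names only), and it leaves out the size limit and the failure metric of the real limiter.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTooManyAlerts = errors.New("too many alerts")

// mapLimiter is a hypothetical limiter whose count is always len(sizes).
type mapLimiter struct {
	mx        sync.Mutex
	sizes     map[string]int // keyed by fingerprint
	totalSize int
	maxCount  int // 0 disables the count limit
}

func newMapLimiter(maxCount int) *mapLimiter {
	return &mapLimiter{sizes: map[string]int{}, maxCount: maxCount}
}

// preStore decides "existing" from the limiter's own map rather than
// trusting a flag computed by the caller.
func (l *mapLimiter) preStore(fp string) error {
	l.mx.Lock()
	defer l.mx.Unlock()
	_, existing := l.sizes[fp]
	if !existing && l.maxCount > 0 && len(l.sizes)+1 > l.maxCount {
		return errTooManyAlerts
	}
	return nil
}

// postStore is idempotent per fingerprint: the old size (zero if the
// fingerprint is unknown) is subtracted before the new one is added.
func (l *mapLimiter) postStore(fp string, newSize int) {
	l.mx.Lock()
	defer l.mx.Unlock()
	l.totalSize -= l.sizes[fp]
	l.sizes[fp] = newSize
	l.totalSize += newSize
}

func (l *mapLimiter) postDelete(fp string) {
	l.mx.Lock()
	defer l.mx.Unlock()
	l.totalSize -= l.sizes[fp]
	delete(l.sizes, fp)
}

func (l *mapLimiter) stats() (count, totalSize int) {
	l.mx.Lock()
	defer l.mx.Unlock()
	return len(l.sizes), l.totalSize
}

// storeConcurrently has many goroutines store the same alert at once.
func storeConcurrently(l *mapLimiter, fp string, size, writers int) {
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := l.preStore(fp); err == nil {
				l.postStore(fp, size)
			}
		}()
	}
	wg.Wait()
}

func main() {
	l := newMapLimiter(2)
	storeConcurrently(l, "fp-1", 100, 50)
	count, total := l.stats()
	fmt.Printf("count=%d totalSize=%d\n", count, total)
}
```

In this sketch the reported count cannot exceed the number of distinct fingerprints, regardless of how the writers interleave. It does not remove the underlying race in the caller, which is presumably why the diff is described as temporary.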
@qinxx108 @alvinlin123 Would you mind taking a look at this issue?
Hi @damnever, is the temporary fix just a cleanup, with the real fix being in the Prometheus repo?
@qinxx108 yes, the real fix is in the Prometheus repo.