cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

The `alertmanager_max_alerts_count` limit is not functioning properly

damnever opened this issue · comments

Describe the bug

Due to a race condition in Alertmanager, the `alertmanager_max_alerts_count` limit is not enforced correctly: the `cortex_alertmanager_alerts_limiter_current_alerts` metric may keep increasing until alerts become limited. To address this, I have sent a patch upstream: prometheus/alertmanager#3648
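To illustrate the failure mode, here is a minimal sketch (simplified from the `alertsLimiter` in `pkg/alertmanager/alertmanager.go`; names and signatures are abbreviated, not the actual Cortex code) of how tracking a separate `count` field can drift: if `PostStore` is invoked with a stale `existing=false` for a fingerprint that is already tracked, `count` is incremented again even though the map gains no new entry, so the reported alert count only ever grows.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified limiter keeping a separate counter alongside the size map.
type alertsLimiter struct {
	mx    sync.Mutex
	sizes map[uint64]int // alert fingerprint -> serialized size
	count int            // drifts when callers pass a stale `existing` flag
}

// PostStore records an alert after it was stored. The `existing` flag is
// computed by the caller *before* taking the lock, so two racing stores of
// the same alert can both observe existing=false.
func (a *alertsLimiter) PostStore(fp uint64, size int, existing bool) {
	a.mx.Lock()
	defer a.mx.Unlock()
	if !existing {
		a.count++ // double-counted under the race described above
	}
	a.sizes[fp] = size
}

func main() {
	a := &alertsLimiter{sizes: map[uint64]int{}}
	// Two racing stores of the same alert, both seeing existing=false:
	a.PostStore(1, 10, false)
	a.PostStore(1, 10, false)
	fmt.Println(a.count, len(a.sizes)) // counter says 2, map tracks 1 alert
}
```

Deriving the count from `len(a.sizes)` under the same lock, as the patch below does, removes the possibility of this drift, since the map insert is idempotent.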

To Reproduce

See prometheus/alertmanager#3648

Expected behavior

Additional Context

I have also noticed that Alertmanager has removed support for the v1 API upstream. Perhaps Cortex should consider deprecating its v1 API support as well.

The temporary fix is as follows:

```diff
--- a/pkg/alertmanager/alertmanager.go
+++ b/pkg/alertmanager/alertmanager.go
@@ -615,7 +615,6 @@ type alertsLimiter struct {
 
 	mx        sync.Mutex
 	sizes     map[model.Fingerprint]int
-	count     int
 	totalSize int
 }
 
@@ -664,7 +663,8 @@ func (a *alertsLimiter) PreStore(alert *types.Alert, existing bool) error {
 	a.mx.Lock()
 	defer a.mx.Unlock()
 
-	if !existing && countLimit > 0 && (a.count+1) > countLimit {
+	_, existing = a.sizes[fp]
+	if !existing && countLimit > 0 && len(a.sizes)+1 > countLimit {
 		a.failureCounter.Inc()
 		return fmt.Errorf(errTooManyAlerts, countLimit)
 	}
@@ -692,11 +692,7 @@ func (a *alertsLimiter) PostStore(alert *types.Alert, existing bool) {
 	a.mx.Lock()
 	defer a.mx.Unlock()
 
-	if existing {
-		a.totalSize -= a.sizes[fp]
-	} else {
-		a.count++
-	}
+	a.totalSize -= a.sizes[fp]
 	a.sizes[fp] = newSize
 	a.totalSize += newSize
 }
@@ -713,14 +709,13 @@ func (a *alertsLimiter) PostDelete(alert *types.Alert) {
 
 	a.totalSize -= a.sizes[fp]
 	delete(a.sizes, fp)
-	a.count--
 }
 
 func (a *alertsLimiter) currentStats() (count, totalSize int) {
 	a.mx.Lock()
 	defer a.mx.Unlock()
 
-	return a.count, a.totalSize
+	return len(a.sizes), a.totalSize
 }
```

@qinxx108 @alvinlin123 Would you mind taking a look at this issue?

Hi @damnever, is the temporary fix a cleanup, with the real fix being in the Prometheus repo?

@qinxx108 yes, the real fix is in the Prometheus repo.