cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

The "No metric name label" error doesn't specify the metric

jakubgs opened this issue

Describe the bug
I have started seeing this error on our distributors:

caller=logging.go:86 level=warn msg="POST /api/v1/push (500) 81.24787ms Response: \"No metric name label\\n\""

Which comes from: https://github.com/cortexproject/cortex/blob/v1.16.0/pkg/util/extract/extract.go#L13C37-L13C57

I have not yet identified what is causing it, because the error does not show which metric triggered it. This means I have to find it essentially through trial and error: removing services one by one and hoping I can identify which one is responsible.
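
For context, the linked check returns a fixed, package-level error value, so no per-series information ever reaches the log line. A paraphrased sketch of that pattern, assuming the cortexpb.LabelAdapter type from the Cortex codebase (not a verbatim copy of extract.go):

```go
package extract

import (
	"errors"

	"github.com/cortexproject/cortex/pkg/cortexpb"
	"github.com/prometheus/common/model"
)

// Sentinel error: the same value is returned for every failing series,
// which is why the log carries no hint about which metric was at fault.
var errNoMetricNameLabel = errors.New("No metric name label")

// MetricNameFromLabelAdapters returns the value of the __name__ label,
// or the context-free sentinel error if that label is absent.
func MetricNameFromLabelAdapters(labels []cortexpb.LabelAdapter) (string, error) {
	for _, l := range labels {
		if l.Name == model.MetricNameLabel { // "__name__"
			return l.Value, nil
		}
	}
	return "", errNoMetricNameLabel
}
```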

Expected behavior
The error indicates which metric caused it, allowing the administrator to fix the metric.

Environment:
Prometheus 2.50.1 sending to Cortex 1.16.0.

Also, notably, I stopped a Prometheus instance I suspected was causing this and the errors indeed stopped, but after I restarted that Prometheus instance the errors did not return, which makes no sense to me:

[screenshot: graph of the errors stopping after the restart]

Have you tried querying your Prometheus to find series with no metric name?
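
One way to run that check, sketched here with the Prometheus Go API client (github.com/prometheus/client_golang); the server address, the one-hour window, and the job!="" guard matcher are illustrative assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Match series with an empty __name__; the job!="" matcher keeps the
	// selector valid (at least one matcher must match a non-empty value).
	series, warnings, err := promAPI.Series(ctx,
		[]string{`{__name__="", job!=""}`},
		time.Now().Add(-time.Hour), time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	for _, s := range series {
		fmt.Println(s) // full label set of each nameless series
	}
}
```

The equivalent matcher can also be sent directly to Prometheus' /api/v1/series HTTP endpoint.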

The error indicates which metric caused it, allowing the administrator to fix the metric.

I think this action item is reasonable. We can try to add this. Help wanted.
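
A minimal sketch of what that could look like: replacing the sentinel from the earlier snippet with a formatted error that carries the offending label set. The message wording and the use of cortexpb.FromLabelAdaptersToLabels are assumptions, not an actual patch:

```go
package extract

import (
	"fmt"

	"github.com/cortexproject/cortex/pkg/cortexpb"
	"github.com/prometheus/common/model"
)

// Hypothetical change: instead of returning a shared sentinel, build an
// error that names the series which lacked __name__.
func MetricNameFromLabelAdapters(labels []cortexpb.LabelAdapter) (string, error) {
	for _, l := range labels {
		if l.Name == model.MetricNameLabel {
			return l.Value, nil
		}
	}
	return "", fmt.Errorf("no metric name label in series %s",
		cortexpb.FromLabelAdaptersToLabels(labels).String())
}
```

One caveat a real patch would need to handle: any code comparing against the old sentinel value (err == errNoMetricNameLabel) would have to switch to a typed error or errors.Is.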

If you look at my comment here:

What I've experienced is "No metric name label" errors when my cluster was close to dying due to network traffic issues.
After restarting Prometheus instances the error would go away for no good reason. I'm not sure what's happening, but it seems like high-stress situations trigger something that results in these errors. Maybe seeing the metric that causes it in the error message could point to why it only happens during periods of high cluster latency.