cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

Home Page: https://cortexmetrics.io/

The "No metric name label" error doesn't specify the metric

jakubgs opened this issue

Describe the bug
I have started seeing this error on our distributors:

caller=logging.go:86 level=warn msg="POST /api/v1/push (500) 81.24787ms Response: \"No metric name label\\n\""

Which comes from: https://github.com/cortexproject/cortex/blob/v1.16.0/pkg/util/extract/extract.go#L13C37-L13C57

I have not yet identified what is causing it, because the error does not show which metric triggered it. This means I have to find it essentially through trial and error: removing services one by one and hoping I can identify which one is responsible.
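
For context, the linked check returns a fixed, package-level error value, so no per-series information ever reaches the log line. A paraphrased sketch of that pattern, assuming the cortexpb.LabelAdapter type from the Cortex codebase (not a verbatim copy of extract.go):

```go
package extract

import (
	"errors"

	"github.com/cortexproject/cortex/pkg/cortexpb"
	"github.com/prometheus/common/model"
)

// Sentinel error: the same value is returned for every failing series,
// which is why the log carries no hint about which metric was at fault.
var errNoMetricNameLabel = errors.New("No metric name label")

// MetricNameFromLabelAdapters returns the value of the __name__ label,
// or the context-free sentinel error if that label is absent.
func MetricNameFromLabelAdapters(labels []cortexpb.LabelAdapter) (string, error) {
	for _, l := range labels {
		if l.Name == model.MetricNameLabel { // "__name__"
			return l.Value, nil
		}
	}
	return "", errNoMetricNameLabel
}
```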

Expected behavior
The error indicates which metric caused it, allowing the administrator to fix the metric.

Environment:
Prometheus 2.50.1 sending to Cortex 1.16.0.

Also, notably, I stopped a Prometheus instance I suspected was causing this and the errors indeed stopped, but after I restarted that Prometheus instance the errors did not return, which makes no sense to me:

[screenshot: graph of the errors stopping after the restart]

Have you tried querying your Prometheus to find series with no metric name?
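
One way to run that check, sketched here with the Prometheus Go API client (github.com/prometheus/client_golang); the server address, the one-hour window, and the job!="" guard matcher are illustrative assumptions:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Match series with an empty __name__; the job!="" matcher keeps the
	// selector valid (at least one matcher must match a non-empty value).
	series, warnings, err := promAPI.Series(ctx,
		[]string{`{__name__="", job!=""}`},
		time.Now().Add(-time.Hour), time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	for _, s := range series {
		fmt.Println(s) // full label set of each nameless series
	}
}
```

The equivalent matcher can also be sent directly to Prometheus' /api/v1/series HTTP endpoint.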

The error indicates which metric caused it, allowing the administrator to fix the metric.

I think this action item is reasonable. We can try to add this. Help wanted.
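
A minimal sketch of what that could look like: replacing the sentinel from the earlier snippet with a formatted error that carries the offending label set. The message wording and the use of cortexpb.FromLabelAdaptersToLabels are assumptions, not an actual patch:

```go
package extract

import (
	"fmt"

	"github.com/cortexproject/cortex/pkg/cortexpb"
	"github.com/prometheus/common/model"
)

// Hypothetical change: instead of returning a shared sentinel, build an
// error that names the series which lacked __name__.
func MetricNameFromLabelAdapters(labels []cortexpb.LabelAdapter) (string, error) {
	for _, l := range labels {
		if l.Name == model.MetricNameLabel {
			return l.Value, nil
		}
	}
	return "", fmt.Errorf("no metric name label in series %s",
		cortexpb.FromLabelAdaptersToLabels(labels).String())
}
```

One caveat a real patch would need to handle: any code comparing against the old sentinel value (err == errNoMetricNameLabel) would have to switch to a typed error or errors.Is.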

If you look at my comment here:

What I've experienced is "No metric name label" errors when my cluster was close to dying due to network traffic issues.
After restarting Prometheus instances the error would go away for no good reason. I'm not sure what's happening, but it seems like high-stress situations trigger something that results in these errors. Maybe seeing the metric that causes it in the error message could point to why it only happens during periods of high cluster latency.