seglo / kafka-lag-exporter

Monitor Kafka Consumer Group Latency with Kafka Lag Exporter

Lag reported as NaN for low volume topics

rkrage opened this issue

Not sure if this is expected behavior, but I've observed these NaN stats when a topic stops getting new messages and a consumer group is completely caught up. I'd expect the value to be zero in this case.

If it's helpful, I'm running 0.5.5 against Kafka 1.1.1

Might be related to #37

Might also be worth mentioning that if I restart the lag exporter, the value becomes zero.

Looking at the code, it seems like this shouldn't be happening:

https://github.com/lightbend/kafka-lag-exporter/blob/master/src/main/scala/com/lightbend/kafkalagexporter/LookupTable.scala#L76

It only returns TooFewPoints if there are fewer than two points in the lookup table. But it definitely seems like the table should contain at least two points for the same offset in this case:

https://github.com/lightbend/kafka-lag-exporter/blob/master/src/main/scala/com/lightbend/kafkalagexporter/LookupTable.scala#L33-L38

Is it possible we're hitting this case?

https://github.com/lightbend/kafka-lag-exporter/blob/master/src/main/scala/com/lightbend/kafkalagexporter/LookupTable.scala#L19-L20
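To make the "two points for the same offset" situation concrete, here is a minimal Python sketch. It is not the exporter's Scala code, and `interpolate_time` is a hypothetical name. It assumes the time estimate is a linear interpolation between two (offset, timestamp) points. Under that assumption, two points with the same offset give a zero denominator. With JVM doubles, 0.0 / 0.0 evaluates to NaN, so a table that is not "too few points" could still yield NaN.

```python
import math

def interpolate_time(p0, p1, offset):
    """Linearly interpolate the timestamp (ms) at `offset`
    between two (offset, timestamp_ms) points.

    Hypothetical helper for illustration only.
    """
    (x0, t0), (x1, t1) = p0, p1
    dx = x1 - x0
    if dx == 0:
        # JVM doubles evaluate 0.0 / 0.0 to NaN. Python raises
        # ZeroDivisionError instead, so NaN is returned explicitly here
        # to mimic that behavior.
        return math.nan
    return t0 + (offset - x0) * (t1 - t0) / dx

# Producer active: the latest offset advances between polls.
print(interpolate_time((100, 1_000), (200, 2_000), 150))  # 1500.0

# Producer stopped: both polls saw the same latest offset.
print(interpolate_time((200, 2_000), (200, 3_000), 200))  # nan
```

Whether the exporter actually reaches a degenerate case like this is exactly the open question here.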

Hi @rkrage. Thanks for the troubleshooting efforts. There are indeed some weird edge cases when extrapolating lag in time. It's been a challenge to satisfy all of them. The time metric kafka_consumergroup_group_lag_seconds will report NaN for several edge cases. Is this the metric you expect to see something different for? Or are you referring to the offset lag metric kafka_consumergroup_group_lag?

In either case, the best way to troubleshoot what's happening is to temporarily enable DEBUG logging so that you can see the raw group and offset metadata that Kafka Lag Exporter uses. See this comment for details on how to enable DEBUG.

#106 (comment)

Hi @seglo, thanks for the response. Yes, I'm referring to the time metric here kafka_consumergroup_group_lag_seconds. I turned on debug logging and dumped the output as well as the stat values in this gist: https://gist.github.com/rkrage/03c730718b6d33e3de70f8b3e24ce61c

As you can see, once I turn off my test producer, time lag drops to zero initially, then changes to NaN in the next poll iteration (and stays that way until I start producing messages again).