Lag reported as NaN for low volume topics
rkrage opened this issue
Not sure if this is expected behavior, but I've observed these NaN stats when a topic stops getting new messages and a consumer group is completely caught up. I'd expect the value to be zero in this case.
If it's helpful, I'm running 0.5.5 against Kafka 1.1.1.
Might be related to #37
Might also be worth mentioning that if I restart the lag exporter, the value becomes zero.
Looking at the code, it seems like this shouldn't be happening:
It only returns TooFewPoints if there are fewer than two points in the lookup table. But it definitely seems like the table should contain at least two points for the same offset in this case:
Is it possible we're hitting this case?
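For readers without the linked source at hand, the behavior discussed above can be illustrated roughly as follows. This is a minimal Python sketch, not the exporter's actual Scala code: the function lookup_time_ms, the (offset, timestamp) list, and the string form of TooFewPoints are all hypothetical, and returning NaN for a flat table is only one way a NaN could arise, not a confirmed diagnosis.

```python
import math

# Result name taken from the issue; representing it as a string is an assumption.
TOO_FEW_POINTS = "TooFewPoints"


def lookup_time_ms(points, offset):
    """Estimate the timestamp (ms) at which `offset` was produced.

    `points` is a list of (offset, timestamp_ms) pairs, oldest first.
    """
    if len(points) < 2:
        return TOO_FEW_POINTS

    (o1, t1), (o2, t2) = points[0], points[-1]
    if o2 == o1:
        # Flat table: every point has the same offset, so the slope's
        # denominator is zero. Python would raise ZeroDivisionError here,
        # while IEEE 754 arithmetic (as on the JVM) yields Infinity or NaN.
        # Returning NaN is this sketch's modelling choice for that case.
        return math.nan

    # Linear interpolation/extrapolation through the oldest and newest points.
    slope = (t2 - t1) / (o2 - o1)
    return t1 + slope * (offset - o1)
```

With points (100, 1000) and (200, 2000), looking up offset 150 gives 1500.0. With two points that share the same offset, which is what a stalled topic would produce under these assumptions, the sketch returns NaN rather than TooFewPoints.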
Hi @rkrage. Thanks for the troubleshooting efforts. There are indeed some weird edge cases when extrapolating lag in time, and it's been a challenge to satisfy all of them. The time metric kafka_consumergroup_group_lag_seconds will report NaN for several edge cases. Is this the metric you expect to see something different for? Or are you referring to the offset lag metric kafka_consumergroup_group_lag?
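To make the distinction between the two metrics concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not the exporter's implementation: the function names and parameters are hypothetical, and reporting NaN when no timestamp estimate is available is assumed as one example of an edge case.

```python
import math


def offset_lag(latest_offset, committed_offset):
    """Offset lag: how many messages the group is behind the end of the partition."""
    return latest_offset - committed_offset


def time_lag_seconds(latest_timestamp_ms, estimated_commit_timestamp_ms):
    """Time lag: how far behind the group is, in seconds.

    `estimated_commit_timestamp_ms` is the estimated time at which the
    group's committed offset was produced. It is None when no estimate
    could be made (an assumed edge case), in which case NaN is reported.
    """
    if estimated_commit_timestamp_ms is None:
        return math.nan
    return (latest_timestamp_ms - estimated_commit_timestamp_ms) / 1000.0
```

The offset lag needs only two offsets and is always well defined, whereas the time lag depends on an estimated timestamp, which is where edge cases can surface.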
In either case, the best way to troubleshoot what's happening is to temporarily enable DEBUG logging so that you can see the raw group and offset metadata Kafka Lag Exporter uses. See this comment for details on how to enable DEBUG.
Hi @seglo, thanks for the response. Yes, I'm referring to the time metric here, kafka_consumergroup_group_lag_seconds. I turned on debug logging and dumped the output as well as the stat values in this gist: https://gist.github.com/rkrage/03c730718b6d33e3de70f8b3e24ce61c
As you can see, once I turn off my test producer, time lag drops to zero initially, then changes to NaN in the next poll iteration (and stays that way until I start producing messages again).