andreas-schroeder / kafka-health-check

Health Check for Kafka Brokers.

incorrect in-sync replica count for broker health check topic

hokiegeek2 opened this issue

Getting the following error:
"producer failure - broker unhealthy: not enough in-sync replicas (19)"

However, kafka-topics.sh --describe shows that the topic has one in-sync replica, so this error condition looks incorrect. I am digging into the code, but I am wondering whether this is a ZooKeeper connection issue?

Hello,

thanks for reporting the issue.

Producing doesn't need a connection to ZooKeeper, so I would rule that out.

The log output you are seeing comes from here, and the "not enough in-sync replicas" from here via this line, meaning that the producer client got error code 19 back from the broker while trying to produce. Looking at the specs, this seems to be the right error code.

So I'd suggest finding out why the broker reports error code 19 back. Maybe the check is too eager and tries to health check before the topic is properly set up?
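To make the mapping concrete, here is a minimal sketch of how a client could turn the numeric error code into the message seen in the log. The two codes are the ones the Kafka protocol assigns to replica shortfalls; the helper `describeErrorCode` and the constant names are hypothetical and are not taken from optiopay/kafka or kafka-health-check.

```go
package main

import "fmt"

// Error codes as defined by the Kafka protocol.
// The constant names are chosen for this sketch only.
const (
	errNotEnoughReplicas            int16 = 19
	errNotEnoughReplicasAfterAppend int16 = 20
)

// describeErrorCode is a hypothetical helper that maps a broker error
// code to a human-readable message. It covers only the codes relevant here.
func describeErrorCode(code int16) string {
	switch code {
	case 0:
		return "no error"
	case errNotEnoughReplicas:
		return "not enough in-sync replicas"
	case errNotEnoughReplicasAfterAppend:
		return "not enough in-sync replicas after append"
	default:
		return fmt.Sprintf("unknown error code %d", code)
	}
}

func main() {
	code := int16(19)
	fmt.Printf("producer failure - broker unhealthy: %s (%d)\n", describeErrorCode(code), code)
}
```

The point of the sketch is that the message originates from the broker's response code rather than from anything the health check computes itself.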

By the way, which version of Kafka are you using?

Hi Andreas,

Good thought regarding a topic setup race condition, but this error is being thrown for a previously established Kafka topic.

Using Kafka 0.10.1.0 as of now.

I added some print statements when I was looking at this a few days ago, and the error is coming from the optiopay/kafka driver. There must be a configurable variable that can be set to fix this, as it appears to be a race condition of some sort. I've not revisited this in a while as I have other, more pressing tasks, but hope to do so at some point.

Ah, I think I know what it is. There is one partition that has four replicas; the rest have three. This is not a bug in kafka-health-check or optiopay/kafka, just a weird edge case.
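For anyone hitting something similar, a small sketch of how such an outlier could be spotted programmatically. It assumes the partition-to-replica assignment has already been collected (for example by reading the output of kafka-topics.sh --describe) into a map; the function `unevenPartitions` and the sample data are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"sort"
)

// unevenPartitions returns the IDs of partitions whose replica count
// differs from the most common replica count in the topic.
// The input maps partition ID to the broker IDs holding a replica.
func unevenPartitions(replicas map[int32][]int32) []int32 {
	// Count how many partitions have each replica count.
	counts := map[int]int{}
	for _, r := range replicas {
		counts[len(r)]++
	}

	// Pick the most common replica count; break ties toward the smaller one
	// so the result does not depend on map iteration order.
	common, best := 0, 0
	for n, c := range counts {
		if c > best || (c == best && n < common) {
			common, best = n, c
		}
	}

	// Collect partitions that deviate from the common count.
	var odd []int32
	for id, r := range replicas {
		if len(r) != common {
			odd = append(odd, id)
		}
	}
	sort.Slice(odd, func(i, j int) bool { return odd[i] < odd[j] })
	return odd
}

func main() {
	// Made-up assignment: partition 3 has four replicas, the rest have three.
	assignment := map[int32][]int32{
		0: {1, 2, 3},
		1: {2, 3, 4},
		2: {3, 4, 1},
		3: {1, 2, 3, 4},
	}
	fmt.Println("partitions with an unusual replica count:", unevenPartitions(assignment))
}
```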

Cool, glad that you found what caused your issue.

Btw, I've extended the compatibility checks to also include 0.10.1.0, and here are the results (spoiler alert: they pass :) )

I'll close this issue if that's okay - if you have any followup on this, reopen it & let me know :)

Yah, makes sense. Again, in any case, this is not a bug in this codebase.

And cool that the extended compatibility checks pass.