criteo / kafka-sharp

A C# Kafka driver

How to properly handle join group timeout

divan-9 opened this issue · comments

I have 3 Kafka brokers and 13 consumer groups listening to the same topic; each group consists of 5 consumers.

Sometimes, but not always, when I restart many services at once, I see a log message like this:

Some exception occured while trying to join group mygroup - System.TimeoutException

After that the consumer app does nothing, which is logical: it's not attached to the group. OK.

The bad news is that the other group participants do not acquire the abandoned partitions. It looks like the coordinator knows nothing about this error.

Could you please let me know:

  • How can I recover from this kind of exception? I don't see any sense in handling it; I'd rather let the application crash.

  • How can I control the value of this timeout? Is it the RebalanceTimeoutMs configuration property?

Thank you

First, you should ensure that your client-side timeout (ClientRequestTimeoutMs) is greater than both RebalanceTimeoutMs and SessionTimeoutMs. The error you're seeing means that the client has been waiting too long for a response from one of the brokers (this typically happens if there are a lot of rebalances at the same time). However, it should not break anything, even if the coordinator is not aware of the error, because the failing client will try to heartbeat with a bad configuration, which will trigger a rebalance.

If your client really stays stuck (even after a topology refresh), this means there may be a strange race condition which makes the client fail while having the correct generation from the brokers. I will try to reproduce the problem.
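
The timeout ordering advised above can be checked before the client is started. The following is only a rough sketch in Python, not kafka-sharp code: `check_timeouts` is a hypothetical helper, its parameters simply mirror the property names mentioned in this thread, and the values in the usage note are illustrative rather than library defaults.

```python
def check_timeouts(client_request_timeout_ms, rebalance_timeout_ms, session_timeout_ms):
    """Return a list of problems; an empty list means the ordering is as advised.

    Hypothetical helper: the parameter names mirror the configuration
    properties discussed in the thread, nothing here calls the driver.
    """
    problems = []
    if client_request_timeout_ms <= rebalance_timeout_ms:
        problems.append(
            "ClientRequestTimeoutMs (%d) should be greater than RebalanceTimeoutMs (%d)"
            % (client_request_timeout_ms, rebalance_timeout_ms)
        )
    if client_request_timeout_ms <= session_timeout_ms:
        problems.append(
            "ClientRequestTimeoutMs (%d) should be greater than SessionTimeoutMs (%d)"
            % (client_request_timeout_ms, session_timeout_ms)
        )
    return problems
```

For example, `check_timeouts(30000, 20000, 15000)` returns an empty list, while `check_timeouts(10000, 20000, 15000)` reports both settings as too large for the client timeout.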

Is it correct that the heartbeat interval is SessionTimeout / 2.0, which means that with the default session timeout of 15 seconds, I should expect the client to wake up within 7.5 seconds?

Normally yes.

I think I found the problem. You can try merging the following branch https://github.com/sdanzan/kafka-sharp/tree/group-stuck into your local repository (if you're compiling from sources) to check whether it fixes your problem.

I see, thanks.

OK, I guess it can take some time to be sure. I'll write back on whether it helps or not.