optiopay / kafka

Go driver for Kafka

Home Page: https://godoc.org/github.com/optiopay/kafka


Deadlock where Broker.mu is never released

DrTall opened this issue · comments

commented

Howdy,

@zorkian and I discovered an issue where the Broker goes braindead because some goroutine has hung forever while holding the lock. We saw many calls to PartitionCount blocked on the mutex for hours. Unfortunately we have not yet been able to conclusively reproduce this.

We suspect that the culprit might be fetchMetadata, because it looks fishy that this function makes an RPC while holding the lock and doesn't use the typical Go idiom of a timeout channel. It looks like we are trusting the TCP library to honor the timeout settings, but we don't know for sure.
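
For reference, this is the kind of timeout-channel idiom meant above; a minimal sketch with hypothetical names, not the library's actual API:

```go
package sketch

import (
	"errors"
	"time"
)

// callWithTimeout runs a blocking call in its own goroutine and bounds the
// wait with a select, so a peer that never answers cannot hold the caller
// (or any mutex the caller owns) forever. Purely illustrative.
func callWithTimeout(do func() (interface{}, error), timeout time.Duration) (interface{}, error) {
	type result struct {
		resp interface{}
		err  error
	}
	ch := make(chan result, 1) // buffered so the worker goroutine never leaks on timeout
	go func() {
		resp, err := do()
		ch <- result{resp, err}
	}()
	select {
	case r := <-ch:
		return r.resp, r.err
	case <-time.After(timeout):
		return nil, errors.New("request timed out")
	}
}
```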

Any thoughts?

Cheers

To add a little bit more color -- the broker is wedged in such a way that it can't ever produce again. All attempts to produce end up blocked and no packets are ever sent. This leads us to believe the problem is somewhere in the metadata processing.

Also, the initial incident was triggered by one of our brokers going bad in such a way that the host abruptly disappeared from the network.

I can have a look later, but without any clue about what is going wrong or how to reproduce it, the problem might be impossible to find.

commented

Right, I understand. I thought we'd open the issue just for tracking, and hopefully somebody (including us) will stumble on more details later.

I've been running into this same problem myself. Best I can tell, it goes all the way down to the "kafka.connection" level. Each kafka.connection keeps a map of correlation IDs (request IDs) to channels that hold the response for the corresponding request. Sending a request to Kafka involves the following:

  1. Getting the next Correlation ID (request ID)
  2. Registering a channel with that correlation ID in the connection's map.
  3. Sending the request tagged with the correlation ID
  4. Reading the response from the corresponding channel
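
In a rough, hypothetical sketch (the names here are illustrative and do not match the real optiopay/kafka types), the request path looks something like this:

```go
package sketch

import (
	"encoding/binary"
	"net"
	"sync"
)

// conn is a simplified model of the connection described above.
type conn struct {
	mu     sync.Mutex
	nextID int32
	respc  map[int32]chan []byte // correlation ID -> channel awaiting the response
	sock   net.Conn
}

// sendRequest follows the four steps listed above.
func (c *conn) sendRequest(req []byte) ([]byte, error) {
	// Steps 1 and 2: allocate a correlation ID and register a response channel.
	c.mu.Lock()
	c.nextID++
	id := c.nextID
	ch := make(chan []byte, 1)
	c.respc[id] = ch
	c.mu.Unlock()

	// Step 3: send the request tagged with the correlation ID
	// (toy framing: 4-byte ID, 4-byte length, payload; not the real wire format).
	hdr := make([]byte, 8)
	binary.BigEndian.PutUint32(hdr[:4], uint32(id))
	binary.BigEndian.PutUint32(hdr[4:], uint32(len(req)))
	if _, err := c.sock.Write(append(hdr, req...)); err != nil {
		return nil, err
	}

	// Step 4: wait for the reader goroutine to deliver the response.
	// If the broker never answers, this receive blocks forever.
	return <-ch, nil
}
```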

A single goroutine is in charge of reading responses off the socket and dispatching them to the corresponding requester. It does so by repeating the following steps in an infinite loop:

  1. Read the next response.
  2. Using the correlation ID in the response, look up the right channel in the connection's map
  3. Push the response to that channel
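
And the dispatcher side, continuing the same hypothetical sketch (it additionally needs the `io` import):

```go
// readLoop is the single dispatcher goroutine described above. It only ever
// delivers responses it actually reads from the socket, so a request the
// broker silently drops leaves its requester waiting on an empty channel.
func (c *conn) readLoop() {
	for {
		// Step 1: read the next response frame (same toy framing as above:
		// 4-byte correlation ID, 4-byte length, payload).
		var hdr [8]byte
		if _, err := io.ReadFull(c.sock, hdr[:]); err != nil {
			return // socket closed or broken; any waiting requesters stay blocked
		}
		id := int32(binary.BigEndian.Uint32(hdr[:4]))
		resp := make([]byte, binary.BigEndian.Uint32(hdr[4:]))
		if _, err := io.ReadFull(c.sock, resp); err != nil {
			return
		}

		// Step 2: look up the channel registered for this correlation ID.
		c.mu.Lock()
		ch, ok := c.respc[id]
		delete(c.respc, id)
		c.mu.Unlock()

		// Step 3: push the response to the waiting requester (the channel is
		// buffered, so this never blocks the reader).
		if ok {
			ch <- resp
		}
	}
}
```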

If a requesting goroutine successfully sends out a Kafka request on the socket, but the request never makes it to Kafka or Kafka ignores it, then the response-handling goroutine I mentioned above never reads a response for that request off the socket. Consequently, it never pushes anything onto that requester's channel, and the requester is blocked forever.

This happens quite often in my environment.

Sometimes the very first Metadata request hangs in this manner, so my application blocks before it can even start up!

It would probably make sense to add a timeout when reading from the response channel, using a select block. That way, if the read times out, optiopay/kafka can report an error to its client instead of blocking it indefinitely.
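
A rough sketch of that idea, reusing the hypothetical names from the sketch above (plus the `errors` and `time` imports); step 4 of the request path would become:

```go
	// Step 4, revisited: bound the wait on the response channel so a lost
	// response surfaces as an error instead of blocking the requester forever.
	// respTimeout is an assumed value here, e.g. a minute or a configurable knob.
	select {
	case resp := <-ch:
		return resp, nil
	case <-time.After(respTimeout):
		// Unregister so a very late response doesn't leak the map entry; the
		// reader then either finds no entry and drops the response, or sends
		// into the buffered channel without blocking.
		c.mu.Lock()
		delete(c.respc, id)
		c.mu.Unlock()
		return nil, errors.New("no response within timeout")
	}
```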

@keep94 Good spot. Here's a very raw code change, just to figure out whether the changes are going in the right direction: f163868

It has not been tested yet, and I cannot work on it this week.

Would it be worth making the timeout configurable instead of hard-coding it to 1 minute?
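
For illustration only, a configurable timeout might look something like this (hypothetical names, not the current API; the one-minute default mirrors the hard-coded value in the draft change):

```go
package sketch

import "time"

// ConnConf is a hypothetical configuration struct exposing the response
// timeout as a knob with a sane default.
type ConnConf struct {
	// RespTimeout bounds how long a requester waits for a response before
	// giving up. Zero means "use the default".
	RespTimeout time.Duration
}

func (c ConnConf) respTimeout() time.Duration {
	if c.RespTimeout > 0 {
		return c.RespTimeout
	}
	return time.Minute
}
```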


It probably does. There is no reason why I picked one minute and not any other value. As I said, this commit still requires a lot of work and I cannot get to it anytime soon. Sorry.