confluentinc / librdkafka

The Apache Kafka C/C++ library

Crash in rd_kafka_broker_add_logical

JSoet opened this issue · comments

Description

We have a consistent crash in our application, which we're able to reproduce with the following stack trace:

(+0x12cc64c)[0x55d8305c664c] rdkafka_broker.c:5754 (discriminator 3)	rd_kafka_broker_add_logical
(+0x13189f9)[0x55d8306129f9] rdkafka_cgrp.c:445	rd_kafka_cgrp_new
(+0x12b499a)[0x55d8305ae99a] rdkafka.c:2415	rd_kafka_new
(+0x12a5376)[0x55d83059f376] KafkaConsumerImpl.cpp:66	RdKafka::KafkaConsumer::create(RdKafka::Conf const*, std::string&)

How to reproduce

The crash is reproduced in our application through an automated test...

In our application we wrap the RdKafka::KafkaConsumer in our own internal class, creating one instance per topic we connect to. When a configuration change happens, we reset all our internal classes: we wait for any in-flight message processing to finish, call close on the RdKafka::KafkaConsumer handle, clean up the pointer, and eventually destroy the class. Once the instance is destroyed, we create a new instance via RdKafka::KafkaConsumer::create.

To reproduce the crash, we continuously make changes to our application that cause all our wrapper classes to be destroyed and recreated. The error doesn't happen right away: our automated test usually runs for about 90 minutes of such changes before it eventually hits the error.

Looking at the line where the assertion fires, it seems librdkafka is failing to create a thread:
rd_assert(rkb && *"failed to create broker thread");

Originally our code didn't wait until the previous handle was fully closed and deleted before creating a new one, so I assumed that was why we were running out of threads. However, I've now added extra locking to ensure we wait until the previous instance is fully finished and deleted before starting another one, and I'm still running into the same issue. I've also kept a running check of the number of threads while the test runs (using ps -o nlwp <pid>), and it never gets anywhere near the limit shown for the process in ulimit or /proc/<pid>/limits.

I'd appreciate any advice on how to look into the issue. Unfortunately I can't share the code, and doubt I'll be able to make a reproducible instance of the problem...

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): Reproduced with librdkafka 1.9.2 and 1.3.0 (we've reproduced it on 2 different versions of our application, which have different librdkafka versions)
  • Apache Kafka version: <REPLACE with e.g., 0.10.2.3>
  • librdkafka client configuration: None
  • Operating system: RHEL 7 and RHEL 8
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue

It turned out this issue was caused by our code not releasing a thread handle. We use shared pointers, and a circular reference caused some objects to never be destroyed, so the threads they held were never joined or detached. Even though each thread had run to completion, a handle was still stored for it, and we eventually ran out of handles, which caused the assertion. I suspect that because the thread had completed and was no longer running, it didn't show up in the ps -o nlwp <pid> output. Once we fixed the circular reference so the thread is properly cleaned up, we haven't seen this issue again.