confluentinc / librdkafka

The Apache Kafka C/C++ library

Sporadic crash in rd_kafka_buf_callback()

GerKr opened this issue

Description

In rare cases librdkafka.dll crashes. The crash dump shows an invalid memory access
while EnterCriticalSection() is executing. For details, see the section
"How to reproduce" below.

How to reproduce

Because the crash happens very rarely, I could not reproduce it.
However, I analysed the crash dump and reconstructed the following call stack, with some manually added notes.
The crash dump comes from librdkafka v1.6.1, so the line numbers refer to that version.
The line marked with "===>" was never reached when I tried to reproduce the error.

rd_kafka_broker_ops_serve() rdkafka_broker.c:3345 -> 3351
case RD_KAFKA_OP_TERMINATE
rd_kafka_broker_op_serve() rdkafka_broker.c:2950 -> 3276
rd_kafka_broker_fail(rkb, LOG_DEBUG, RD_KAFKA_RESP_ERR__DESTROY, "Client is terminating") rdkafka_broker.c:520 -> 577
rd_kafka_bufq_purge(..., 2. param: rd_kafka_bufq_t *rkbufq=&tmpq_waitresp, ...) rdkafka_buf.c:245 -> 256
TAILQ_FOREACH_SAFE(rkbuf, &rkbufq->rkbq_bufs, rkbuf_link, tmp) rdkafka_buf.c:255
===> rd_kafka_buf_callback(..., 5.param: rd_kafka_buf_t *request=rkbuf) rdkafka_buf.c:450 -> 495
rd_kafka_buf_destroy(rkbuf=request) rdkafka_buf.h:804 macro
=>
rd_refcnt_destroywrapper(REFCNT=&(rkbuf)->rkbuf_refcnt, ...) rd.h:355 macro
=>
rd_refcnt_sub(R=REFCNT) rd.h:401 macro
=>
rd_refcnt_sub0(rd_refcnt_t * R) rd.h:325 -> 328
mtx_lock(&R->lock)
EnterCriticalSection()

Additional info:
The crash inside EnterCriticalSection() can be reproduced exactly with a simple program
that calls EnterCriticalSection() without first calling InitializeCriticalSection().
This appears to be what happens when the buffer queue still contains buffers and the marked line of the call stack is executed.
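
To illustrate why destroying a request buffer ends up inside a lock at all, here is a minimal sketch of a mutex-protected reference count of the kind the call stack suggests. All `sketch_*` names are hypothetical and are not librdkafka's definitions, and a pthread mutex stands in for the Windows critical section so that the sketch is portable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not librdkafka's actual definitions. */
typedef struct sketch_refcnt_s {
        pthread_mutex_t lock; /* must be initialized before any add/sub */
        int v;                /* current reference count */
} sketch_refcnt_t;

typedef struct sketch_buf_s {
        sketch_refcnt_t refcnt;
} sketch_buf_t;

/* Initialize the lock and set the starting count. Returns 0 on success. */
static int sketch_refcnt_init(sketch_refcnt_t *r, int v) {
        assert(r != NULL);
        if (pthread_mutex_init(&r->lock, NULL) != 0)
                return -1;
        r->v = v;
        return 0;
}

/* Take one more reference; returns the new count. */
static int sketch_refcnt_add(sketch_refcnt_t *r) {
        int v;
        pthread_mutex_lock(&r->lock);
        v = ++r->v;
        pthread_mutex_unlock(&r->lock);
        return v;
}

/* Drop one reference; returns the new count. */
static int sketch_refcnt_sub(sketch_refcnt_t *r) {
        int v;
        /* This lock is the step that corresponds to EnterCriticalSection()
         * in the reported call stack. It is only valid if
         * sketch_refcnt_init() ran earlier on this object. */
        pthread_mutex_lock(&r->lock);
        v = --r->v;
        pthread_mutex_unlock(&r->lock);
        return v;
}

/* Allocate a buffer holding one reference, or NULL on failure. */
static sketch_buf_t *sketch_buf_new(void) {
        sketch_buf_t *b = calloc(1, sizeof(*b));
        if (b == NULL)
                return NULL;
        if (sketch_refcnt_init(&b->refcnt, 1) != 0) {
                free(b);
                return NULL;
        }
        return b;
}

/* Drop a reference. Returns 1 if the buffer was freed, 0 otherwise. */
static int sketch_buf_destroy(sketch_buf_t *b) {
        if (sketch_refcnt_sub(&b->refcnt) > 0)
                return 0;
        pthread_mutex_destroy(&b->refcnt.lock);
        free(b);
        return 1;
}
```

In this sketch every destroy call has to take the lock before it can decrement the count, so a buffer whose reference count was never initialized (or that was already freed) would fail at exactly that point.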

IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/confluentinc/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.
As I don't know how to reproduce the situation in which buffers are still queued during the purge of the kafka-bufq, I can't tell
whether the error is still present in the latest version. A source comparison of v1.6.1 against v2.3.0 did not show any change that addresses this.

Proposal for making the code more defensive:
In mtx_init(), record that the initialization has taken place.
In mtx_lock(), check whether the initialization has been done. If not, perform the initialization implicitly.
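
The proposal could look roughly like the following sketch. It is an assumption-laden illustration, not librdkafka code: the `checked_mtx_*` names and the marker value are hypothetical, and a pthread mutex is used in place of the Windows critical section. Unlike the proposal, the sketch reports an error instead of initializing implicitly, because lazy initialization inside the lock call would itself be racy if two threads locked at the same time.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical marker value stored once the mutex has been initialized. */
#define CHECKED_MTX_MAGIC 0x4D545831u

/* Hypothetical wrapper: a mutex that remembers whether it was initialized. */
typedef struct checked_mtx_s {
        uint32_t magic;       /* CHECKED_MTX_MAGIC after a successful init */
        pthread_mutex_t lock;
} checked_mtx_t;

/* Initialize the mutex and record that fact. Returns 0 on success. */
static int checked_mtx_init(checked_mtx_t *m) {
        assert(m != NULL);
        if (pthread_mutex_init(&m->lock, NULL) != 0)
                return -1;
        m->magic = CHECKED_MTX_MAGIC;
        return 0;
}

/* Returns non-zero if checked_mtx_init() has run on this object. */
static int checked_mtx_is_initialized(const checked_mtx_t *m) {
        return m->magic == CHECKED_MTX_MAGIC;
}

/* Lock the mutex. Returns -1 instead of touching an uninitialized lock. */
static int checked_mtx_lock(checked_mtx_t *m) {
        if (!checked_mtx_is_initialized(m))
                return -1;
        return pthread_mutex_lock(&m->lock) == 0 ? 0 : -1;
}

/* Unlock the mutex. Returns -1 if it was never initialized. */
static int checked_mtx_unlock(checked_mtx_t *m) {
        if (!checked_mtx_is_initialized(m))
                return -1;
        return pthread_mutex_unlock(&m->lock) == 0 ? 0 : -1;
}

/* Destroy the mutex and clear the marker so later use is detected. */
static void checked_mtx_destroy(checked_mtx_t *m) {
        if (!checked_mtx_is_initialized(m))
                return;
        m->magic = 0;
        pthread_mutex_destroy(&m->lock);
}
```

Such a check is only dependable when the memory holding the mutex is zeroed at allocation, as with calloc(). Uninitialized or already freed memory could contain arbitrary bytes, and the marker is read without synchronization, so this is a diagnostic aid rather than a fix for the underlying cause.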

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version: v1.6.1, and most probably also v2.3.0
  • Apache Kafka version: N/A
  • librdkafka client configuration: N/A
  • Operating system: Windows Server 2019, Windows 10, Windows 11
  • Provide logs: call stack - see in "How to reproduce"
  • Provide broker log excerpts: N/A
  • Critical issue: crash kills the complete application