confluentinc / librdkafka

The Apache Kafka C/C++ library

Cooperative sticky assignor on large partition counts

ericwuseattle opened this issue · comments

Description

There are two issues I noticed with Kafka's cooperative-sticky mode.

  1. The hard-coded partition_cnt inside rd_kafka_sticky_assignor_assign_cb:
    https://github.com/confluentinc/librdkafka/blob/master/src/rdkafka_sticky_assignor.c#L1834

  2. With 3K partitions it works without issue, but if I increase the partition count to 6K on a fresh topic (I mean recreate the topic as a new one), I have to increase session.timeout.ms and max.poll.interval.ms from 3s (3000) to 10s (10000) to make it work.
    Otherwise the consumer gets kicked out of the group.
    Broker logs:
    Member XXX-6F958DDF5F-CDXRQ~-0793c679-d5ef-4753-9056-7da314e1415b in group XXX-TOPIC-NAME-XXX has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator).

I'm not sure what's causing the timeout, but I'm sure we keep calling Kafka poll on an infinite timer. After increasing the timeouts to 10s it works without any issue.
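For what it's worth, a minimal sketch of the kind of timer-driven poll we run (the function name and zero timeout are assumptions, not our exact code):

#include <librdkafka/rdkafka.h>

/* Invoked from a periodic timer; drains whatever is currently
 * ready without blocking (timeout_ms = 0). */
static void on_poll_timer(rd_kafka_t *rk) {
        rd_kafka_message_t *msg;
        while ((msg = rd_kafka_consumer_poll(rk, 0))) {
                /* ... handle the message or the error it carries ... */
                rd_kafka_message_destroy(msg);
        }
}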

Going further, with 15K partitions and 10s timeouts, no luck: it would not work and the consumer gets kicked out of the group.

Overall:

  • 3K partitions, 3s timeout: works.
  • 6K partitions, 3s timeout: does not work.
  • 6K partitions, 10s timeout: works.
  • 15K partitions, 10s timeout: does not work.

How to reproduce

Use a large number of partitions:
6K partitions with a 3s session timeout
or
15K partitions with a 10s session timeout.

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): <2.3.0>
  • Apache Kafka version: <3.0>
  • librdkafka client configuration: <fetch.min.bytes=1, fetch.wait.max.ms=500, fetch.error.backoff.ms=0, heartbeat.interval.ms=1000, enable.auto.commit=false, enable.partition.eof=false, enable.auto.offset.store=false, max.poll.interval.ms=3000, session.timeout.ms=3000, partition.assignment.strategy=cooperative-sticky>
  • Operating system: <Ubuntu (x64)>
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue

Any thoughts on this problem? Are more details needed?

@ericwuseattle could you send some logs with debug=all? It's possible that those values need to be increased for a rebalance with that many partitions, but from the logs we can see where most of the time goes.

Unfortunately I do not have the test environment set up at hand right now. Have you checked the hard-coded partition count in the code? If we could fix that part first, then I'll find time to retry it.

/* FIXME: Let the cgrp pass the actual eligible partition count */
size_t partition_cnt = member_cnt * 10; /* FIXME */

https://github.com/confluentinc/librdkafka/blob/master/src/rdkafka_sticky_assignor.c#L1834

Given your configuration you're not using the sticky assignor:

fetch.min.bytes=1, fetch.wait.max.ms=500, fetch.error.backoff.ms=0, heartbeat.interval.ms=1000, enable.auto.commit=false, enable.partition.eof=false, enable.auto.offset.store=false, max.poll.interval.ms=3000, session.timeout.ms=3000

The default partition.assignment.strategy doesn't include cooperative-sticky. Could you update your configuration if you're setting it?
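For reference, a minimal sketch of enabling it through the C client's configuration API (error handling trimmed to the essentials; the surrounding consumer setup is assumed):

#include <stdio.h>
#include <librdkafka/rdkafka.h>

char errstr[512];
rd_kafka_conf_t *conf = rd_kafka_conf_new();

/* Opt in to cooperative incremental rebalancing. */
if (rd_kafka_conf_set(conf, "partition.assignment.strategy",
                      "cooperative-sticky",
                      errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK)
        fprintf(stderr, "%s\n", errstr);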

@ericwuseattle

/* FIXME: Let the cgrp pass the actual eligible partition count */
size_t partition_cnt = member_cnt * 10; /* FIXME */

that is just the estimated partition count used for the initial size of maps and lists. You can try to increase the multiplier and see if it changes something, and send some logs from the leader and 2-3 random members.
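As an illustration, a hypothetical local patch for that experiment (the 1000 multiplier is an arbitrary assumption, just large enough for thousands of partitions per member):

/* Hypothetical experiment in rd_kafka_sticky_assignor_assign_cb():
 * enlarge the pre-allocation estimate used to size the internal
 * maps and lists. */
size_t partition_cnt = member_cnt * 1000; /* was: member_cnt * 10 */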

Given your configuration you're not using the sticky assignor: fetch.min.bytes=1, fetch.wait.max.ms=500, fetch.error.backoff.ms=0, heartbeat.interval.ms=1000, enable.auto.commit=false, enable.partition.eof=false, enable.auto.offset.store=false, max.poll.interval.ms=3000, session.timeout.ms=3000

The default partition.assignment.strategy doesn't include cooperative-sticky. Could you update your configuration if you're setting it?

Sorry, I did not give you the full config, but we do have
partition.assignment.strategy=cooperative-sticky
set. I'll update the config in the checklist.

@ericwuseattle thanks, other helpful info would be:

  • whether you have a rebalance callback set; in that case please test without it
  • how many members are in the group, and whether they're all subscribed to that topic with 3-6-15K partitions or to other topics too

I'm not sure what's causing the timeout, but I'm sure we keep calling Kafka poll on an infinite timer. After increasing the timeouts to 10s it works without any issue.

You have to set the callback to call the incremental partition assign/revoke client API. Besides that we have some internal logic of our own, but it only posts tasks to another thread, so it would not block or cost much CPU in the Kafka callback worker thread.
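For context, a minimal sketch of such a callback following the usual librdkafka cooperative pattern (registered via rd_kafka_conf_set_rebalance_cb(); our internal task-posting logic is omitted):

#include <string.h>
#include <librdkafka/rdkafka.h>

static void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                         rd_kafka_topic_partition_list_t *partitions,
                         void *opaque) {
        if (!strcmp(rd_kafka_rebalance_protocol(rk), "COOPERATIVE")) {
                /* Incremental protocol: only the listed partitions
                 * are added to or removed from the current assignment. */
                if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS)
                        rd_kafka_incremental_assign(rk, partitions);
                else
                        rd_kafka_incremental_unassign(rk, partitions);
        } else {
                /* Eager protocol: the full assignment is replaced. */
                if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS)
                        rd_kafka_assign(rk, partitions);
                else
                        rd_kafka_assign(rk, NULL);
        }
}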

Going further, with 15K partitions and 10s timeouts, no luck: it would not work and the consumer gets kicked out of the group.

30 members in total, only 1 topic.