[Bug Report]: Dynamic Worker Pool Scaling not thread safe and causing duplicate pool starts

gnjack opened this issue 7 months ago · comments

Jack Tench commented 7 months ago

Prerequisites

I have searched issues to ensure it has not already been reported

Description

There appear to be some issues with the current implementation of Dynamic Worker Pool Scaling (#429).

Starting pools before partitions assigned

If a consumer is not allocated any partitions before Configuration.WorkersCountEvaluationInterval has passed since starting, the worker pool will be started by that timer. If partitions are later allocated to the consumer, the worker pool will be started again without stopping it, leaving it in an invalid state.

Restarting stopped pools

If the worker count changes after partitions have been revoked, the stopped worker pool will be restarted. If partitions are later assigned, the worker pool will then be started a second time without being stopped first, leaving it in an invalid state.

Concurrently starting / stopping a pool

Starting / stopping a worker pool when partitions are assigned / revoked may take a while, as we will wait up to WorkerStopTimeout for in-flight messages to be handled. If the worker count changes during that window, we may perform concurrent starts / stops of the same pool using non-thread-safe code, corrupting its state.
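To illustrate why a simple "already running" flag would not be enough here, below is a minimal sketch in Python (chosen only for brevity; KafkaFlow itself is C#). `UnlockedPool` and its members are hypothetical names for illustration and do not reflect KafkaFlow's actual implementation. It models a check-then-act start where the slow startup yields control between the check and the update.

```python
import asyncio


class UnlockedPool:
    """Hypothetical pool with an 'already running' check but no lock."""

    def __init__(self):
        self.running = False
        self.workers = []

    async def start(self, count):
        if self.running:  # check ...
            return
        await asyncio.sleep(0)  # ... slow startup yields control here ...
        self.workers.extend(range(count))  # ... then act
        self.running = True


async def race():
    pool = UnlockedPool()
    # One start from a partition assignment, one from the scaling timer.
    await asyncio.gather(pool.start(10), pool.start(10))
    return pool.workers


workers = asyncio.run(race())
print(len(workers))  # prints 20
```

Both callers pass the check before either sets the flag, so the pool ends up with two sets of workers. The check and the state change would need to happen under the same lock.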

Steps to reproduce

Let's say we have a consumer configured with .WithWorkersCount(10). This will set the worker pool to a fixed size of 10 workers, with a workers count evaluation interval of the default 5 minutes.

Working Scenario - Partitions assigned within 5 minutes
Partition assignment triggers ConsumerManager.OnPartitionAssigned which calls WorkerPool.StartAsync, setting WorkerPool.CurrentWorkersCount to 10 and creating and starting 10 workers.

5 minutes later EvaluateWorkersCountAsync runs. WorkerPool.CurrentWorkersCount is equal to the configured number of workers so it returns without doing anything.

Failure Scenario - Partitions assigned after 5 minutes
If no partitions are assigned to the consumer within 5 minutes of starting, when EvaluateWorkersCountAsync runs, WorkerPool.CurrentWorkersCount is 0, so ChangeWorkersCountAsync runs. This restarts the worker pool, which calls WorkerPool.StartAsync even when no partitions are assigned, creating and starting 10 workers.

If partitions are later assigned, ConsumerManager.OnPartitionAssigned runs and calls WorkerPool.StartAsync on a pool that is already started. This creates another 10 workers without stopping the existing 10 workers.

When debugging / analysing a memory dump of the process, 20 workers exist in ConsumerWorkerPool._workers instead of 10. These workers have the worker ids: 0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9
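The failure scenario can be modelled in a few lines. This is a simplified Python sketch of the sequence described above, not KafkaFlow's code: `NaivePool`, `evaluate_workers_count` and `on_partitions_assigned` are hypothetical stand-ins for WorkerPool, EvaluateWorkersCountAsync and ConsumerManager.OnPartitionAssigned, and the assumption is that starting the pool simply creates a fresh set of workers.

```python
class NaivePool:
    """Hypothetical stand-in for a worker pool; not KafkaFlow's actual code."""

    def __init__(self):
        self.worker_ids = []
        self.current_workers_count = 0

    def start(self, count):
        # No guard: starting twice appends a second set of workers.
        self.worker_ids.extend(range(count))
        self.current_workers_count = count

    def stop(self):
        self.worker_ids.clear()
        self.current_workers_count = 0


CONFIGURED_WORKERS = 10


def evaluate_workers_count(pool):
    # Timer callback: restart the pool whenever the count differs.
    if pool.current_workers_count == CONFIGURED_WORKERS:
        return
    pool.stop()
    pool.start(CONFIGURED_WORKERS)


def on_partitions_assigned(pool):
    # Assignment always starts the pool, whatever its current state.
    pool.start(CONFIGURED_WORKERS)


pool = NaivePool()
evaluate_workers_count(pool)  # timer fires with no partitions assigned
on_partitions_assigned(pool)  # partitions arrive later
print(pool.worker_ids)
# prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

If the two calls are swapped (assignment first, timer second), the timer sees a matching count and does nothing, which matches the working scenario.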

Expected behavior

Worker pools should not be started unless partitions have been assigned to them. They should not be started multiple times without stopping.

Worker pool scaling should not restart a stopped pool with no assigned partitions.

Worker pool scaling should be thread-safe and should not restart a pool that is concurrently starting / stopping due to partitions being assigned / revoked.
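One possible shape for this behaviour, sketched in Python rather than C#: a single lock serialises every lifecycle change, and the scaling path only restarts a pool that currently has partitions. All names here (`GuardedPool`, `change_workers_count`, and so on) are hypothetical, and this is an illustration of the expected behaviour under those assumptions, not a proposed patch.

```python
import asyncio


class GuardedPool:
    """Hypothetical pool that serialises lifecycle changes behind one lock."""

    def __init__(self, configured_count):
        self._lock = asyncio.Lock()
        self._configured_count = configured_count
        self._has_partitions = False
        self.worker_ids = []

    async def _start(self):
        await asyncio.sleep(0)  # stands in for slow worker startup
        self.worker_ids = list(range(self._configured_count))

    async def _stop(self):
        await asyncio.sleep(0)  # stands in for draining in-flight messages
        self.worker_ids = []

    async def on_partitions_assigned(self):
        async with self._lock:
            if self.worker_ids:
                await self._stop()  # never start on top of a running pool
            self._has_partitions = True
            await self._start()

    async def on_partitions_revoked(self):
        async with self._lock:
            self._has_partitions = False
            await self._stop()

    async def change_workers_count(self, new_count):
        async with self._lock:
            self._configured_count = new_count
            # Only a pool that holds partitions is restarted.
            if not self._has_partitions or len(self.worker_ids) == new_count:
                return
            await self._stop()
            await self._start()


async def main():
    pool = GuardedPool(configured_count=10)
    await pool.change_workers_count(10)  # timer before assignment: no start
    idle = list(pool.worker_ids)
    # An assignment races with a scale-up; the lock serialises them.
    await asyncio.gather(
        pool.on_partitions_assigned(),
        pool.change_workers_count(12),
    )
    running = list(pool.worker_ids)
    await pool.on_partitions_revoked()
    await pool.change_workers_count(8)  # scaling after revoke: no restart
    return idle, running, list(pool.worker_ids)


idle, running, stopped = asyncio.run(main())
print(len(idle), len(running), len(stopped))  # prints 0 12 0
```

The new count is still recorded while the pool is stopped, so the next assignment starts with the latest size. In C# the `lock` statement cannot contain an `await`, so an async-compatible primitive such as `SemaphoreSlim` would presumably play the role of the lock here.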

Actual behavior

Worker pool scaling restarts the pool in a non-thread-safe manner, regardless of its current state.

KafkaFlow version

3.0.6

Jack Tench · Answer 1 · Mon Apr 15 2024 22:24:42 GMT+0800 (China Standard Time)

This is tangentially related to #456 - I believe the cooperative-sticky protocol will cause OnPartitionAssigned / OnPartitionRevoked to be called multiple times in a row, causing the worker pool to be started multiple times without stopping, getting it into an invalid state.

This is because with cooperative rebalancing, partition reassignment is no longer a "stop the world" event where all partitions are first revoked from all consumers, then reassigned to consumers. Partitions can be incrementally assigned / revoked one at a time without affecting the other partitions.

Rui Barbosa · Answer 2 · Wed Apr 17 2024 18:26:47 GMT+0800 (China Standard Time)

Hi @gnjack,

We'll review the issue and investigate the reported behavior.

Could you please confirm if the use case you referenced in the description utilizes the cooperative-sticky option in the configuration?

Best regards,
Rui Barbosa

Jack Tench · Answer 3 · Wed Apr 24 2024 01:05:49 GMT+0800 (China Standard Time)

No, this bug happens regardless of the Kafka partition rebalancing protocol used.

I was just mentioning that the cooperative-sticky protocol (not currently supported by KafkaFlow) would make this much worse. Fixing this bug might be related to adding support for cooperative-sticky, since both involve logic around locking on worker pool changes and handling multiple starts of the worker pool.