dashbitco / broadway_kafka

A Broadway connector for Kafka

Proper way to stop/pause a Pipeline && start again using same consumer_group_id

amacciola opened this issue

The scenario I am in is that I am starting a Broadway pipeline under a DynamicSupervisor and passing all the BroadwayKafka configs to start_link.

My application gives the option to pause ingestion of a specific topic, so the way I am currently handling that is by calling
DynamicSupervisor.terminate_child(__MODULE__, child_pid), where child_pid is the pid of the Broadway pipeline.
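
Roughly, the start/pause flow looks like this (a simplified sketch; module and supervisor names are placeholders, not my actual code):

def start_pipeline(opts) do
  # MyApp.IngestPipeline calls `use Broadway` and builds its BroadwayKafka
  # config in start_link/1.
  DynamicSupervisor.start_child(MyApp.PipelineSupervisor, {MyApp.IngestPipeline, opts})
end

def pause_pipeline(child_pid) do
  # Terminating the child stops the whole Broadway topology for that topic.
  DynamicSupervisor.terminate_child(MyApp.PipelineSupervisor, child_pid)
end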

But I am running into an issue when starting the Broadway pipeline again and passing the exact same consumer_group_id.

It logs info messages like:

12:17:49.807 [info] Group member (EventDefinition-70cf5723-1ef5-4d9b-adaa-61093f88b404-e97865e0-f787-11ea-8197-acde48001122,coor=#PID<0.2772.0>,cb=#PID<0.2769.0>,generation=3):
re-joining group, reason::rebalance_in_progress
12:17:49.808 [info] Group member (EventDefinition-70cf5723-1ef5-4d9b-adaa-61093f88b404-e97865e0-f787-11ea-8197-acde48001122,coor=#PID<0.2772.0>,cb=#PID<0.2769.0>,generation=3):
Leaving group, reason: {:noproc, {GenServer, :call, [#PID<0.2769.0>, :drain_after_revoke, :infinity]}}

12:17:49.808 [info] Group member (EventDefinition-70cf5723-1ef5-4d9b-adaa-61093f88b404-e97865e0-f787-11ea-8197-acde48001122,coor=#PID<0.2776.0>,cb=#PID<0.2773.0>,generation=3):
re-joining group, reason::rebalance_in_progress
12:17:49.809 [info] Group member (EventDefinition-70cf5723-1ef5-4d9b-adaa-61093f88b404-e97865e0-f787-11ea-8197-acde48001122,coor=#PID<0.2776.0>,cb=#PID<0.2773.0>,generation=3):
Leaving group, reason: {:noproc, {GenServer, :call, [#PID<0.2773.0>, :drain_after_revoke, :infinity]}}

Then it logs errors like:

12:17:49.821 [error] GenServer #PID<0.2772.0> terminating
** (stop) exited in: GenServer.call(#PID<0.2769.0>, :drain_after_revoke, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.3) lib/gen_server.ex:1023: GenServer.call/3
    (broadway_kafka 0.1.4) lib/producer.ex:415: BroadwayKafka.Producer.assignments_revoked/1
    (brod 3.14.0) /Users/amacciola/Desktop/CogilityDev/cogynt-workstation-ingest/deps/brod/src/brod_group_coordinator.erl:477: :brod_group_coordinator.stabilize/3
    (brod 3.14.0) /Users/amacciola/Desktop/CogilityDev/cogynt-workstation-ingest/deps/brod/src/brod_group_coordinator.erl:391: :brod_group_coordinator.handle_info/2
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

All of this happens before the pipeline eventually starts successfully. But the problem is that it starts with partition=2 begin_offset=undefined, so it re-ingests all the Kafka data instead of resuming from the last committed offset.

Any help would be appreciated!

It seems your supervision tree is not terminating correctly. Do you see any errors, or is the offset persisted the first time you stop it? Can you try reproducing the issue with a regular topology?

You can also try to stop it by passing use Broadway, restart: :temporary and then calling GenServer.stop. Also, please let us know your broadway and broadway_kafka versions. Make sure you are on the latest!
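
Something along these lines (a rough sketch, names are placeholders):

defmodule MyApp.IngestPipeline do
  # :temporary so the supervisor does not restart the pipeline after an
  # intentional stop
  use Broadway, restart: :temporary

  # ... start_link/1, handle_message/3, etc.
end

# Assuming the pipeline was started with name: MyApp.IngestPipeline,
# stop it so the topology can shut down gracefully:
GenServer.stop(MyApp.IngestPipeline, :shutdown)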

@josevalim thanks for the help.

My versions are:

broadway 0.6.2
broadway_kafka 0.1.4
brod 3.14.0

It seems your supervision tree is not terminating correctly. Do you see any errors, or is the offset persisted the first time you stop it? Can you try reproducing the issue with a regular topology?

I do not see any errors when I use DynamicSupervisor.terminate_child/2; it returns an :ok response. And you mean whether the offset was persisted in Kafka for that specific consumer_group_id, yes? Also, what do you mean by reproducing with a regular topology?

Broadway should persist the offset when the topology terminates, but for some reason it isn't. By regular topology I meant one outside of a DynamicSupervisor, where you start it regularly and terminate by calling System.stop. Basically, let's try to find a minimal way to reproduce the issue. :) What is your offset config?
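
For the regular topology, something as plain as this (module names are placeholders):

# in application.ex, started directly under the application supervisor
# instead of a DynamicSupervisor
children = [
  MyApp.DrilldownPipeline
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)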

@josevalim Alright, I can set that up real quick and test it out, and I will post the results here. But when you say calling System.stop, do you mean stopping the entire application's supervision tree?

Entire BroadwayKafka start_link config:

    Broadway.start_link(__MODULE__,
      name: String.to_atom(group_id <> "Pipeline"),
      producer: [
        module:
          {BroadwayKafka.Producer,
           [
             hosts: hosts,
             group_id: group_id,
             topics: topics,
             offset_commit_on_ack: true,
             offset_reset_policy: :earliest,
             group_config: [
               session_timeout_seconds: 15
             ],
             fetch_config: [
               # 3 MB
               max_bytes: 3_145_728
             ],
             client_config: [
               # 15 seconds
               connect_timeout: 15000
             ]
           ]},
        concurrency: 10,
        transformer:
          {__MODULE__, :transform, [group_id: group_id, event_definition_id: event_definition_id]}
      ],
      processors: [
        default: [
          concurrency: Config.event_processor_stages()
        ]
      ],
      context: [event_type: event_type]
    )
  end

@josevalim So I took one of my less complicated pipelines that uses the same configs and added it directly to the application supervision tree.

In this screenshot, 1 is the pipeline started under the app supervision tree and 2 is the pipeline started under the DynamicSupervisor.
[Screenshot: Screen Shot 2020-09-15 at 1 30 29 PM]

When testing with GenServer.stop/2 I got the same result:

iex(5)> drilldown_pid = Process.whereis(:DrilldownPipeline)
#PID<0.890.0>
iex(6)> GenServer.stop(drilldown_pid, :shutdown)
:ok
iex(7)> 13:38:20.648 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.897.0>,cb=#PID<0.894.0>,generation=1):
re-joining group, reason::rebalance_in_progress
13:38:20.648 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.897.0>,cb=#PID<0.894.0>,generation=1):
Leaving group, reason: {:noproc, {GenServer, :call, [#PID<0.894.0>, :drain_after_revoke, :infinity]}}

eventually throwing this error:

13:38:20.665 [error] GenServer #PID<0.905.0> terminating
** (stop) exited in: GenServer.call(#PID<0.902.0>, :drain_after_revoke, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.3) lib/gen_server.ex:1023: GenServer.call/3
    (broadway_kafka 0.1.4) lib/producer.ex:415: BroadwayKafka.Producer.assignments_revoked/1
    (brod 3.14.0) /Users/amacciola/Desktop/CogilityDev/cogynt-workstation-ingest/deps/brod/src/brod_group_coordinator.erl:477: :brod_group_coordinator.stabilize/3
    (brod 3.14.0) /Users/amacciola/Desktop/CogilityDev/cogynt-workstation-ingest/deps/brod/src/brod_group_coordinator.erl:391: :brod_group_coordinator.handle_info/2
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

and then starting a new pipeline with undefined offsets and re-ingesting all the data.

When I tested with System.stop/1, it killed the application the first time, and when I restarted the application it started up and logged:

13:43:55.838 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.884.0>,cb=#PID<0.881.0>,generation=7):
elected=true
13:43:55.838 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.896.0>,cb=#PID<0.893.0>,generation=7):
elected=false
13:43:55.838 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.896.0>,cb=#PID<0.893.0>,generation=7):
failed to join group
reason: :rebalance_in_progress
13:43:55.838 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.896.0>,cb=#PID<0.893.0>,generation=7):
re-joining group, reason::rebalance_in_progress
13:43:55.839 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.884.0>,cb=#PID<0.881.0>,generation=7):
failed to join group
reason: :rebalance_in_progress
13:43:55.839 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.884.0>,cb=#PID<0.881.0>,generation=7):
re-joining group, reason::rebalance_in_progress
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.896.0>,cb=#PID<0.893.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.888.0>,cb=#PID<0.885.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.892.0>,cb=#PID<0.889.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.876.0>,cb=#PID<0.873.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.880.0>,cb=#PID<0.877.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.872.0>,cb=#PID<0.869.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.884.0>,cb=#PID<0.881.0>,generation=8):
elected=true
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.900.0>,cb=#PID<0.897.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.868.0>,cb=#PID<0.865.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.864.0>,cb=#PID<0.861.0>,generation=8):
elected=false
13:43:55.843 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.888.0>,cb=#PID<0.885.0>,generation=8):
assignments received:
  template_solution_events:
    partition=7 begin_offset=undefined
  template_solutions:
    partition=7 begin_offset=undefined
13:43:55.843 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.876.0>,cb=#PID<0.873.0>,generation=8):
assignments received:

and it proceeded to create a new pipeline with the same consumer_group_id, but again with an undefined offset, so it re-ingested all the data.

Btw, at least something was processed in both cases, right? Other things to try out: use brod 3.10 and see if it changes anything, and try switching offset_commit_on_ack. Thanks!

@josevalim Yes, when the pipeline initially comes up it ingests the data as it should. It is when the pipeline is restarted that the issues happen. I will try downgrading the version of brod. I will also test it out with offset_commit_on_ack: false; however, having that set to true is one of my major needs, so if that does not work it would be a reason for me to look elsewhere.

Edit:
Also, just to note, I did override the kafka_protocol version to:

{:broadway_kafka, "~> 0.1.0", override: true},
{:kafka_protocol, "~> 2.4.1", override: true},

because if I use the version that brod 3.10 uses, my application will not compile.

Much appreciated @josevalim

Just to add some more information. Here is a screenshot describing the consumer group:

  1. when first starting the pipeline
  2. when stopping the pipeline
  3. when starting the pipeline again with the same consumer_group_id

[Screenshot: Screen Shot 2020-09-17 at 2 36 47 PM]

It does not look like it is removing the consumer group; it is just shutting down all of its members. So yeah, it just feels like the offsets are not being persisted.

@josevalim So I cloned BroadwayKafka, added some logs, and did some testing, and I think I found the main issue. In my pipelines I was defining my own ack callback and doing some work once a message was ack'd. It seems that since I was defining my own, the BroadwayKafka acknowledgers were not being called, and therefore the offsets were not being committed.
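
Roughly, the pattern was along these lines (a simplified sketch, not the exact code; do_post_processing/1 stands in for the work I run per message):

# In the transformer, each message got my own acknowledger attached, which
# replaces the {BroadwayKafka.Producer, ...} acknowledger that commits offsets:
def transform(message, opts) do
  %Broadway.Message{
    data: message.data,
    acknowledger: {__MODULE__, :ack_ref, opts}
  }
end

# So this ack/3 ran instead of BroadwayKafka's, and no offset was committed.
def ack(:ack_ref, successful, _failed) do
  Enum.each(successful, &do_post_processing/1)
end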

I have got it working with my ack callback commented out. But now I am missing the logic that I had been running at the end of each acked message. There is no way to also define my own with this library, is there?

@amacciola When you set your own, you can store the lib one and call it. However, I would suggest simply calling ack_immediately and then executing your ack logic, without changing the message's ack fields.

In any case, since this is not a lib bug, I will close this. Thanks for the follow up!

@josevalim

However, I would suggest simply calling ack_immediately and then executing your ack logic, without changing the message's ack fields.

I am not quite sure what you mean by this. Could you elaborate, please?

Instead of changing the ack fields, you can use Broadway.Message.ack_immediately and then do whatever you want right after calling it. Basically, Broadway has ways for you to force an ack to happen at a certain moment, precisely so you don't have to mess with the ack fields.
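
Something like this inside handle_message/3 (a rough sketch; do_custom_post_ack_work/1 stands in for your custom logic):

@impl true
def handle_message(_processor, message, _context) do
  # Forces the message's original BroadwayKafka acknowledger to run right
  # away, committing the offset for this message.
  message = Broadway.Message.ack_immediately(message)

  # Then run the custom per-message logic that used to live in the ack callback.
  do_custom_post_ack_work(message.data)

  message
end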