dashbitco / broadway_kafka

A Broadway connector for Kafka

Proper way to stop/pause a Pipeline && start again using same consumer_group_id

amacciola opened this issue

The scenario I am in is that I am starting a Broadway pipeline under a DynamicSupervisor and passing all the BroadwayKafka configs to start_link.

My application gives the option to pause ingestion of a specific topic, so the way I am currently handling that is by calling
DynamicSupervisor.terminate_child(__MODULE__, child_pid), where child_pid is the pid of the Broadway pipeline.
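
Roughly, the start/pause flow looks like this (a simplified sketch; module and supervisor names are placeholders, not my actual code):

def start_pipeline(opts) do
  # MyApp.IngestPipeline calls `use Broadway` and builds its BroadwayKafka
  # config in start_link/1.
  DynamicSupervisor.start_child(MyApp.PipelineSupervisor, {MyApp.IngestPipeline, opts})
end

def pause_pipeline(child_pid) do
  # Terminating the child stops the whole Broadway topology for that topic.
  DynamicSupervisor.terminate_child(MyApp.PipelineSupervisor, child_pid)
end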

But I am running into an issue when starting the Broadway pipeline again and passing the exact same consumer_group_id.

It logs info messages like:

12:17:49.807 [info] Group member (EventDefinition-70cf5723-1ef5-4d9b-adaa-61093f88b404-e97865e0-f787-11ea-8197-acde48001122,coor=#PID<0.2772.0>,cb=#PID<0.2769.0>,generation=3):
re-joining group, reason::rebalance_in_progress
12:17:49.808 [info] Group member (EventDefinition-70cf5723-1ef5-4d9b-adaa-61093f88b404-e97865e0-f787-11ea-8197-acde48001122,coor=#PID<0.2772.0>,cb=#PID<0.2769.0>,generation=3):
Leaving group, reason: {:noproc, {GenServer, :call, [#PID<0.2769.0>, :drain_after_revoke, :infinity]}}

12:17:49.808 [info] Group member (EventDefinition-70cf5723-1ef5-4d9b-adaa-61093f88b404-e97865e0-f787-11ea-8197-acde48001122,coor=#PID<0.2776.0>,cb=#PID<0.2773.0>,generation=3):
re-joining group, reason::rebalance_in_progress
12:17:49.809 [info] Group member (EventDefinition-70cf5723-1ef5-4d9b-adaa-61093f88b404-e97865e0-f787-11ea-8197-acde48001122,coor=#PID<0.2776.0>,cb=#PID<0.2773.0>,generation=3):
Leaving group, reason: {:noproc, {GenServer, :call, [#PID<0.2773.0>, :drain_after_revoke, :infinity]}}

Then it logs errors like:

12:17:49.821 [error] GenServer #PID<0.2772.0> terminating
** (stop) exited in: GenServer.call(#PID<0.2769.0>, :drain_after_revoke, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.3) lib/gen_server.ex:1023: GenServer.call/3
    (broadway_kafka 0.1.4) lib/producer.ex:415: BroadwayKafka.Producer.assignments_revoked/1
    (brod 3.14.0) /Users/amacciola/Desktop/CogilityDev/cogynt-workstation-ingest/deps/brod/src/brod_group_coordinator.erl:477: :brod_group_coordinator.stabilize/3
    (brod 3.14.0) /Users/amacciola/Desktop/CogilityDev/cogynt-workstation-ingest/deps/brod/src/brod_group_coordinator.erl:391: :brod_group_coordinator.handle_info/2
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

All of this happens before the pipeline eventually starts successfully. But the problem is that it starts with partition=2 begin_offset=undefined, so it re-ingests all the Kafka data instead of resuming from the last committed offset.

Any help would be appreciated!

It seems your supervision tree is not terminating correctly. Do you see any errors, or is the offset persisted the first time you stop it? Can you try reproducing the issue with a regular topology?

You can also try to stop it by passing use Broadway, restart: :temporary and then calling GenServer.stop. Also, please let us know your broadway and broadway_kafka versions. Make sure you are on the latest!
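
Something along these lines (a rough sketch, names are placeholders):

defmodule MyApp.IngestPipeline do
  # :temporary so the supervisor does not restart the pipeline after an
  # intentional stop
  use Broadway, restart: :temporary

  # ... start_link/1, handle_message/3, etc.
end

# Assuming the pipeline was started with name: MyApp.IngestPipeline,
# stop it so the topology can shut down gracefully:
GenServer.stop(MyApp.IngestPipeline, :shutdown)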

@josevalim thanks for the help.

My versions are:

broadway 0.6.2
broadway_kafka 0.1.4
brod 3.14.0

It seems your supervision tree is not terminating correctly. Do you see any errors, or is the offset persisted the first time you stop it? Can you try reproducing the issue with a regular topology?

I do not see any errors when I use DynamicSupervisor.terminate_child/2; it returns an :ok response. And you mean whether the offset was persisted in Kafka for that specific consumer_group_id, yes? Also, what do you mean by reproducing with a regular topology?

Broadway should persist the offset when the topology terminates, but for some reason it isn't. By regular topology I meant one outside of a DynamicSupervisor, where you start it regularly and terminate by calling System.stop. Basically, let's try to find a minimal way to reproduce the issue. :) What is your offset config?
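
For the regular topology, something as plain as this (module names are placeholders):

# in application.ex, started directly under the application supervisor
# instead of a DynamicSupervisor
children = [
  MyApp.DrilldownPipeline
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)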

@josevalim Alright, I can set that up real quick and test it out, and I will post the results here. But when you say calling System.stop, do you mean stopping the entire application's supervision tree?

Entire BroadwayKafka start_link config:

    Broadway.start_link(__MODULE__,
      name: String.to_atom(group_id <> "Pipeline"),
      producer: [
        module:
          {BroadwayKafka.Producer,
           [
             hosts: hosts,
             group_id: group_id,
             topics: topics,
             offset_commit_on_ack: true,
             offset_reset_policy: :earliest,
             group_config: [
               session_timeout_seconds: 15
             ],
             fetch_config: [
               # 3 MB
               max_bytes: 3_145_728
             ],
             client_config: [
               # 15 seconds
               connect_timeout: 15000
             ]
           ]},
        concurrency: 10,
        transformer:
          {__MODULE__, :transform, [group_id: group_id, event_definition_id: event_definition_id]}
      ],
      processors: [
        default: [
          concurrency: Config.event_processor_stages()
        ]
      ],
      context: [event_type: event_type]
    )
  end

@josevalim So I took one of my less complicated pipelines that uses the same configs and added it directly to the application supervision tree.

In this screenshot, 1 is the pipeline started under the app supervision tree and 2 is the pipeline started under the DynamicSupervisor.
[Screenshot: Screen Shot 2020-09-15 at 1 30 29 PM]

When testing with GenServer.stop/2 I got the same result:

iex(5)> drilldown_pid = Process.whereis(:DrilldownPipeline)
#PID<0.890.0>
iex(6)> GenServer.stop(drilldown_pid, :shutdown)
:ok
iex(7)> 13:38:20.648 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.897.0>,cb=#PID<0.894.0>,generation=1):
re-joining group, reason::rebalance_in_progress
13:38:20.648 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.897.0>,cb=#PID<0.894.0>,generation=1):
Leaving group, reason: {:noproc, {GenServer, :call, [#PID<0.894.0>, :drain_after_revoke, :infinity]}}

eventually throwing this error:

13:38:20.665 [error] GenServer #PID<0.905.0> terminating
** (stop) exited in: GenServer.call(#PID<0.902.0>, :drain_after_revoke, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.10.3) lib/gen_server.ex:1023: GenServer.call/3
    (broadway_kafka 0.1.4) lib/producer.ex:415: BroadwayKafka.Producer.assignments_revoked/1
    (brod 3.14.0) /Users/amacciola/Desktop/CogilityDev/cogynt-workstation-ingest/deps/brod/src/brod_group_coordinator.erl:477: :brod_group_coordinator.stabilize/3
    (brod 3.14.0) /Users/amacciola/Desktop/CogilityDev/cogynt-workstation-ingest/deps/brod/src/brod_group_coordinator.erl:391: :brod_group_coordinator.handle_info/2
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

and then starting a new pipeline with undefined offsets and re-ingesting all the data.

When I tested with System.stop/1, it killed the application the first time, and when I restarted the application it started up and logged:

13:43:55.838 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.884.0>,cb=#PID<0.881.0>,generation=7):
elected=true
13:43:55.838 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.896.0>,cb=#PID<0.893.0>,generation=7):
elected=false
13:43:55.838 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.896.0>,cb=#PID<0.893.0>,generation=7):
failed to join group
reason: :rebalance_in_progress
13:43:55.838 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.896.0>,cb=#PID<0.893.0>,generation=7):
re-joining group, reason::rebalance_in_progress
13:43:55.839 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.884.0>,cb=#PID<0.881.0>,generation=7):
failed to join group
reason: :rebalance_in_progress
13:43:55.839 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.884.0>,cb=#PID<0.881.0>,generation=7):
re-joining group, reason::rebalance_in_progress
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.896.0>,cb=#PID<0.893.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.888.0>,cb=#PID<0.885.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.892.0>,cb=#PID<0.889.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.876.0>,cb=#PID<0.873.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.880.0>,cb=#PID<0.877.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.872.0>,cb=#PID<0.869.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.884.0>,cb=#PID<0.881.0>,generation=8):
elected=true
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.900.0>,cb=#PID<0.897.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.868.0>,cb=#PID<0.865.0>,generation=8):
elected=false
13:43:55.841 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.864.0>,cb=#PID<0.861.0>,generation=8):
elected=false
13:43:55.843 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.888.0>,cb=#PID<0.885.0>,generation=8):
assignments received:
  template_solution_events:
    partition=7 begin_offset=undefined
  template_solutions:
    partition=7 begin_offset=undefined
13:43:55.843 [info] Group member (Drilldown-consumer-temp-id-1,coor=#PID<0.876.0>,cb=#PID<0.873.0>,generation=8):
assignments received:

and it proceeded to create a new pipeline with the same consumer_group_id, but again with an undefined offset, so it re-ingested all the data.

Btw, at least something was processed in both cases, right? Other things to try out: use brod 3.10 and see if it changes anything, and try switching offset_commit_on_ack. Thanks!

@josevalim Yes, when the pipeline initially comes up it ingests the data as it should. It is when the pipeline is restarted that the issues happen. I will try downgrading the version of brod. I will also test it out with offset_commit_on_ack: false; however, having that set to true is one of my major needs, so if that does not work it would be a reason for me to look elsewhere.

Edit:
Also, just to note, I did override the kafka_protocol version to:

{:broadway_kafka, "~> 0.1.0", override: true},
{:kafka_protocol, "~> 2.4.1", override: true},

because if I use the version that brod 3.10 uses, my application will not compile.

Much appreciated @josevalim

Just to add some more information. Here is a screenshot describing the consumer group:

  1. when first starting the pipeline
  2. when stopping the pipeline
  3. when starting the pipeline again with the same consumer_group_id

[Screenshot: Screen Shot 2020-09-17 at 2 36 47 PM]

It does not look like it is removing the consumer group; it is just shutting down all of its members. So yeah, it just feels like the offsets are not being persisted.

@josevalim So I cloned BroadwayKafka, added some logs, and did some testing, and I think I found the main issue. In my pipelines I was defining my own ack callback and doing some work once a message was ack'd. It seems that since I was defining my own, the BroadwayKafka acknowledgers were not being called, and therefore the offsets were not being committed.
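
Roughly, the pattern was along these lines (a simplified sketch, not the exact code; do_post_processing/1 stands in for the work I run per message):

# In the transformer, each message got my own acknowledger attached, which
# replaces the {BroadwayKafka.Producer, ...} acknowledger that commits offsets:
def transform(message, opts) do
  %Broadway.Message{
    data: message.data,
    acknowledger: {__MODULE__, :ack_ref, opts}
  }
end

# So this ack/3 ran instead of BroadwayKafka's, and no offset was committed.
def ack(:ack_ref, successful, _failed) do
  Enum.each(successful, &do_post_processing/1)
end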

I have got it working with my ack callback commented out. But now I am missing the logic that I had been running at the end of each acked message. There is no way to also define my own with this library, is there?

@amacciola When you set your own, you can store the lib one and call it. However, I would suggest simply calling ack_immediately and then executing your ack logic, without changing the message's ack fields.

In any case, since this is not a lib bug, I will close this. Thanks for the follow up!

@josevalim

However, I would suggest simply calling ack_immediately and then executing your ack logic, without changing the message's ack fields.

I am not quite sure what you mean by this. Could you elaborate, please?

Instead of changing the ack fields, you can use Broadway.Message.ack_immediately and then do whatever you want right after calling it. Basically, Broadway has ways for you to force an ack to happen at a certain moment, precisely so you don't have to mess with the ack fields.
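
Something like this inside handle_message/3 (a rough sketch; do_custom_post_ack_work/1 stands in for your custom logic):

@impl true
def handle_message(_processor, message, _context) do
  # Forces the message's original BroadwayKafka acknowledger to run right
  # away, committing the offset for this message.
  message = Broadway.Message.ack_immediately(message)

  # Then run the custom per-message logic that used to live in the ack callback.
  do_custom_post_ack_work(message.data)

  message
end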