dashbitco / broadway_cloud_pub_sub

A Broadway producer for Google Cloud Pub/Sub

One event never acked when using handle_batch

michaelst opened this issue

When using handle_batch to process messages, it appears that there is always one item that won't get processed, even though the pipeline succeeds. I am having this issue across a couple of different consumers, and it only happens when using handle_batch. We have the batch size set to 1000. Let me know if there is any additional info I can provide to help debug this.

[screenshot: Pub/Sub subscription's unacked message count, holding at 1]

The 1 -> 0 drop yesterday was me purging the subscription so it would stop delivering the message.

Hi @michaelst! Can you please let us know your versions of Broadway and this lib?

To rule out external intervention, can you try running a simple pipeline that does nothing in either handle_message or handle_batch (i.e. it should just ack the messages) and let us know if the issue persists? Meanwhile, I will do some code exploration to find possible root causes.
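For reference, a minimal sketch of such a do-nothing pipeline (the module name and subscription path are placeholders, not from this thread):

```elixir
defmodule NoOpPipeline do
  use Broadway

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # Placeholder subscription; substitute your own.
        module:
          {BroadwayCloudPubSub.Producer,
           subscription: "projects/my-project/subscriptions/my-sub"}
      ],
      processors: [default: []],
      batchers: [default: [batch_size: 1000]]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    # Pass the message through untouched.
    message
  end

  @impl true
  def handle_batch(_batcher, messages, _batch_info, _context) do
    # Returning all messages unmodified means they all get acked.
    messages
  end
end
```

If the unacked count still sticks at 1 with this pipeline, the problem is likely outside the batch callback.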

Also please check if you have any log messages in the terminal. If you accidentally drop messages in your handle_batch, Broadway should let you know about it.

I checked the logs and there is nothing

  "broadway": {:hex, :broadway, "1.0.0", "da99ca10aa221a9616ccff8cb8124510b7e063112d4593c3bae50448b37bbc90", [:mix], [{:gen_stage, "~> 1.0", [hex: :gen_stage, repo: "hexpm", optional: false]}, {:nimble_options, "~> 0.3.0", [hex: :nimble_options, repo: "hexpm", optional: false]}, {:telemetry, "~> 0.4.3 or ~> 1.0", [hex: :telemetry, repo: "hexpm", optional: false]}], "hexpm", "b86ebd492f687edc9ad44d0f9e359da70f305b6d090e92a06551cef71ec41324"},
  "broadway_cloud_pub_sub": {:hex, :broadway_cloud_pub_sub, "0.7.0", "a6ebc5ca9f020024edc3fd9ae745c47cbf754b2d1247946d1d622ab26074cafd", [:mix], [{:broadway, "~> 1.0", [hex: :broadway, repo: "hexpm", optional: false]}, {:google_api_pub_sub, "~> 0.11", [hex: :google_api_pub_sub, repo: "hexpm", optional: false]}, {:goth, "~> 1.0", [hex: :goth, repo: "hexpm", optional: true]}, {:hackney, "~> 1.6", [hex: :hackney, repo: "hexpm", optional: false]}], "hexpm", "25268afe5b81b3829883c0cf448cbdf1db88e7e3edba979ceca3936d018a23ec"},

Also, the consumer isn't doing anything with the messages; here is one of them as an example. The trace in Datadog shows the pipeline runs successfully with no errors. I also ran it through iex directly with no resulting errors.

```elixir
def handle_batch(_batch_name, messages, _batch_info, _context) do
  {:ok, _trace} = Tracer.start_or_continue_trace("reconcile-internally-initiated-ach-transactions")

  {:ok, _result} = match_bank_transfer_items_to_tranasctions()

  {:ok, _result} =
    Multi.new()
    |> match_bank_transfer_items_to_account_activity("INPROGRESS")
    |> match_bank_transfer_items_to_account_activity("FAILED")
    |> ReconciliationStorage.transaction()

  {:ok, _trace} = Tracer.finish_trace()
  messages
end
```

@michaelst I wonder if the issue is in the batch code itself then. Please try the following:

`batch_info` should tell you information such as the size plus the trigger. If `batch_info.trigger == :size`, then `batch_info.size == 1000` and `length(messages) == 1000` should hold. Can you please send this metadata to your Datadog, or log a message if any of those constraints fail? (See the sketch below.)
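One way to log those checks, as a sketch (the log wording is illustrative; assumes `require Logger` at the module level):

```elixir
def handle_batch(_batch_name, messages, batch_info, _context) do
  # On a :size trigger the batch should be exactly full (batch_size: 1000 here).
  if batch_info.trigger == :size and
       (batch_info.size != 1000 or length(messages) != batch_info.size) do
    Logger.error(
      "unexpected batch: #{length(messages)} message(s), info: #{inspect(batch_info)}"
    )
  end

  # ... existing processing from the handler above ...
  messages
end
```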

From the graphs, it seems this does not always happen, only from time to time?

Yep I can start logging that and will report back once I get some data.

This is happening every time I am getting a batch of messages.

Oh, I see, it just does not necessarily happen frequently!

Yes we get one batch per day, usually in the morning

I added this log:

```elixir
Logger.info("handle_batch on #{length(messages)} message(s): #{inspect(batch_info)}")
```

This is what I eventually end up getting after publishing 11 messages. For more context, we have 3 k8s pods running, all connected to the same subscription, in case that could be part of the problem.

| Date | Host | Message |
|------|------|---------|
|2021-10-01T20:37:29.138Z|0579708a-ski7|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:34:56.614Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:33:02.254Z|0579708a-ski7|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:33:00.256Z|dff54640-m38o|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:31:16.785Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:29:28.518Z|faf1b956-nceh|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:29:26.511Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:27:49.840Z|faf1b956-nceh|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:27:47.837Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:27:46.836Z|dff54640-m38o|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:26:25.593Z|faf1b956-nceh|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:26:24.591Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:26:22.586Z|dff54640-m38o|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:25:00.934Z|dff54640-m38o|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:24:59.932Z|faf1b956-nceh|handle_batch on 3 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 3, trigger: :timeout}|
|2021-10-01T20:24:58.930Z|0579708a-ski7|handle_batch on 4 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 4, trigger: :timeout}|
|2021-10-01T20:23:43.404Z|0579708a-ski7|handle_batch on 1 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 1, trigger: :timeout}|
|2021-10-01T20:23:42.398Z|faf1b956-nceh|handle_batch on 7 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 7, trigger: :timeout}|
|2021-10-01T20:23:42.397Z|dff54640-m38o|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:22:28.567Z|dff54640-m38o|handle_batch on 3 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 3, trigger: :timeout}|
|2021-10-01T20:22:27.666Z|0579708a-ski7|handle_batch on 8 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 8, trigger: :timeout}|
|2021-10-01T20:21:14.747Z|dff54640-m38o|handle_batch on 5 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 5, trigger: :timeout}|
|2021-10-01T20:21:14.539Z|0579708a-ski7|handle_batch on 2 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 2, trigger: :timeout}|
|2021-10-01T20:21:14.349Z|faf1b956-nceh|handle_batch on 4 message(s): %Broadway.BatchInfo{batch_key: :default, batcher: :default, partition: nil, size: 4, trigger: :timeout}|

[screenshot: unacked message count stepping down to 1 over repeated delivery attempts]

It also seems strange that it took so many attempts to ack everything down to 1.

Hrm... if this were a bug in Broadway (say, dropped messages), then I would expect some variation in the number of unacked messages. The fact that it is always 1, even with three machines, starts to make me think the bug is elsewhere.

Are there any messages being marked as failed?

You also said you published 11 messages, but the batchers received much more than 33 messages altogether! What is your producer configuration? Are you using a stock producer? Can you try removing any producer configuration, hooks, or custom clients and see if the issue persists?

Oh, I think I know what happened: I increased the ack timeout and it then acked the message.

We had a 60-second ack timeout, but the pipeline was reporting that it finished in 20ms, so messages were getting redelivered after 60s. You can see the log message groupings spaced about 60s apart. Maybe the ack timeout needs to be higher than the batch timeout?

> Maybe the ack timeout needs to be higher than the batch timeout?

Yes, otherwise they will certainly be redelivered while we wait for the batch to form. Depending on the difference, we may even run into the situation where the same message is in the same batch twice? 🤔
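To make the relationship concrete, a configuration sketch (the module name, subscription, and values are illustrative, not from this thread): the client-side batch_timeout should stay well below the subscription's server-side ackDeadlineSeconds.

```elixir
Broadway.start_link(MyPipeline,
  name: MyPipeline,
  producer: [
    module:
      {BroadwayCloudPubSub.Producer,
       subscription: "projects/my-project/subscriptions/my-sub"}
  ],
  processors: [default: []],
  batchers: [
    default: [
      batch_size: 1000,
      # Batch assembly plus processing must finish well within the
      # subscription's server-side ackDeadlineSeconds (60s in this thread),
      # otherwise Pub/Sub redelivers the messages before they are acked.
      batch_timeout: 15_000
    ]
  ]
)
```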

In any case, if you want to submit a PR that adds this validation, it would most likely live here: https://github.com/dashbitco/broadway_cloud_pub_sub/blob/master/lib/broadway_cloud_pub_sub/producer.ex#L169

Or maybe improvements to the docs. :)

Would we know from the connection what the ack timeout on the Pub/Sub subscription is? It might be good to log an error if the two timeouts are the same. Or, even better, do we get an error back when trying to ack a message past the deadline that is maybe just being swallowed?

Oh, this is a server parameter? So unfortunately I don’t think there is any validation we can do besides adding some notes to the docs. :(

Yeah, that is a parameter you set up on the Google side.

However, do you know if we get an error back when trying to ack a message after the timeout?

I don't think so. If it failed, you would see something logged, and they also don't include this information in the successful response: https://cloud.google.com/pubsub/docs/reference/rest/v1/projects.subscriptions/acknowledge

The link above and others all say the response is empty.

Hi @michaelst, do you want to send a pull request to the batch_timeout docs in Broadway reminding folks to make sure their server configuration has enough message visibility/timeout? Or should I go ahead and do it?

I can, will do that now