Byte size batching

Question

cleaton opened this issue 3 years ago

Hi!
I am looking to use Broadway for a use case where messages vary in byte size, and I would like to limit each batch to a maximum byte size in addition to the count limit.

Are there recommended techniques for such use cases?

José Valim commented 3 years ago

Ship it!

José Valim · Answer 1 · Thu Jul 29 2021 16:36:15 GMT+0800 (China Standard Time)

It is not supported at the moment. Someone would need to make the current logic configurable.

José Valim · Answer 2 · Sat Sep 25 2021 14:49:02 GMT+0800 (China Standard Time)

Oh, to be clear, a PR to add this feature is welcome!

Andrea Leopardi · Answer 3 · Fri Oct 08 2021 16:01:32 GMT+0800 (China Standard Time)

I'm thinking we could generalize the current "count" batching for this. Time-based batching (intervals) is good as is in my opinion, but byte-based batching falls into the category of: I got this message, should I stop and form a batch now or keep adding messages to the batch?

Right now the API is

batchers: [
  my_batcher: [batch_timeout: 1000, batch_size: 100]
]

What if we go with something like this?

batchers: [
  my_batcher: [batch_timeout: 1000, batch_wrapping: {_initial_acc = 0, &wrap_batch_by_byte_size/2}]
]

defp wrap_batch_by_byte_size(byte_size, message) do
  new_size = byte_size + calculate_byte_size(message)

  if new_size > @max_byte_size do
    :wrap
  else
    {:cont, new_size}
  end
end

I'm not too in love with the API, but you get the idea. Then the current count-based batch_size can be reimplemented on top of this:

batch_wrapping: {0, fn count, _message -> if count == batch_size + 1, do: :wrap, else: {:cont, count + 1} end}

José Valim · Answer 4 · Fri Oct 08 2021 17:16:29 GMT+0800 (China Standard Time)

Yup, pretty much what I had in mind, except the accumulator is the second argument and instead of :wrap, we can call it :done. Should we just allow batch_size to be a function as well, instead of introducing a new option? I.e. the batch_size is controlled by the given function?

Andrea Leopardi · Answer 5 · Fri Oct 08 2021 17:26:07 GMT+0800 (China Standard Time)

@josevalim yes those are great calls, agreed on all of them.

So let's go with this:

@type batch_size() ::
        non_neg_integer() # ← what we have today
        | {initial_acc :: term(), (Broadway.Message.t(), acc :: term() -> :done | {:cont, new_acc :: term()})} # ← the new one

My example would turn into this

batch_size: {0, &wrap_batch_by_byte_size/2}

defp wrap_batch_by_byte_size(message, byte_size) do
  new_size = byte_size + calculate_byte_size(message)

  if new_size > @max_byte_size do
    :done
  else
    {:cont, new_size}
  end
end

Sounds good?
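
A minimal sketch of how a batcher could apply the `{initial_acc, fun}` contract proposed above. This is not Broadway's implementation: `BatchSketch`, its `Message` struct, `batches/2`, and the 10-byte limit are hypothetical names and values chosen for illustration. The sketch also assumes that the message for which the function returns `:done` is included in the batch it closes, which the thread does not specify.

```elixir
defmodule BatchSketch do
  # Hypothetical stand-in for Broadway.Message; only :data is used here.
  defmodule Message do
    defstruct [:data]
  end

  # Illustrative limit, not a Broadway default.
  @max_byte_size 10

  # Reducer in the proposed shape: (message, acc) -> :done | {:cont, new_acc}.
  def wrap_batch_by_byte_size(%Message{data: data}, acc_bytes) do
    new_size = acc_bytes + byte_size(data)
    if new_size > @max_byte_size, do: :done, else: {:cont, new_size}
  end

  # Folds a list of messages into batches.
  # ASSUMPTION: the message that yields :done is included in the batch it
  # closes, and the accumulator is then reset to initial_acc.
  def batches(messages, {initial_acc, fun}) do
    {done, current, _acc} =
      Enum.reduce(messages, {[], [], initial_acc}, fn message, {done, current, acc} ->
        case fun.(message, acc) do
          :done -> {[Enum.reverse([message | current]) | done], [], initial_acc}
          {:cont, new_acc} -> {done, [message | current], new_acc}
        end
      end)

    # In a real pipeline, leftover messages would wait for batch_timeout.
    leftover = if current == [], do: [], else: [Enum.reverse(current)]
    Enum.reverse(done) ++ leftover
  end
end

messages = for data <- ["aaaa", "bbbb", "cccc", "dd"], do: %BatchSketch.Message{data: data}
result = BatchSketch.batches(messages, {0, &BatchSketch.wrap_batch_by_byte_size/2})
```

Under that assumption a batch can exceed the limit by one message. If the limit must be a hard cap, the overflowing message would instead have to start the next batch.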

Andrea Leopardi · Answer 6 · Fri Oct 08 2021 21:54:57 GMT+0800 (China Standard Time)

@cleaton want to work on a PR, or want me to tackle this? I'll have some time this weekend but if you want to work on this, I'll be happy to wait 🙃

Ken Chen · Answer 7 · Tue Feb 15 2022 22:06:39 GMT+0800 (China Standard Time)

@josevalim @whatyouhide Sorry to jump in. I was just browsing the issues to see if I could help with anything.

If, as suggested, :batch_size can also be a function, how do we set max_demand when we initialize the BatcherStage and subscribe to the Processor? Right now max_demand reuses :batch_size.

José Valim · Answer 8 · Tue Feb 15 2022 22:11:56 GMT+0800 (China Standard Time)

Good point. We will probably need to accept :max_demand as an option too, with the same default we use throughout Broadway.
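
A sketch of what the batcher options might look like if both suggestions were adopted. The option shapes follow the proposal in this thread and are not confirmed here as Broadway's API. The reducer, the 10-byte limit, and the `max_demand: 100` value are illustrative assumptions, and messages are modeled as plain maps with a `:data` field.

```elixir
# Hypothetical byte limit for illustration only.
max_byte_size = 10

# Reducer in the proposed shape: (message, acc) -> :done | {:cont, new_acc}.
byte_size_reducer = fn message, acc_bytes ->
  new_size = acc_bytes + byte_size(message.data)
  if new_size > max_byte_size, do: :done, else: {:cont, new_size}
end

batchers = [
  my_batcher: [
    batch_timeout: 1000,
    # Proposed function form of :batch_size.
    batch_size: {0, byte_size_reducer},
    # Set explicitly, because no integer batch size is left to derive demand from.
    max_demand: 100
  ]
]
```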

Ken Chen · Answer 9 · Thu Feb 17 2022 21:03:17 GMT+0800 (China Standard Time)

@josevalim I am looking into the implementation in batcher_stage, but I do not understand how trigger is used in the function wrap_for_delivery. Is it just an indicator of the delivery reason (:size/:timeout/:flush)? Also, it seems the size in BatchInfo could be calculated simply as size: length(reversed_events)?

Ken Chen · Answer 10 · Mon Feb 21 2022 10:12:59 GMT+0800 (China Standard Time)

@josevalim @whatyouhide Please review whether the implementation in the PR is acceptable.