Byte size batching
cleaton opened this issue · comments
Hi!
I am looking to use Broadway for a use case where each message has a varying byte size, and I would like to limit each batch to a maximum byte size in addition to the count limit.
Are there recommended techniques for such use cases?
It is not supported at the moment. Someone would need to make the current logic configurable.
Oh, to be clear, a PR to add this feature is welcome!
I'm thinking we could generalize the current "count" batching for this. Time-based batching (intervals) is good as is in my opinion, but byte-based batching falls into the category of: I got this message, should I stop and form a batch now or keep adding messages to the batch?
Right now the API is
batchers: [
  my_batcher: [batch_timeout: 1000, batch_size: 100]
]
What if we go with something like this?
batchers: [
  my_batcher: [batch_timeout: 1000, batch_wrapping: {_initial_acc = 0, &wrap_batch_by_byte_size/2}]
]
defp wrap_batch_by_byte_size(byte_size, message) do
  new_size = byte_size + calculate_byte_size(message)
  if new_size > @max_byte_size do
    :wrap
  else
    {:cont, new_size}
  end
end
I'm not too in love with the API, but you get the idea. Then the current count-based batch_size can be reimplemented on top of this:
batch_wrapping: {0, fn count, _ -> if count == batch_size + 1, do: :wrap, else: {:cont, count + 1} end}
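To make the count-based case concrete, here is a minimal sketch that mirrors the byte-size function above: it adds one to the running count and wraps once the new count would exceed the limit. The module name `CountRuleSketch` and the function `count_rule/1` are hypothetical, and the argument order (accumulator first) and the `:wrap` return follow the shape proposed at this point in the thread, not a released Broadway API.

```elixir
defmodule CountRuleSketch do
  # Hypothetical helper: builds a two-argument function in the shape
  # proposed above, (acc, message) -> :wrap | {:cont, new_acc}.
  def count_rule(batch_size) when is_integer(batch_size) and batch_size > 0 do
    fn count, _message ->
      new_count = count + 1

      # Same structure as the byte-size example: wrap once the limit
      # would be exceeded, otherwise carry the new count forward.
      if new_count > batch_size, do: :wrap, else: {:cont, new_count}
    end
  end
end
```

With a limit of 3, the function keeps returning `{:cont, n}` until the accumulator reaches 3, and returns `:wrap` on the next call.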
Yup, pretty much what I had in mind, except the accumulator is the second argument and, instead of :wrap, we can call it :done. Should we just allow batch_size to be a function as well, instead of introducing a new option? I.e. the batch size is controlled by the given function?
@josevalim yes those are great calls, agreed on all of them.
So let's go with this:
@type batch_size() ::
        non_neg_integer() # ← what we have today
        | {initial_acc :: term(), (Broadway.Message.t(), acc :: term() -> :done | {:cont, new_acc :: term()})} # ← the new one
My example would turn into this
batch_size: {0, &wrap_batch_by_byte_size/2}
defp wrap_batch_by_byte_size(message, byte_size) do
  new_size = byte_size + calculate_byte_size(message)
  if new_size > @max_byte_size do
    :done
  else
    {:cont, new_size}
  end
end
Sounds good?
Ship it!
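For anyone following along, here is a rough, self-contained sketch of how the `{initial_acc, fun}` contract discussed above could split a stream of messages by byte size. It does not use Broadway at all: `ByteBatchingSketch`, `simulate/2`, and the 10-byte limit are hypothetical, and messages are plain binaries instead of `Broadway.Message` structs. The thread does not say what happens to the message that triggers `:done`, so the sketch assumes it starts the next batch, and that a single message larger than the limit is emitted on its own.

```elixir
defmodule ByteBatchingSketch do
  # Hypothetical limit, chosen only for illustration.
  @max_byte_size 10

  # Callback shape as proposed in this thread:
  # (message, acc) -> :done | {:cont, new_acc}
  def wrap_batch_by_byte_size(message, acc_bytes) do
    new_size = acc_bytes + calculate_byte_size(message)

    if new_size > @max_byte_size do
      :done
    else
      {:cont, new_size}
    end
  end

  # Assumption: messages are plain binaries in this sketch.
  def calculate_byte_size(message) when is_binary(message), do: byte_size(message)

  # Folds over the messages and returns the list of batches the callback
  # would produce under the assumptions stated above.
  def simulate(messages, {initial_acc, fun}) do
    {batches, current, _acc} =
      Enum.reduce(messages, {[], [], initial_acc}, fn message, {batches, current, acc} ->
        case fun.(message, acc) do
          {:cont, new_acc} ->
            {batches, [message | current], new_acc}

          :done ->
            batches = close_batch(batches, current)

            # Assumption: the triggering message seeds the next batch.
            case fun.(message, initial_acc) do
              {:cont, new_acc} -> {batches, [message], new_acc}
              # A message over the limit by itself is emitted alone.
              :done -> {close_batch(batches, [message]), [], initial_acc}
            end
        end
      end)

    batches |> close_batch(current) |> Enum.reverse()
  end

  # Batches are accumulated in reverse, so restore message order on close.
  defp close_batch(batches, []), do: batches
  defp close_batch(batches, current), do: [Enum.reverse(current) | batches]
end
```

Calling `ByteBatchingSketch.simulate(["aaaa", "bbbb", "cccc", "dd", "eeeeeeeeeeee"], {0, &ByteBatchingSketch.wrap_batch_by_byte_size/2})` gives `[["aaaa", "bbbb"], ["cccc", "dd"], ["eeeeeeeeeeee"]]` under these assumptions.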
@cleaton want to work on a PR, or want me to tackle this? I'll have some time this weekend but if you want to work on this, I'll be happy to wait
@josevalim @whatyouhide Sorry to jump in; I was just browsing the issues to see if I could help with anything.
If, as suggested, :batch_size can also be a function, how do we set max_demand when we initialize the BatcherStage and subscribe to the Processor? Right now max_demand reuses :batch_size.
Good point, we will probably need to support :max_demand to be given as an option too and use the same default we use throughout Broadway.
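One possible reading of that suggestion, as a sketch only: an explicit :max_demand wins, an integer :batch_size keeps acting as the demand like today, and the function form falls back to a default. `MaxDemandSketch`, `resolve_max_demand/1`, and the default of 10 are hypothetical placeholders; the thread does not state the actual default value, and this is not Broadway's implementation.

```elixir
defmodule MaxDemandSketch do
  # Placeholder value for illustration; the real default is not given here.
  @default_max_demand 10

  # Resolves the demand for a batcher from its keyword options.
  def resolve_max_demand(batcher_opts) do
    case {Keyword.fetch(batcher_opts, :max_demand), Keyword.get(batcher_opts, :batch_size)} do
      # An explicit :max_demand always takes precedence.
      {{:ok, max_demand}, _batch_size} ->
        max_demand

      # Integer batch size: reuse it as the demand, as happens today.
      {:error, batch_size} when is_integer(batch_size) ->
        batch_size

      # Function form: no integer to reuse, so fall back to the default.
      {:error, {_initial_acc, fun}} when is_function(fun, 2) ->
        @default_max_demand
    end
  end
end
```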
@josevalim I am looking into the implementation in batcher_stage. However, I do not understand the purpose of trigger in the function wrap_for_delivery. Is it just an indicator of the delivery reason (:size/:timeout/:flush)? Also, it seems the size in BatchInfo could be calculated simply as size: length(reversed_events)?
@josevalim @whatyouhide Please review whether the implementation in the PR is acceptable.