dashbitco / broadway

Concurrent and multi-stage data ingestion and data processing with Elixir

Home Page: https://elixir-broadway.org

Byte size batching

cleaton opened this issue · comments

Hi!
I am looking to use Broadway for a use case where messages vary in byte size, and I would like to cap each batch at a byte size limit in addition to the count limit.

Are there recommended techniques for such use cases?

It is not supported at the moment. Someone would need to make the current logic configurable.

Oh, to be clear, a PR to add this feature is welcome!

I'm thinking we could generalize the current "count" batching for this. Time-based batching (intervals) is good as is in my opinion, but byte-based batching falls into the category of: I got this message, should I stop and form a batch now or keep adding messages to the batch?

Right now the API is

batchers: [
  my_batcher: [batch_timeout: 1000, batch_size: 100]
]

What if we go with something like this?

batchers: [
  my_batcher: [batch_timeout: 1000, batch_wrapping: {_initial_acc = 0, &wrap_batch_by_byte_size/2}]
]

defp wrap_batch_by_byte_size(byte_size, message) do
  new_size = byte_size + calculate_byte_size(message)

  if new_size > @max_byte_size do
    :wrap
  else
    {:cont, new_size}
  end
end

I'm not too in love with the API, but you get the idea. Then the current count-based batching (batch_size) can be reimplemented on top of this:

batch_wrapping: {0, fn count, _ -> if count == batch_size + 1, do: :wrap, else: {:cont, count + 1} end}

Yup, pretty much what I had in mind, except the accumulator is the second argument and instead of :wrap, we can call it :done. Should we just allow batch_size to be a function as well, instead of introducing a new option? I.e. the batch_size is controlled by the given function?

@josevalim yes those are great calls, agreed on all of them.

So let's go with this:

@type batch_size() ::
        non_neg_integer() # ← what we have today
        | {initial_acc :: term(), (Broadway.Message.t(), acc :: term() -> :done | {:cont, new_acc :: term()})} # ← the new one

My example would turn into this

batch_size: {0, &wrap_batch_by_byte_size/2}

defp wrap_batch_by_byte_size(message, byte_size) do
  new_size = byte_size + calculate_byte_size(message)

  if new_size > @max_byte_size do
    :done
  else
    {:cont, new_size}
  end
end
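
To make the proposed contract concrete, here is a minimal standalone sketch of how a batcher might fold over messages with an `{initial_acc, fun}` pair. It is not Broadway's implementation: `BatchSketch` and `split/2` are hypothetical names, the messages are plain binaries rather than `Broadway.Message` structs, and it assumes that the message which triggers `:done` opens the next batch (the proposal above does not pin that down).

```elixir
defmodule BatchSketch do
  # Hypothetical helper, not part of Broadway. Splits `messages` into batches
  # using an `{initial_acc, fun}` pair shaped like the proposed :batch_size.
  # Assumption: the message that triggers :done opens the next batch.
  def split(messages, {initial_acc, fun}) do
    {batches, current, _acc} =
      Enum.reduce(messages, {[], [], initial_acc}, fn message, {batches, current, acc} ->
        case fun.(message, acc) do
          {:cont, new_acc} ->
            {batches, [message | current], new_acc}

          :done ->
            batches = close(batches, current)

            # Re-run the function from the initial accumulator so the new
            # batch accounts for the message that did not fit.
            case fun.(message, initial_acc) do
              {:cont, new_acc} -> {batches, [message], new_acc}
              # A single message over the limit is emitted on its own.
              :done -> {[[message] | batches], [], initial_acc}
            end
        end
      end)

    batches |> close(current) |> Enum.reverse()
  end

  # Batches are built in reverse; skip empty ones when closing.
  defp close(batches, []), do: batches
  defp close(batches, current), do: [Enum.reverse(current) | batches]
end

max_bytes = 10

# Byte-based batching: stop once adding the message would exceed max_bytes.
by_bytes = fn message, bytes ->
  new_bytes = bytes + byte_size(message)
  if new_bytes > max_bytes, do: :done, else: {:cont, new_bytes}
end

# Count-based batching (at most 2 messages) expressed with the same contract.
by_count = fn _message, count ->
  if count == 2, do: :done, else: {:cont, count + 1}
end

byte_batches =
  BatchSketch.split(["aaaa", "bbbb", "cccc", "dddddddddddd", "ee"], {0, by_bytes})

count_batches = BatchSketch.split(["a", "b", "c", "d", "e"], {0, by_count})
```

Under these assumptions, `byte_batches` keeps every batch at or under 10 bytes except for the single oversized message, which is emitted alone, and `count_batches` shows that the existing count limit fits the same shape.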

Sounds good?

Ship it!

@cleaton want to work on a PR, or want me to tackle this? I'll have some time this weekend but if you want to work on this, I'll be happy to wait 🙃

@josevalim @whatyouhide Sorry to jump in; I was just wandering around the issues to see if I could help with anything.

If, as suggested, :batch_size can also be a function, how do we set max_demand when we initialize the BatcherStage and subscribe to the Processor? Right now max_demand reuses :batch_size.

Good point, we will probably need to support :max_demand to be given as an option too and use the same default we use throughout Broadway.

@josevalim I am looking into the implementation in batcher_stage. However, I do not understand the purpose of trigger in the function wrap_for_delivery: is it just an indicator of the delivery reason (:size/:timeout/:flush)? Also, it seems the size in BatchInfo could be calculated simply as size: length(reversed_events)?

@josevalim @whatyouhide Please review whether the implementation in the PR is acceptable.