dashbitco / flow

Computational parallel flows on top of GenStage

Home Page: https://hexdocs.pm/flow

bypassing to next partition (eliminating passthrough traffic)

sunaku opened this issue

Hello,

I would like to minimize the amount of passthrough traffic going through my flows, because the memory overhead of message passing (as each item is copied from one partition to the next) pushes the peak memory usage (VmHWM) of my app too high.

For example, here is a common use case found in my flows:

  • From a lazy stream of input items:
    • map each input item into many output items belonging to either category A or B
    • partition items by categories A and B:
      • allow category A items to pass through to the next partition (don't do anything)
      • emit_and_reduce category B items into output items belonging to category C
    • partition items by categories A and C:
      • allow category A items to pass through to the next partition (don't do anything)
      • emit_and_reduce category C items into output items belonging to category A
  • At this point, the flow only emits category A items!

I have many such flows (similar to the pattern described above) connected together.
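
To make that pattern concrete, here is a rough, self-contained Elixir sketch of one such flow. It is not my actual code: the `%{category: ..., value: ...}` item shape, the "batch of 10" rule, and the transformations are invented purely for illustration, and any partial accumulator left over when the flow ends is simply dropped here.

```elixir
# Rough sketch only -- item shape and the "batch of 10" rule are invented.
defmodule PassthroughSketch do
  def run(input) do
    input
    |> Flow.from_enumerable()
    # each input item becomes several :a / :b output items
    |> Flow.flat_map(fn n -> [%{category: :a, value: n}, %{category: :b, value: n}] end)
    # hop 1: :a items pass straight through, :b items are reduced into :c items
    |> Flow.partition(key: & &1.category)
    |> Flow.emit_and_reduce(fn -> 0 end, fn
      %{category: :a} = item, acc -> {[item], acc}
      %{category: :b, value: v}, acc when acc + v >= 10 -> {[%{category: :c, value: acc + v}], 0}
      %{category: :b, value: v}, acc -> {[], acc + v}
    end)
    # hop 2: :a items pass straight through again, :c items become :a items
    |> Flow.partition(key: & &1.category)
    |> Flow.emit_and_reduce(fn -> nil end, fn
      %{category: :a} = item, acc -> {[item], acc}
      %{category: :c, value: v}, acc -> {[%{category: :a, value: v}], acc}
    end)
    |> Enum.to_list()
  end
end
```

For example, `PassthroughSketch.run(1..1_000)` ends up emitting only category A items, as in the last bullet above, but every :a item still travels through both partitions.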

Since the topology and interconnection are materialized by Flow, I'm wondering whether there could be a way for me to give Flow a hint that certain items can bypass a given partition. For example, in the use case described above, I could pass an option to each Flow.partition() such as Flow.partition(bypass: &(&1.category == :A)), and that would effectively fast-track 🏃‍♂️💨 all category A traffic straight down to the bottom of the flow. 😇 Would this be possible?

Thanks for your consideration.

Why do you need to partition by category A and C? Is it because they have to be effectively grouped differently or is it because they have to be processed differently?

If it's the latter, then the best option is not to partition multiple times, but to have a single partition (or perhaps no partition at all) and then handle the different kinds of processing in a module, decoupled from Flow. In a nutshell, Flow should not be used to organize code, but rather to route the data.
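
For instance (a minimal sketch, reusing the invented item shape from the example above): the per-category logic lives in a plain module, and Flow only routes the data through a single partition.

```elixir
# Minimal sketch: category-specific logic lives in a plain module, decoupled from Flow.
defmodule ItemProcessor do
  def process(%{category: :a} = item), do: item
  def process(%{category: :b} = item), do: %{item | category: :a, value: item.value * 2}
end

defmodule SinglePartitionSketch do
  def run(input) do
    input
    |> Flow.from_enumerable()
    |> Flow.flat_map(fn n -> [%{category: :a, value: n}, %{category: :b, value: n}] end)
    |> Flow.partition(key: & &1.category)   # one partition -- or none at all
    |> Flow.map(&ItemProcessor.process/1)
    |> Enum.to_list()
  end
end
```

Here `SinglePartitionSketch.run(1..100)` does all the work in a single hop, so nothing has to pass through extra partitions.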

Other than that, you could implement fast-tracking by writing your own dispatcher with its own dispatch rules, but it is not something we plan to add to Flow out of the box.
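
For reference, a custom dispatcher means implementing the `GenStage.Dispatcher` behaviour. The skeleton below just delegates everything to the built-in `GenStage.PartitionDispatcher`; the `dispatch/3` callback is where fast-track routing rules would have to go. As far as I can tell, Flow does not expose an option to plug in such a dispatcher, so this likely means wiring the stages together with GenStage directly.

```elixir
# Skeleton only: as written, it behaves exactly like GenStage.PartitionDispatcher.
# Real fast-track rules would have to be implemented inside dispatch/3.
defmodule BypassDispatcher do
  @behaviour GenStage.Dispatcher

  defdelegate init(opts), to: GenStage.PartitionDispatcher
  defdelegate subscribe(opts, from, state), to: GenStage.PartitionDispatcher
  defdelegate cancel(from, state), to: GenStage.PartitionDispatcher
  defdelegate ask(demand, from, state), to: GenStage.PartitionDispatcher
  defdelegate info(msg, state), to: GenStage.PartitionDispatcher

  def dispatch(events, length, state) do
    # Inspect events here and apply custom routing before handing them off.
    GenStage.PartitionDispatcher.dispatch(events, length, state)
  end
end
```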

Yes, it's the second case: most of the input items can be immediately transformed into category A output items, but the rest require a bunch of additional processing before they ultimately become category A output items as well. So most of the traffic just passes through the downstream partitions, wasting time and memory.

Thanks for the tips! 👍 I shall try writing my own dispatcher and GenStage modules as you have suggested.

@sunaku but in this case, why do you partition? Couldn't you first transform category A into B and partition only then?

Sorry, I had left out some of the details earlier when describing the scenario: 😅 I need to use the second partition because all category B items need to be handled by the same reducer. See also #72 (comment)
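
A tiny sketch of what I mean, with the same invented item shape as above: because the partition hashes on the category key, every category B item lands on the same stage, i.e. a single reducer process sees all of them.

```elixir
# All :b items hash to the same stage, so exactly one reducer accumulates category B.
items = for n <- 1..100, do: %{category: Enum.random([:a, :b]), value: n}

items
|> Flow.from_enumerable()
|> Flow.partition(key: & &1.category, stages: 4)
|> Flow.emit_and_reduce(fn -> 0 end, fn
  %{category: :a} = item, acc -> {[item], acc}
  %{category: :b, value: v}, acc -> {[], acc + v}
end)
|> Enum.to_list()
```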