caolan / highland

High-level streams library for Node.js and the browser

Home Page: https://caolan.github.io/highland

run an arbitrary number of aggregate functions on a stream

terebentina opened this issue

I need to run some aggregating functions on a stream of data coming from MongoDB. I do not know in advance which aggregators I need to run - the list is dynamic. The aggregators should receive a stream and then count the number of rows where a certain field exists, get the min/max/mean of a field, etc. from that stream. I guess they should also return the stream, not the result.

My idea was to do a fork for each aggregator and run the aggregator on the fork.
Something like:

const data = _(stream);
// aggregatorList = { countA: fnToCountA, sumB: fnToSumB };
const forks = Object.keys(aggregatorList).reduce(
  (acc, aggId) => Object.assign(acc, { [aggId]: data.fork() }),
  {}
);

_(Object.keys(aggregatorList))
  .flatMap((aggId) => aggregatorList[aggId](forks[aggId]));

What is the best way to split a stream for multiple such "consumers"?

Forking to arbitrary consumers is tricky, because there is a real possibility of deadlock if you allow your aggregators to do whatever they want.

For example, this code will deadlock:

const data = _(stream);
const fork1 = data.fork();
const fork2 = data.fork();

fork1.take(1).each(console.log);
fork2.take(2).each(console.log);

The above will print the first value of data twice, but it will not print the second value. This is because fork1 only needs the first value (because of take(1)), so it never requests the second one. Since forks share backpressure, this blocks fork2 from ever receiving the second value. This is a known issue in version 2.x, which we hope to fix in the in-progress 3.0, but it's more complicated than it seems.
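
For comparison, here is a minimal sketch (assuming Highland 2.x and a small in-memory source) where no deadlock occurs, because both forks consume every value:

const _ = require('highland');

const data = _([1, 2, 3]);
const fork1 = data.fork();
const fork2 = data.fork();

// Both forks request every value, so the shared backpressure never
// blocks: each value is emitted once both forks have asked for it.
fork1.each(x => console.log('fork1:', x));
fork2.each(x => console.log('fork2:', x));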

To get around this, you need to somehow guarantee that all aggregators consume all of the data. One way to do this is to implement each aggregator as a reduce handler and dispatch to all of them in your code. For example,

function countA(count, record) {
  if (count == null) {
    count = 0;
  }

  if (record.a) {
    return count + 1;
  }
  return count;
}

function sumB(sum, record) {
  if (sum == null) {
    sum = 0;
  }
  return sum + record.b;
}

const aggregatorList = { countA, sumB };
const keys = Object.keys(aggregatorList);

_(stream).reduce({}, (memo, record) => {
  keys.forEach(key => {
    const aggregator = aggregatorList[key];
    memo[key] = aggregator(memo[key], record);
  });
  return memo;
}).each(console.log);
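
For illustration, with a hypothetical in-memory array standing in for the MongoDB stream (the field names and values are made up), the reduce above logs a single object containing every aggregate:

// Hypothetical sample records standing in for the MongoDB source.
const stream = [{ a: true, b: 2 }, { b: 3 }, { a: true, b: 5 }];

// With this input, the code above logs:
// { countA: 2, sumB: 10 }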

Huh, this seems like a simple and elegant solution...thanks Victor.

Just out of curiosity: would observe instead of fork have worked without the deadlock?
Although, without backpressure support, and with some very slow aggregators, I suppose it would have consumed a lot of memory!

Yes, observe would work, but you would have the exact memory issue that you describe.
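
For completeness, a minimal sketch of the observe variant (assuming Highland 2.x; the aggregation pipelines here are only illustrative):

const data = _(stream);

// observe() taps the stream without exerting backpressure, so slow
// observers buffer values in memory rather than slowing down the source.
data.observe()
  .reduce(0, (count, record) => (record.a ? count + 1 : count))
  .each(count => console.log('countA:', count));

data.observe()
  .reduce(0, (sum, record) => sum + record.b)
  .each(sum => console.log('sumB:', sum));

// The source still needs its own consumer; observers only receive values
// as the source stream is actually consumed.
data.each(() => {});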