googleapis / nodejs-paginator

A result paging utility used by Google node.js modules


split-array-stream is slow

callmehiphop opened this issue · comments

Truthfully I'm not sure if I should be filing an issue here or over at split-array-stream, but I figured I would start here and see what everyone else thinks.

I recently started work on a BigQuery feature (googleapis/nodejs-bigquery#484) that was supposed to be a big performance boost for dealing with large result sets. However, what I ended up finding was that it was significantly slower, which threw me off quite a bit. Debugging led me to the paginator: it would appear that one side effect of my BQ refactor was that the individual page size of a request went from about 18k results to 60k results. Interestingly enough, it only takes about 40~50ms to process 18k results, but a staggering 3-5 seconds to process 60k results.

My guess is that using setImmediate slows down the process pretty drastically. I tested a local patch where, instead of using the split stream, I used a PassThrough that I wrote to in a for loop; whenever write() returned false, I waited for the drain event before continuing the loop. This allowed me to process 60k results in about 40ms, which leads me to think that we should not be using setImmediate.
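
Roughly the shape of that patch (a sketch only; pushPage and the surrounding wiring are illustrative, not the actual paginator code):

const {PassThrough} = require('stream');

// Write each result into the user-facing stream and honor backpressure
// manually: pause whenever write() returns false and resume on 'drain'.
async function pushPage(results, userStream) {
  for (const result of results) {
    if (!userStream.write(result)) {
      await new Promise(resolve => userStream.once('drain', resolve));
    }
  }
}

const userStream = new PassThrough({objectMode: true});
userStream.on('data', result => { /* consume results */ });
// pushPage(resultsFromApi, userStream);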

Let's fix split array stream!

@callmehiphop I'm guessing the goal here was to avoid blocking the event loop when processing a large list of data?

This issue sounds similar to this article which @sduskis shared recently.

I agree, if it's just data being pushed into the stream, I think setImmediate is probably just a source of overhead.

So I took a crack at a refactor, but in doing so it dawned on me that in order to get the behavior to where I think we want it for the paginator, we would be making a lot of fundamental changes, and maybe that's no good.

The big realization I had was that split-array-stream doesn't really provide a way to manage flow control. You can give it a large number of items (60k, for example) and it notifies you when it is done pushing, not when it has room for more items. As the paginator is today, it makes subsequent requests when the final push is done. This actually sounds like a potential memory leak to me since we are potentially pushing more results before the user has even finished processing the previous page.

I'm not really sure how I feel about all of this, but my gut says maybe we shouldn't use it as a dependency at all. Anyone have any thoughts or opinions on that idea?

@callmehiphop before we just go ditchin' stuff, let's go back to what the problem split-array-stream is trying to solve :) If there's a different module out there that serves the same purpose, but does it better, we should deprecate split-array-stream and move on to the good one.

Here's a writeup I did over on the PR-- hopefully this can help gather ideas on the right approach (source: stephenplusplus/split-array-stream#3 (comment)):

This concept has always been a hard problem to solve. I'm going to repeat it to make sure I'm saying it right--

We have a readable stream in object mode that spits out multiple items within a single data event. We essentially want to flatten them and emit them one by one to the sink stream.

So we have the source stream, and we make a sink stream. In the data handler of the source stream, we split the array into multiple data events, and push them to the sink stream.

The user is holding the sink stream. We need to know that it is ready to receive more data before we keep pushing relentlessly. It could decide half way through a single data event from the source stream that it needs a break. So we need to check over and over "we good? Still ready?" between each write to the sink.

If we can't write, we wait for the drain event to let us know the sink is ready. But the sink could have its own problems, and then we are left with objects in memory within this data handler that never have a home. The memory used by this logic is locked up and never releasable because we never gracefully exited the loop.
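
In code, the check-and-wait loop being described looks roughly like this (a sketch; source and sink stand in for the API stream and the user-facing stream):

source.on('data', async page => {
  source.pause(); // don't accept another page while this one is in flight
  for (const item of page) {
    if (!sink.write(item)) {
      // "We good? Still ready?" -- wait for the sink to drain. If the sink
      // errors or is destroyed instead, this promise never settles and the
      // rest of `page` stays referenced by this closure: the locked-up,
      // never-releasable memory described above.
      await new Promise(resolve => sink.once('drain', resolve));
    }
  }
  source.resume();
});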

I suppose the worst that happens if we skip the flow control logic is we end up with <=15 (assuming a highWaterMark of 16) objects in memory. The source stream itself is already backpressured, so no excessive data events will be handled. I know there are places we parallelize source streams, so it's really <=15 * numParallel.

Is there a solution for this? Is flattening a data event array just not possible without these side effects?

@stephenplusplus I kind of don't think we even need a module here. IMO the simplest solution might just be to extend the Readable class and only fetch a page whenever _read() is called.

Will the user see what we have now:

storage.getBuckets()
  .on('data', bucket => {
    // bucket = a single Bucket object
  })

Or:

storage.getBuckets()
  .on('data', buckets => {
    // buckets = a variable length array of buckets (a 'page')
  })

If you're thinking we can still keep the same system we have now, what about split-array-stream would be different from your solution? What I see as the problem is that an API response page could include 100 things, but the stream consumer might only be able to handle 20 at a time. If we push all 100 at once, we could overfill the buffer. If we try to do 20 at a time, then we get into the issues I was describing earlier about the stream consumer having complications that lead to us never releasing the remaining items we have in our stream.

I'm definitely not suggesting that we make a breaking change. I think overfilling the buffer should be fine, IMHO. Doing it that way should prevent _read() from being called too soon and avoids us needing to tap into drain.

I don't think you're proposing anything different than SAS. I must be missing something. Halp.

@stephenplusplus maybe I'm missing something, please let me know if I'm overlooking any details.

AFAIK we use SAS to do two things:

  1. Take a page of results and push each individual item into a stream
  2. Track when all items have been pushed (not consumed) via promise to fetch another page

Currently it appears the closest thing we have to a flow control mechanism is the use of setImmediate within SAS. My understanding is that if the user is slow to process the results, paginator will continue to push more and more results because it is using SAS's promise as an indicator to fetch subsequent pages. IMO this is a problem because there is nothing to stop paginator/SAS from overfilling the stream buffer, potentially causing a memory leak. Removing the setImmediate call on top of that seems like it might make the issue worse since there would be an even smaller delay between requests.
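
To make that concrete, this is roughly the shape of today's flow as I understand it (a sketch; makeRequest is a placeholder and the split-array-stream signature is from memory):

const {split} = require('split-array-stream');

// The promise resolves once every item has been *pushed* into the user's
// stream, and that resolution is what triggers the next request -- nothing
// here waits for the consumer to actually catch up.
function fetchPage(query, userStream) {
  makeRequest(query, (err, results, nextQuery) => {
    if (err) return userStream.destroy(err);
    split(results, userStream).then(streamEnded => {
      if (streamEnded) return;
      if (nextQuery) {
        fetchPage(nextQuery, userStream);
      } else {
        userStream.end();
      }
    });
  });
}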

Since this is all an implementation detail, IMO the obvious solution would be to wait until all the results have been consumed (not pushed) before fetching additional pages. This would allow us to remove setImmediate without worrying about a memory leak. The reason I suggested removing SAS, is because it does not offer this functionality and I'm not sure if it should. Ultimately if that kind of behavior change was made, I'd probably lean on you as the maintainer to make that change.

Nope, you're right. I misunderstood my own module, apparently. I thought we were returning a stream that was sandwiched between the source and the destination. Under that assumption, the destination stream would be pulling from the SAS stream.

Could you pseudo-code the way you would split the results into a destination stream from a library, without SAS?

@stephenplusplus sure, I'm probably missing a couple odds and ends, but overall this is how I would probably approach it.

const {Readable} = require('stream');

class PageStream extends Readable {
  constructor(requestFn) {
    super({objectMode: true});
    this.request = requestFn;
    this.nextQuery = {};
    this.pending = [];
  }
  async _read() {
    // Only fetch another page once the previous one has been fully consumed.
    if (!this.pending.length) {
      if (!this.nextQuery) {
        this.push(null);
        return;
      }

      try {
        const [results, nextQuery] = await this.request(this.nextQuery);
        this.pending = results;
        this.nextQuery = nextQuery;
      } catch (err) {
        this.destroy(err);
        return;
      }
    }

    // Push buffered results until the internal buffer is full or we run out.
    let more = true;

    while (more && this.pending.length) {
      more = this.push(this.pending.shift());
    }

    if (more) {
      process.nextTick(() => this._read());
    }
  }
}
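
For illustration, wiring that sketch up to a fake data source might look something like this (getPage is hypothetical; a real implementation would call the API's list method):

// Hypothetical request function: resolves to [results, nextQueryOrNull].
// Here it fabricates three pages of data instead of calling a real API.
let pageNumber = 0;
async function getPage(query) {
  const results = [`a-${pageNumber}`, `b-${pageNumber}`, `c-${pageNumber}`];
  pageNumber += 1;
  return [results, pageNumber < 3 ? {pageNumber} : null];
}

new PageStream(getPage)
  .on('data', item => console.log(item))
  .on('end', () => console.log('all pages consumed'));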

That looks like a cool idea. It would make sense to just plop that here in nodejs-paginator as a reusable utility. If we want to evolve split-array-stream, I expanded on your code here. Please take a look and let me know your thoughts.

@stephenplusplus something I've been struggling with is our decision to use Transform streams. In your gist you override _read, which I think will break Transform#write, causing anything that is written to the stream to get lost. It is probably unlikely that users would write to the stream, but I think it would be unwise to make decisions on that assumption.

Could you expand on that potential hazard, showing an example?

something I've been struggling with is our decision to use Transform streams.

You mean in general? Why is that? I believe it makes sense in my case where it takes an array and transforms it into single items.

There is another problem with both of our examples, where we're actually breaking the Node docs and calling the underlying _read/_transform methods ourselves.

@stephenplusplus Transform streams use the _read method to take any data written via write and pass it to the _transform function (src).

I believe it makes sense in my case where it takes an array and transforms it into single items.

Sure, that makes sense. I think your gist gets fuzzy for me because the fetching of an array should probably come from another stream and get piped into the split stream.

This is not why I struggle with it; we return it (a Transform) to the user, but in most cases the write functionality won't be used for anything. Furthermore, we use end() as an indicator to stop reading, whereas it's a Writable method and is usually used to indicate that no more data should be written.
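
For example, the usual way a user stops one of these paginated streams early leans on the Writable side (using @google-cloud/storage's getBucketsStream here just as an example):

const {Storage} = require('@google-cloud/storage');
const storage = new Storage();

let count = 0;
storage.getBucketsStream()
  .on('data', function(bucket) {
    if (++count === 10) {
      // end() is a Writable method ("no more writes"), but here it doubles
      // as the signal to stop fetching and reading further pages.
      this.end();
    }
  });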

It does take data piped to it. The gist shows two options for how it could be used. And moreover, this is an example implementation for SAS, so if you have improvements, please!