caolan / highland

High-level streams library for Node.js and the browser

Home Page: https://caolan.github.io/highland

Can't fork stream to multiple destinations (aka write streams)

THEtheChad opened this issue · comments

This may be a very specific use case, but it's one I've been wrestling with for a couple of days. Here's a bit of pseudocode that, hopefully, sums up the issue.

import hl from 'highland'
import api from 'api'
import db from 'db'

const stream1 = hl(api.get(url))
const stream2 = stream1.fork().flatMap(parent => api.get(parent.child.url))
const stream3 = stream2.fork().flatMap(parent => api.get(parent.child.url))
const stream4 = stream3.fork().flatMap(parent => api.get(parent.child.url))

stream1.pipe(db.insert())
stream2.pipe(db.insert())
stream3.pipe(db.insert())
stream4.pipe(db.insert())

I specifically want to fork so that I get back pressure from all of the streams in the chain. Each stream depends on data from the previous and each step adds more data to the pipeline in an exponential fashion, so buffering using observe is not an option.

Ultimately, the problem I run into is that the first pipe consumes the stream and subsequent pipes error out.

Usually this kind of thing happens when the pipe from stream1 executes before the forks for the later streams are registered, so the initial stream empties earlier than it should.

Can you provide a test case that fails? The example you provide fails with this error: Error: Stream already being consumed, you must either fork() or observe(), because you're not allowed to consume a stream (e.g., via pipe) after you fork it. It's likely that what you're seeing is the result of some sort of data race, and it's hard for me to debug without a repro.

Here's my attempt at replicating your use case (the parseInt gymnastics is to keep the number and order of transforms the same as your example).

const hl = require('highland');

function delay(xs) {
  let i = 0;
  return hl((push, next) => {
    setTimeout(() => {
      if (i < xs.length) {
        push(null, xs[i++]);
      }

      if (i >= xs.length) {
        push(null, hl.nil);
      } else {
        next();
      }
    }, 500);
  });
}

const source = delay(['1\n', '2\n', '3\n']);
const stream1 = source.fork();

const stream2Source = source.fork().flatMap(x => delay([(parseInt(x) * 7) + '\n']));
const stream2 = stream2Source.fork();

const stream3Source = stream2Source.fork().flatMap(x => delay([(parseInt(x) * 11) + '\n']));
const stream3 = stream3Source.fork();

const stream4 = stream3Source.fork().flatMap(x => delay([(parseInt(x) * 13) + '\n']));

stream1.pipe(process.stdout);
stream2.pipe(process.stdout);
stream3.pipe(process.stdout);
stream4.pipe(process.stdout);

It correctly outputs

1
7
77
1001
2
14
154
2002
3
21
231
3003

That's exactly my issue: Error: Stream already being consumed, you must either fork() or observe(). I should have been more detailed in my description.

I assumed everything would function exactly as you describe because all of my forks happened before any piping was done. It's still not immediately obvious to me why your solution works but I guess it has something to do with how fork works under the hood. I take it that the fork method doesn't actually spin up a stream until data starts flowing?

The main issue I was running into was that stream2 was filling stream3's buffer faster than stream3 could retrieve data from the API. I'm rate limited to 20 requests a second. So if each result from stream2 is transformed into 20+ requests for stream3, stream3 has to buffer the remaining incoming data until it can generate the additional requests. What I wanted was back pressure from stream3 so that stream2 would stop making requests to the API until stream3 was ready to process more data (and generate more requests itself). At the same time, I wanted to record each stream's output to a database, which can also only process items at a certain rate. So if stream3's output was still being persisted, I wanted flow further up the pipeline to be paused.

Rinse and repeat for the relationship between stream3 and stream4.

I'm pretty sure your solution will work, but I won't get around to testing it until tomorrow.

Thanks so much for the help!

It's still not immediately obvious to me why your solution works but I guess it has something to do with how fork works under the hood.

Sort of. The fact that the error tells you to use fork() even after you've already done so is related to how fork is implemented.

However, the fact that we throw an error at all is a general restriction we place on how transforms can be applied to streams. In general, Highland streams will only allow you to apply a transform to a stream once. So this code will throw the same Stream already being consumed error.

stream.map(...)
stream.flatMap(...) // This statement will throw

We do this because we want shared back-pressure to be explicitly opt-in, since it can cause deadlock issues. A consequence is that if you ever call fork on a stream, you're limited to only calling fork or observe on that stream.

I take it that the fork method doesn't actually spin up a stream until data starts flowing?

It creates a new stream (in the sense that stream !== stream.fork()), but you're right that it doesn't cause data to start flowing until you consume the forks later on with pipe.

Your use case

What you're doing sounds reasonable to me.

Ohhhhhhhh... you're basically alerting people to the fact that there's shared back pressure and want them to use fork or observe so that they're explicitly opting in to (a) shared back pressure or (b) simply observing a stream. I was confused by the error because I interpreted "Stream already being consumed" as meaning that the valve of the stream has already opened and that I missed data events. I thought that I had triggered stream consumption via one of the consumption methods before I finished piping everything (and, hence, missed data).

I'd recommend changing the error message to something like 'Implicit back pressure detected. Please fork or observe the stream before performing parallel transformations.' Just my 2 cents =P I understand what's going on now. Thank you!