adaltas / node-csv

Full featured CSV parser with simple api and tested against large datasets.

Home Page: https://csv.js.org

csv-parse stream will process the whole buffer even with back pressure.

dobesv opened this issue

Describe the bug

Using csv-parse@5.5.2, I have found that if I provide a Buffer or stream as input to the parser, it will always emit every row contained in a chunk it receives, even if there is back-pressure. It only applies back-pressure between the chunks it receives from its input.

To Reproduce

import assert from 'assert'
import { parse, Parser } from 'csv-parse'
import { pipeline, Readable, Writable, WritableOptions } from 'stream';


// Create the parser
const parser: Parser = parse();
parser.on('data', (row) => {
    console.log('parser row', row, (parser as any)._readableState.length);
})

const bufs = [];
for(let i=0; i < 100000; i++) {
    bufs.push(Buffer.from(`a${i}, b${i}, ${i}\r\n`));
}
const inputBuffer = Buffer.concat(bufs);
const input = Readable.from([inputBuffer]);
input.on('data', (chunk) => {
    console.log('input chunk', chunk.length);
})

class BackpressureWritable extends Writable {
    count: number;
    threshold: number;

    constructor(options: WritableOptions) {
        super(options);
        this.count = 0;
        this.threshold = 10;
    }

    // @ts-ignore
    write(chunk, encoding, callback) {
        const result = super.write(chunk, encoding, callback);
        console.log(`write(${chunk.toString()}) => ${result}`);
        return result;
    }

    _write(chunk: any, encoding: string, callback: any) {
        this.count++;
        console.log(`_write(${chunk.toString()})`);

        setTimeout(callback, this.count); // Simulating delay to handle backpressure
    }
}


const output = new BackpressureWritable({objectMode: true, highWaterMark: 1});
pipeline(input, parser, output, () => {
    console.log('pipeline output');
});

output.on('finish', () => {
    console.log('finish');
});

If you run the above script you will see that (parser as any)._readableState.length increments to include all rows immediately - all the rows are buffered into the Readable half of the parser.

In some cases a user of the library may want to pass in a buffer of many MB, expecting that it will be processed in small batches (say, bounded by the stream's high-water mark). However, with this library all the rows will be parsed immediately, using a lot more memory than necessary.

In order to fix this, the library should check the return value of push, and if it is false it should pause parsing even if it has enough data buffered from the input to read another record. I'm not actually sure how to know when it is OK to call push again, though. The documentation isn't clear on this point.
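
For what it's worth, here is the general pattern I mean, sketched with a plain Readable rather than csv-parse's internals (the class and row shape are made up for illustration): push() returning false means stop producing and wait until _read() is called again.

import { Readable } from 'stream';

// Illustration only, not csv-parse internals: a Readable that stops producing
// rows as soon as push() returns false and resumes when _read() is called again.
class RowSource extends Readable {
    private next = 0;

    constructor(private total: number) {
        super({ objectMode: true });
    }

    _read() {
        while (this.next < this.total) {
            const row = [`a${this.next}`, `b${this.next}`, `${this.next}`];
            this.next++;
            // push() returning false means the internal buffer is full:
            // stop here and wait for the next _read() call.
            if (!this.push(row)) return;
        }
        this.push(null); // no more rows
    }
}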

I messed around with this for a while trying to figure out a solution, but this seems to be a case that just isn't handled by the Transform API. After the last buffer is provided to the transform, it will call _flush and close the stream, and you don't get another chance to push the remaining data even if there was back pressure. Kind of an odd gap in the API.

Anyway, my workaround for now is to split up the incoming buffers so that they are always small; that seems to work OK. But with this issue lingering, I suppose others will be bitten by the same issue in the future.
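
Concretely, the workaround looks something like this (a sketch of my approach; the 16 KiB slice size is arbitrary):

import { Readable } from 'stream';

// Workaround sketch: instead of handing the parser one giant buffer, feed it
// small slices so back-pressure can take effect between slices. subarray()
// returns views into the original buffer, so nothing is copied.
function* splitBuffer(buffer: Buffer, chunkSize = 16 * 1024): Generator<Buffer> {
    for (let offset = 0; offset < buffer.length; offset += chunkSize) {
        yield buffer.subarray(offset, offset + chunkSize);
    }
}

// inputBuffer is the large Buffer built in the repro script above.
const input = Readable.from(splitBuffer(inputBuffer));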

Is it not the responsibility of the stream reader (input in your case) to provide smaller chunks? If someone doesn't control the chunk size of their input, maybe they could insert a custom transformer between the input and the parser. From the parser's standpoint, processing all the data it receives seems fair.
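
Such an intermediate transformer could look roughly like this (a sketch only; the 16 KiB slice size is arbitrary):

import { Transform, TransformCallback } from 'stream';

// Sketch of an intermediate transformer: it re-chunks whatever flows through
// it into small slices, so the parser downstream never receives one huge
// chunk and back-pressure applies between slices. The slices are subarray()
// views into the original chunk, so no data is copied.
class Rechunker extends Transform {
    constructor(private chunkSize = 16 * 1024) {
        super();
    }

    _transform(chunk: Buffer, _encoding: string, callback: TransformCallback) {
        for (let offset = 0; offset < chunk.length; offset += this.chunkSize) {
            this.push(chunk.subarray(offset, offset + this.chunkSize));
        }
        callback();
    }
}

// Usage: pipeline(input, new Rechunker(), parser, output, ...)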

I don't think it's unreasonable to expect that even if you pass the entire large dataset as a string or Buffer, it would still be processed in chunks to reduce memory usage. A 300MB Buffer of CSV data could easily use several GB of RAM if you parse all the rows in advance. Putting 300MB in RAM might not be a big deal, but you'd expect the rows to be parsed in small batches so the additional memory usage stays minimal.

At least, this issue pretty much took me 6 hours to figure out and work around yesterday, so I filed it here, as I could imagine others being unpleasantly surprised by it.

The workaround isn't super hard to implement once you know it's needed, but it's not very obvious that it is.

With a bit more research, I think resolving this would probably require implementing Duplex directly instead of using Transform. Kind of a pain, but the Node.js Transform API doesn't have any built-in concept of one-sided back-pressure; it basically assumes the input and output are of similar size and passes all back-pressure upstream.
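
To illustrate what I mean (a rough sketch only, splitting lines rather than parsing CSV, and not csv-parse's code): a Duplex can hold back each write callback until the read side has drained the rows produced from that chunk, which decouples back-pressure on the two sides.

import { Duplex } from 'stream';

// Rough sketch, not csv-parse's code: a Duplex that splits incoming chunks
// into lines and only acknowledges each written chunk once the read side has
// drained the lines produced from it, giving per-row back-pressure instead of
// per-chunk back-pressure.
class LineSplitter extends Duplex {
    private rows: string[] = [];
    private ackWrite: ((error?: Error | null) => void) | null = null;
    private remainder = '';

    constructor() {
        // Object mode on the readable side only: one row per push().
        super({ readableObjectMode: true });
    }

    _write(chunk: Buffer, _encoding: string, callback: (error?: Error | null) => void) {
        const lines = (this.remainder + chunk.toString()).split('\r\n');
        this.remainder = lines.pop() ?? '';
        this.rows.push(...lines);
        this.ackWrite = callback; // acknowledged later, once these rows are drained
        this.drain();
    }

    _final(callback: (error?: Error | null) => void) {
        // _final only runs after the last _write has been acknowledged, so
        // only a trailing partial line can remain at this point.
        if (this.remainder) this.push(this.remainder);
        this.push(null);
        callback();
    }

    _read() {
        this.drain();
    }

    private drain() {
        // Push rows only while the readable side reports capacity.
        while (this.rows.length > 0) {
            if (!this.push(this.rows.shift())) break;
        }
        // Only ask for the next chunk once this one's rows have been consumed.
        if (this.rows.length === 0 && this.ackWrite) {
            const cb = this.ackWrite;
            this.ackWrite = null;
            cb();
        }
    }
}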

In case you're still looking for a solution here, I wrote my own back-pressure-aware Transform stream. You can find it here:

https://github.com/sciactive/nephele/blob/37bac308b75fd660ebbbe0d93ed86b504237d20c/packages/plugin-encryption/src/BackPressureTransform.ts

And here's an example of it being used:

https://github.com/sciactive/nephele/blob/37bac308b75fd660ebbbe0d93ed86b504237d20c/packages/plugin-encryption/src/EncryptionProxyResource.ts#L115

I'm thinking about making this a separate NPM package.

Ok, here it is:

https://www.npmjs.com/package/@sciactive/back-pressure-transform

Try it out and see if it works for you.