adaltas / node-csv

Full featured CSV parser with simple api and tested against large datasets.

Home Page: https://csv.js.org

Docs: Async example for parsing a large CSV via streams

aeddie-zapidhire opened this issue

Summary

As a user, I would like an example of how to parse a very large CSV file using streams AND doing some async operations as a part of the stream pipeline.

Motivation

It is not easy to find a clear example of how to do this. It took a lot of Googling.

Draft

Could this example, or similar, be added to the docs:

Using the pipeline API (async)

import { parse } from 'csv-parse'
import fs from 'fs'
import { pipeline } from 'stream/promises'

/**
 * Do something async with a row.
 * @param {*} row A row of the CSV as an object.
 */
async function handleRow(row) {
  // Do something async
}

/**
 * Read the CSV using a stream and return the number of rows handled.
 */
async function readCsv(filePath) {
  let count = 0

  await pipeline(
    fs.createReadStream(filePath),
    parse({
      skip_empty_lines: true,
      columns: true
    }),
    async function* (source) {
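      // Final pipeline stage: consume each parsed record and process it asynchronously.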
      for await (const chunk of source) {
        yield await handleRow(chunk)
        count++
      }
    }
  )

  return count
}
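
For completeness, a hypothetical call site (the file name here is an assumption):

readCsv('data.csv')
  .then((count) => console.log(`Handled ${count} rows`))
  .catch((err) => console.error(err))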

@wdavidw well, there you go. I see the search terms I needed to use to get those pages above the fold. Apologies for the distraction.

Oh, I see what I did wrong. The iterator code was outside of the stream pipeline, so I skimmed that example too quickly and assumed it was loading all the records into memory and then iterating over them.
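
For reference, a sketch of that iterator pattern, assuming an ES module and a hypothetical handleRecord helper; the parser stream is async-iterable, so records stream through it one at a time rather than being buffered in full:

import { parse } from 'csv-parse'
import fs from 'fs'

// Hypothetical per-record handler (placeholder).
async function handleRecord(record) {
  // Do something async
}

const parser = fs.createReadStream('file.csv').pipe(parse({ columns: true }))

// Top-level await works in ES modules; each record is pulled on demand.
for await (const record of parser) {
  await handleRecord(record)
}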

I am sorry for commenting on this closed issue, but I am facing issues with pipeline + the to param of the parser. This is a minimal repro:

const { pipeline } = require("stream/promises");
const { open, readFile } = require("fs/promises");
const { parse } = require("csv-parse");
const { Readable } = require("stream");

const parser = parse({
    trim              : true,
    columns           : true,
    skip_empty_lines  : true,
    relax_column_count: true,
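    // from/to restrict which records the parser emits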
    from              : 1,
    to                : 1001
});

const fn = async () => {
    const buffer = await readFile("csv.csv");
    await pipeline(
        Readable.from(buffer),
        parser,
        async function* csvRow (source) {
            for await (const chunk of source) {
                yield chunk;
            }
        },
        async function* handleSingleRow (source) {
            for await (const row of source) {
                console.log("ROW:");
                console.log(row);
            }
            yield
        }
    );
    console.log("AFTER PIPELINE");
}

console.log("START");
fn().then(() => console.log("FINISHED")).catch(error => console.error("error", error));

csv.csv is a CSV file with 2160 records plus a header line (number,text); each record looks like:

1,test

Now executing this piece of code never prints AFTER PIPELINE or FINISHED, nor any error. The process just exits.
If I remove the
to: 1001
line from the parser options, it works and finishes with the correct logs.

Is this an oversight on my part?
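
A possible workaround sketch, assuming it is acceptable to drop the to option and stop iterating manually (readFirstRows is a placeholder name; this is a sketch, not a confirmed fix for the hang):

const { readFile } = require("fs/promises");
const { parse } = require("csv-parse");
const { Readable } = require("stream");

const readFirstRows = async () => {
    const buffer = await readFile("csv.csv");
    const parser = Readable.from(buffer).pipe(parse({
        trim              : true,
        columns           : true,
        skip_empty_lines  : true,
        relax_column_count: true
    }));
    let count = 0;
    for await (const row of parser) {
        console.log("ROW:");
        console.log(row);
        // Breaking out of for await destroys the parser stream.
        if (++count >= 1001) break;
    }
    console.log("AFTER ITERATION");
};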

Node: 16.20.0
csv-parse: 5.3.10