d3 / d3-dsv

A parser and formatter for delimiter-separated values, such as CSV and TSV.

Home Page: https://d3js.org/d3-dsv

Fail parsing large files

bertrandmartel opened this issue

There is an issue when parsing large files. I tested with a 1.4G JSON file and it throws:

    throw new Error('toString failed');

Error: toString failed
    at Buffer.toString (buffer.js:490:11)
    at StringDecoder.write (string_decoder.js:130:21)
    at StripBOMWrapper.write (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/bom-handling.js:35:28)
    at Object.decode (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/index.js:38:23)
    at /home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/bin/dsv2json:27:35
    at ReadStream.<anonymous> (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/rw/lib/rw/read-file.js:22:33)
    at emitNone (events.js:85:20)
    at ReadStream.emit (events.js:179:7)
    at endReadableNT (_stream_readable.js:913:12)
    at _combinedTickCallback (internal/process/next_tick.js:74:11)
    at process._tickCallback (internal/process/next_tick.js:98:9)

I've found this link which illustrates the same issue with big files

You can test it with:

    wget http://download.geonames.org/export/dump/allCountries.zip
    unzip allCountries.zip
    sed -i '1s/^/geonameid\tname\tasciiname\talternatenames\tlatitude\tlongitude\tfeature_class\tfeature_code\tcountry_code\tcc2\tadmin1_code\tadmin2_code\tadmin3_code\tadmin4_code\tpopulation\televation\tdem\ttimezone\tmodification_date\n/' allCountries.txt
    time tsv2json < allCountries.txt > allCountries-pre.json

Do you have a recommended way to parse big files, either from the command line or via the API?

Note that it works well with csv-parser:

    cat allCountries.txt | csv-parser -s $'\t' > allCountries-pre.json

This is not a streaming parser, so it is subject to Node’s buffer size limitations. This failure is occurring before it even gets to parsing; it’s just trying to decode the input file bytes into a string.

The way to fix this is to rewrite this library to be streaming. That’s doable, but it requires a new API. (The CLI could remain unchanged, however.) This request has already been filed at #20. It’d be a nice improvement; however, I have no immediate plans to work on it.
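
For anyone hitting the same limit in the meantime, here is a rough sketch of what a line-at-a-time conversion could look like using only Node's built-in `readline` module. It is not part of d3-dsv and not how a streaming version of this library would necessarily work. The helper names `createRowParser` and `convert` are made up for illustration. It also assumes that no field contains a quoted tab or newline (d3-dsv itself handles quoting; this sketch does not), and it emits one JSON object per line rather than a single JSON array.

```javascript
const readline = require("readline");
const { once } = require("events");

// Hypothetical helper (not a d3-dsv API): returns a function that converts
// one input line into a row object. The first line is treated as the header.
// Assumption: fields never contain quoted delimiters or newlines.
function createRowParser(delimiter) {
  let columns = null;
  return function parseLine(line) {
    const fields = line.split(delimiter);
    if (columns === null) {
      columns = fields; // remember the header, emit nothing for it
      return null;
    }
    const row = {};
    columns.forEach((name, i) => {
      // Missing trailing fields become empty strings in this sketch.
      row[name] = fields[i] === undefined ? "" : fields[i];
    });
    return row;
  };
}

// Hypothetical helper: reads `input` line by line and writes one JSON object
// per line to `output`, so the whole file is never held in a single string.
// Resolves with the number of data rows written.
async function convert(input, output, delimiter = "\t") {
  const parseLine = createRowParser(delimiter);
  const lines = readline.createInterface({ input, crlfDelay: Infinity });
  let count = 0;
  for await (const line of lines) {
    if (line === "") continue; // skip blank lines
    const row = parseLine(line);
    if (row === null) continue; // header line
    // Respect backpressure: wait for 'drain' if the output buffer is full.
    if (!output.write(JSON.stringify(row) + "\n")) {
      await once(output, "drain");
    }
    count++;
  }
  return count;
}

module.exports = { createRowParser, convert };
```

Calling `convert(process.stdin, process.stdout)` at the bottom of a script would let it be used in a pipeline the same way as `tsv2json`, with the difference that the output is newline-delimited JSON.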