Fail parsing large files
bertrandmartel opened this issue
There is an issue when parsing large files. I tested with a 1.4 GB JSON file and it throws:
buffer.js:490
throw new Error('toString failed');
^
Error: toString failed
at Buffer.toString (buffer.js:490:11)
at StringDecoder.write (string_decoder.js:130:21)
at StripBOMWrapper.write (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/bom-handling.js:35:28)
at Object.decode (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/iconv-lite/lib/index.js:38:23)
at /home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/bin/dsv2json:27:35
at ReadStream.<anonymous> (/home/user/.nvm/versions/node/v5.10.0/lib/node_modules/d3-dsv/node_modules/rw/lib/rw/read-file.js:22:33)
at emitNone (events.js:85:20)
at ReadStream.emit (events.js:179:7)
at endReadableNT (_stream_readable.js:913:12)
at _combinedTickCallback (internal/process/next_tick.js:74:11)
at process._tickCallback (internal/process/next_tick.js:98:9)
I've found this link which illustrates the same issue with big files
You can test it with
wget http://download.geonames.org/export/dump/allCountries.zip
unzip allCountries.zip
sed -i '1s/^/geonameid\tname\tasciiname\talternatenames\tlatitude\tlongitude\tfeature_class\tfeature_code\tcountry_code\tcc2\tadmin1_code\tadmin2_code\tadmin3_code\tadmin4_code\tpopulation\televation\tdem\ttimezone\tmodification_date\n/' allCountries.txt
time tsv2json < allCountries.txt > allCountries-pre.json
Do you have a recommended way to parse big files, using either the command line or the API?
Note that it works well with csv-parser:
cat allCountries.txt | csv-parser -s $'\t' > allCountries-pre.json
This is not a streaming parser, so it is subject to Node’s buffer size limitations. This failure is occurring before it even gets to parsing; it’s just trying to decode the input file bytes into a string.
The way to fix this is to rewrite this library to be streaming. That’s doable, but it requires a new API. (The CLI could remain unchanged, however.) This request has already been filed at #20. It’d be a nice improvement, but I have no immediate plans to work on it.
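For illustration only, here is a minimal sketch of what a streaming conversion could look like. It is not d3-dsv's API: `tsvRows` and `writeJson` are hypothetical names, and the sketch assumes the input has no quoted fields (true of the geonames dump above), so every newline ends a row and every tab ends a field. It also assumes a Node version with async generators (10 or later), which is newer than the v5.10.0 in the stack trace. Memory use stays bounded because only the current chunk and one partial line are held at a time.

```javascript
const { StringDecoder } = require('string_decoder');
const { once } = require('events');

// Hypothetical helper (not part of d3-dsv): yields one object per TSV row.
// Assumption: no quoted fields, so "\n" always ends a row and "\t" a field.
async function* tsvRows(chunks) {
  const decoder = new StringDecoder('utf8'); // handles characters split across chunks
  let columns = null; // header names, taken from the first non-empty line
  let pending = '';   // text seen after the last newline so far

  function toRow(line) {
    line = line.replace(/\r$/, ''); // tolerate CRLF line endings
    if (line === '') return null;   // skip blank lines
    const values = line.split('\t');
    if (columns === null) {
      columns = values;
      return null;
    }
    const row = {};
    columns.forEach((name, i) => {
      row[name] = values[i] === undefined ? '' : values[i];
    });
    return row;
  }

  for await (const chunk of chunks) {
    pending += typeof chunk === 'string' ? chunk : decoder.write(chunk);
    const lines = pending.split('\n');
    pending = lines.pop(); // the last piece may be an incomplete line
    for (const line of lines) {
      const row = toRow(line);
      if (row !== null) yield row;
    }
  }

  // Flush whatever remains after the final chunk.
  pending += decoder.end();
  const last = toRow(pending);
  if (last !== null) yield last;
}

// Hypothetical helper: writes the rows as one JSON array, waiting for
// 'drain' whenever the output stream signals backpressure.
async function writeJson(input, output) {
  let separator = '[';
  for await (const row of tsvRows(input)) {
    if (!output.write(separator + JSON.stringify(row))) {
      await once(output, 'drain');
    }
    separator = ',';
  }
  output.write(separator === '[' ? '[]\n' : ']\n');
}
```

Calling `writeJson(process.stdin, process.stdout)` from a script would let it be used in a pipeline in the same way as the `tsv2json` command above. Unlike d3-dsv, this sketch does not handle quoted fields or embedded newlines, so it is not a general replacement.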