`[needed example]` streaming processing
harold opened this issue
An example like the following would be very valuable:
Fifty 1GB .csv files (each with its own copy of the headers on the first row) need to be concatenated.
What is the fastest, most reliable way? --- we don't want to wait any longer than necessary, and we need to guarantee the process never OOMs.
Question: Is it worth thinking about the case where a second copy won't fit on the disk? Or do we assume we can always keep the pre-concat copies together with the big output? (seems safe)
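For reference, here is a minimal language-agnostic sketch of the streaming concat (in Python, not the charred API): write the header from the first file once, skip it in the rest, and push rows through one at a time so memory stays constant no matter how large the inputs are. The helper name `concat_csvs` is made up for illustration.

```python
import csv

def concat_csvs(paths, out_stream):
    """Stream-concatenate CSV files that each carry their own header row.

    The header of the first file is written once; the header rows of the
    remaining files are skipped. Rows are streamed one at a time, so
    memory use is constant regardless of total input size (no OOM risk).
    """
    writer = csv.writer(out_stream)
    header_written = False
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)          # every file has its own header
            if not header_written:
                writer.writerow(header)    # keep only the first copy
                header_written = True
            for row in reader:
                writer.writerow(row)
```

A real implementation would also want to assert that each file's header matches the first one before silently dropping it.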
Bonus points: some column in the middle is mm/dd/yyyy; convert it to yyyy-mm-dd during the processing.
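The mm/dd/yyyy rewrite can stay a pure string shuffle, which avoids full datetime parsing in the per-row hot loop. A hypothetical helper (illustration only, not part of charred):

```python
def to_iso_date(us_date):
    """Convert an 'mm/dd/yyyy' string to 'yyyy-mm-dd'.

    Pure string manipulation: no datetime object is built, which keeps
    the per-row cost low when this runs on millions of rows.
    """
    mm, dd, yyyy = us_date.split("/")
    return f"{yyyy}-{mm.zfill(2)}-{dd.zfill(2)}"
```

The helper would plug into the streaming loop as a per-row transform on the relevant column index.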
Extra bonus points: two arbitrary columns are (1) mm/dd (with the year expressed in the filename) and (2) hh:mm:ss.nnn (with the TZ expressed in the filename) --- convert them to something TMD will turn into an #inst automatically.
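Since #inst accepts RFC 3339 timestamps, one way to do this is to splice the filename-derived year and zone offset into the two column values per row, yielding a single ISO-8601 instant string. A sketch under those assumptions (the helper name and the `tz_offset` format, e.g. "Z" or "+05:00", are illustrative):

```python
def to_instant(mmdd, hms, year, tz_offset):
    """Combine an 'mm/dd' column, an 'hh:mm:ss.nnn' column, and the
    filename-derived year and zone offset into an RFC 3339 instant
    string, e.g. '2021-03-07T12:00:00.000Z'."""
    mm, dd = (part.zfill(2) for part in mmdd.split("/"))
    return f"{year}-{mm}-{dd}T{hms}{tz_offset}"
```

The year and offset are fixed per input file, so they can be parsed once from each filename before the row loop starts.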
Question: Preserve the pre-processed columns, or not?
https://cnuernber.github.io/charred/charred.bulk.html
Another operation in this space would be split-csv - left as an exercise to the reader for now :-)