`[needed example]` streaming processing
harold opened this issue
An example like the following would be very valuable:
Fifty 1GB .csv files (each with its own copy of the headers on the first row) need to be concatenated.
What is the fastest, most reliable way? --- we don't want to wait any longer than necessary, and we need to guarantee the process never OOMs.
Question: Is it worth thinking about the case where a second copy won't fit on the disk? Or do we assume we can always keep the pre-concat copies together with the big output? (seems safe)
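For reference, here is a minimal language-agnostic sketch of the streaming concat (in Python, not the charred API): write the header from the first file once, skip it in the rest, and push rows through one at a time so memory stays constant no matter how large the inputs are. The helper name `concat_csvs` is made up for illustration.

```python
import csv

def concat_csvs(paths, out_stream):
    """Stream-concatenate CSV files that each carry their own header row.

    The header of the first file is written once; the header rows of the
    remaining files are skipped. Rows are streamed one at a time, so
    memory use is constant regardless of total input size (no OOM risk).
    """
    writer = csv.writer(out_stream)
    header_written = False
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)          # every file has its own header
            if not header_written:
                writer.writerow(header)    # keep only the first copy
                header_written = True
            for row in reader:
                writer.writerow(row)
```

A real implementation would also want to assert that each file's header matches the first one before silently dropping it.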
Bonus points: some column in the middle is mm/dd/yyyy; convert it to yyyy-mm-dd during the processing.
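The mm/dd/yyyy rewrite can stay a pure string shuffle, which avoids full datetime parsing in the per-row hot loop. A hypothetical helper (illustration only, not part of charred):

```python
def to_iso_date(us_date):
    """Convert an 'mm/dd/yyyy' string to 'yyyy-mm-dd'.

    Pure string manipulation: no datetime object is built, which keeps
    the per-row cost low when this runs on millions of rows.
    """
    mm, dd, yyyy = us_date.split("/")
    return f"{yyyy}-{mm.zfill(2)}-{dd.zfill(2)}"
```

The helper would plug into the streaming loop as a per-row transform on the relevant column index.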
Extra bonus points: two arbitrary columns are (1) mm/dd (with the year expressed in the filename) and (2) hh:mm:ss.nnn (with the TZ expressed in the filename) --- convert them to something TMD will turn into an #inst automatically.
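Since #inst accepts RFC 3339 timestamps, one way to do this is to splice the filename-derived year and zone offset into the two column values per row, yielding a single ISO-8601 instant string. A sketch under those assumptions (the helper name and the `tz_offset` format, e.g. "Z" or "+05:00", are illustrative):

```python
def to_instant(mmdd, hms, year, tz_offset):
    """Combine an 'mm/dd' column, an 'hh:mm:ss.nnn' column, and the
    filename-derived year and zone offset into an RFC 3339 instant
    string, e.g. '2021-03-07T12:00:00.000Z'."""
    mm, dd = (part.zfill(2) for part in mmdd.split("/"))
    return f"{year}-{mm}-{dd}T{hms}{tz_offset}"
```

The year and offset are fixed per input file, so they can be parsed once from each filename before the row loop starts.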
Question: Preserve the pre-processed columns, or not?
https://cnuernber.github.io/charred/charred.bulk.html
Another operation in this space would be split-csv - left as an exercise to the reader for now :-)