ChunkedIO API is Incompatible with CSV.parse
chrisnicola opened this issue · comments
I'm currently noticing this when working with some CSV files I'm storing in S3 using Shrine. The Down::ChunkedIO has a private gets method which CSV.parse is trying to call (yes, I know CSV.parse shouldn't be parsing by lines, but that's what it does).
The actual error is:
NoMethodError (private method `gets' called for #<Down::ChunkedIO:0x007f7116dbbad0>):
It would probably help for compatibility with other things that process IO if the gets method was public and processed chunks until it completed a line and then returned that line.
I should add that I worked around this issue by simply calling download instead of to_io, so this is hardly a big issue. I noticed that ChunkedIO is using a Tempfile as an intermediate anyway, so it probably makes no difference. I'm curious though: why not just process the IO in memory?
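For illustration, the behaviour proposed above (reading chunks until a full line is buffered, then returning that line) could look roughly like the sketch below. ChunkedLineReader is a hypothetical class, not part of Down, and the chunk source is just an array of strings standing in for a download. It only demonstrates the buffering idea and does not by itself make an object usable with CSV.

```ruby
# Hypothetical wrapper (not part of Down): buffers chunks until a full line is available.
class ChunkedLineReader
  def initialize(chunks)
    @chunks = chunks.each # external enumerator over string chunks
    @buffer = +""         # unfrozen, mutable buffer
  end

  # Returns the next line including its trailing "\n", the final partial
  # line at the end of input, or nil when nothing is left.
  def gets
    until (index = @buffer.index("\n"))
      begin
        @buffer << @chunks.next
      rescue StopIteration
        break # no more chunks to read
      end
    end

    if index
      @buffer.slice!(0..index)
    elsif @buffer.empty?
      nil
    else
      @buffer.slice!(0..-1)
    end
  end
end

# Chunk boundaries deliberately do not line up with line boundaries.
reader = ChunkedLineReader.new(["a,b\n1,", "2\n3,4"])
lines = []
while (line = reader.gets)
  lines << line
end
```

Note that only the unread remainder is kept in the buffer here, so this sketch cannot be rewound.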
The Down::ChunkedIO has a private gets method which CSV.parse is trying to call
I was actually surprised by this, because Down::ChunkedIO doesn't define a gets method. It turns out that gets comes from Kernel and is defined on any object (just like puts), and is used for getting user input from stdin.
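A small sketch of why the error says "private method" rather than "undefined method": Kernel#gets is a private instance method, so any object inherits it privately. PlainObject below is a hypothetical stand-in for a class that, like Down::ChunkedIO at the time, defines no gets of its own.

```ruby
# Hypothetical stand-in for a class that does not define #gets itself.
class PlainObject
end

obj = PlainObject.new

# Kernel#gets is private, so it is inherited by every object but hidden from outside callers.
kernel_gets_private = Kernel.private_instance_methods.include?(:gets)
responds_publicly   = obj.respond_to?(:gets)       # public methods only
responds_privately  = obj.respond_to?(:gets, true) # include private methods

# Calling it with an explicit receiver raises NoMethodError.
error = begin
  obj.gets
  nil
rescue NoMethodError => e
  e
end
```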
It would probably help for compatibility with other things that process IO if the gets method was public and processed chunks until it completed a line and then returned that line.
Since gets is only related to parsing by lines (e.g. CSV), I don't think it would be worthwhile adding this method, especially since the CSV library also seems to depend on pos being defined on the object.
I should add that I worked around this issue by simply calling download instead of to_io, so this is hardly a big issue.
Yeah, I think this is the best solution, since the Tempfile is an actual IO object with all of the needed methods. I first thought there would be a performance gain in parsing while downloading, but I don't think that's true: during CSV parsing the downloading is paused, so the download doesn't go at full speed as it would without parsing.
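A minimal sketch of the workaround, assuming the file has already been downloaded to a Tempfile. To keep it self-contained, the Tempfile below is created locally with made-up CSV content instead of coming from Shrine or S3.

```ruby
require "csv"
require "tempfile"

# Stand-in for the Tempfile a download call would return (contents are made up).
tempfile = Tempfile.new(["example", ".csv"])
tempfile.write("name,qty\napple,3\npear,5\n")
tempfile.rewind

# Tempfile delegates to a real File, so #gets is public and reads line by line.
first_line = tempfile.gets
tempfile.rewind

# CSV can read directly from the IO object.
rows = CSV.new(tempfile, headers: true).map(&:to_h)

tempfile.close
tempfile.unlink
```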
I noticed that ChunkedIO is using a Tempfile as an intermediate anyway, so it probably makes no difference. I'm curious though: why not just process the IO in memory?
Keeping the downloaded content in memory (e.g. StringIO) would be infeasible for large files. If the file you're downloading is 4GB in size, that means that at one point the whole 4GB would be stored in memory, which could easily kill your servers. By caching the downloaded content to a file, we are using disk space rather than memory.
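The idea of caching to disk can be sketched as follows. This is not Down's actual implementation; the chunks array is a hypothetical stand-in for data arriving from the network. Each chunk is written to a Tempfile as it arrives, so only one chunk needs to be held in memory at a time, while the full content stays available for rewinding.

```ruby
require "tempfile"

# Hypothetical chunk source standing in for a streamed download.
chunks = ["first chunk, ", "second chunk, ", "third chunk"]

cache = Tempfile.new("download-cache")
cache.binmode

# Write each chunk to disk as it arrives instead of accumulating it in a String.
chunks.each { |chunk| cache.write(chunk) }

# The cached content can be re-read from the start at any time.
cache.rewind
cached = cache.read

cache.close
cache.unlink
```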
I'm aware of the issue with StringIO and memory; however, StringIO shouldn't be the only way to process data in memory, no? Couldn't this be done with just a buffer?
Hmm, I'm not sure what exactly you mean by "processing" data in memory. Down::ChunkedIO is only caching the downloaded content, which is needed in case anybody wants to rewind the object to read what was already downloaded (e.g. used by metadata extraction in Shrine).