ChunkedIO API is Incompatible with CSV.parse

Question

chrisnicola opened this issue 8 years ago · comments

I'm currently noticing this when working with some CSV files I'm storing in S3 using Shrine. Down::ChunkedIO has a private gets method, which CSV.parse is trying to call (yes, I know CSV.parse shouldn't be parsing by lines, but that's what it does).

The actual error is:

NoMethodError (private method `gets' called for #<Down::ChunkedIO:0x007f7116dbbad0>):

It would probably help compatibility with other things that process IO if the gets method were public and processed chunks until it completed a line, and then returned that line.
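A minimal sketch of the behaviour proposed above, for illustration only. LineBufferedReader is a hypothetical wrapper, not part of Down, and it assumes the chunks arrive as an enumerable of strings. It pulls chunks until the buffer contains a full line and then returns that line.

```ruby
# Hypothetical wrapper (not part of Down): assembles lines from arbitrary chunks.
class LineBufferedReader
  def initialize(chunks)
    @chunks = chunks.each # Enumerator over String chunks
    @buffer = +""
  end

  # Returns the next line including its separator, or nil at end of input.
  def gets(separator = "\n")
    # Keep pulling chunks until the buffer holds a complete line.
    until (index = @buffer.index(separator))
      begin
        @buffer << @chunks.next
      rescue StopIteration
        break # source exhausted; index stays nil
      end
    end

    if index
      @buffer.slice!(0, index + separator.length)
    elsif @buffer.empty?
      nil
    else
      @buffer.slice!(0, @buffer.length) # last line, no trailing separator
    end
  end
end

# Chunk boundaries deliberately fall in the middle of lines.
reader = LineBufferedReader.new(["id,na", "me\n1,Al", "ice\n2,Bob"])
while (line = reader.gets)
  puts line.inspect
end
```

This covers gets only. A parser may call other IO methods as well, so a wrapper like this is not guaranteed to be enough for CSV.parse.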

I should add that I worked around this issue by simply calling download instead of to_io, so this is hardly a big issue. I noticed that ChunkedIO is using a Tempfile as an intermediate anyway, so it probably makes no difference. I'm curious though: why not just process the IO in memory?

Janko Marohnić · Answer 1 · Tue Nov 22 2016 11:23:03 GMT+0800 (China Standard Time)

The Down::ChunkedIO has a private gets method which CSV.parse is trying to call

I was actually surprised by this, because Down::ChunkedIO doesn't define a gets method. It turns out that gets comes from Kernel and is defined on any object (just like puts), and is used for getting the user input from stdin.
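The following sketch shows why the error mentions a private method even though the class never defines gets. The helper gets_visibility is a hypothetical name introduced for illustration; the calls inside it are standard Ruby reflection methods.

```ruby
# Hypothetical helper: reports how `gets` is visible on an arbitrary object.
def gets_visibility(obj)
  {
    # Kernel defines gets as a private instance method...
    defined_in_kernel: Kernel.private_instance_methods.include?(:gets),
    # ...so it is not part of the object's public interface...
    public: obj.respond_to?(:gets),
    # ...but it is found once private methods are included in the lookup.
    including_private: obj.respond_to?(:gets, true)
  }
end

plain = Object.new
p gets_visibility(plain)

# Calling a private method with an explicit receiver raises NoMethodError,
# which is the same kind of failure reported for Down::ChunkedIO.
begin
  plain.gets
rescue NoMethodError => e
  puts e.class
end
```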

It would probably help compatibility with other things that process IO if the gets method were public and processed chunks until it completed a line, and then returned that line.

Since gets is only related to parsing by lines (e.g. CSV), I don't think it would be worthwhile adding this method, especially since the CSV library also seems to depend on pos being defined on the object.

I should add that I worked around this issue by simply calling download instead of to_io, so this is hardly a big issue.

Yeah, I think this is the best solution, since the Tempfile is an actual IO object with all of the needed methods. I first thought there would be a performance gain in parsing while downloading, but I don't think that's true: the download is paused during CSV parsing, so it doesn't run at full speed as it would without parsing.
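A minimal sketch of the workaround, assuming the downloaded file is available as a regular file object. The Tempfile here is created locally as a stand-in for the downloaded file, and parse_csv_file is a hypothetical helper name.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: parses a real file object, which provides the IO
# methods that CSV relies on.
def parse_csv_file(file)
  file.rewind
  CSV.parse(file, headers: true).map(&:to_h)
end

# Stand-in for the downloaded file from the thread.
Tempfile.create(["example", ".csv"]) do |file|
  file.write("id,name\n1,Alice\n2,Bob\n")
  p parse_csv_file(file)
end
```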

I noticed that ChunkedIO is using a Tempfile as an intermediate anyways so it probably makes no difference. I'm curious though why not just process the IO in memory?

Keeping the downloaded content in memory (e.g. StringIO) would be infeasible for large files. If the file you're downloading is 4GB in size, that means that at one point the whole 4GB would be stored in memory, which could easily kill your servers. By caching the downloaded content to a file, we are using disk space rather than memory.

Chris Nicola · Answer 2 · Wed Nov 23 2016 05:50:23 GMT+0800 (China Standard Time)

I'm aware of the issue with StringIO and memory; however, StringIO shouldn't be the only way to process data in memory, no? Couldn't this be done with just a buffer?

Janko Marohnić · Answer 3 · Wed Nov 23 2016 10:51:00 GMT+0800 (China Standard Time)

Hmm, I'm not sure what exactly you mean by "processing" data in memory. Down::ChunkedIO is only caching the downloaded content, which is needed in case anybody wants to rewind the object to read what was already downloaded (e.g. used by metadata extraction in Shrine).
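To illustrate the caching idea, here is a rough sketch. It is not Down's actual implementation: DiskCachedReader and read_chunk are hypothetical names, and the source is assumed to be an enumerable of string chunks. Each chunk is written to a Tempfile as it is consumed, so rewinding replays the data from disk while only one piece at a time is held in memory.

```ruby
require "tempfile"

# Hypothetical illustration, not Down's implementation.
class DiskCachedReader
  def initialize(chunks)
    @chunks = chunks.each # Enumerator over String chunks
    @cache = Tempfile.new("cached-download")
    @cache.binmode
  end

  # Returns the next piece of data, or nil at end of input.
  def read_chunk(max_cached = 16 * 1024)
    # After a rewind, replay what is already on disk first.
    return @cache.read(max_cached) unless @cache.eof?

    # Otherwise pull a fresh chunk from the source and cache it.
    chunk = @chunks.next
    @cache.write(chunk)
    chunk
  rescue StopIteration
    nil
  end

  # Moves back to the start of the cached content.
  def rewind
    @cache.rewind
  end

  # Closes and deletes the cache file.
  def close
    @cache.close!
  end
end

reader = DiskCachedReader.new(["id,name\n", "1,Alice\n"])
while (chunk = reader.read_chunk)
  puts chunk.inspect
end

reader.rewind # the second pass is served from the Tempfile
while (chunk = reader.read_chunk)
  puts chunk.inspect
end
reader.close
```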