pbnjay / grate

A Go native tabular data extraction package. Currently supports .xls, .xlsx, .csv, .tsv formats.

Support reading workbooks from []byte or io.Reader

mgyucht opened this issue

Problem Statement

Today, the grate API requires that the workbook exists as a file on disk. However, in some scenarios, the workbook may not exist in the filesystem, such as when the workbook is part of an HTTP request body. The API itself only accepts a filename, not any more general type like []byte or io.Reader.

Context

I'm working on a small web server to which a user can upload an xls or xlsx file for analysis. For now I have to write some boilerplate (described below) to use this library. It would be somewhat simpler if the workbook could be passed directly as bytes.

Workaround

For now, I am working around this issue by:

  1. Creating a temporary file with the upload's extension and storing the request body there.
  2. Passing that file name to grate.
  3. Cleaning up the file after processing.
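The three steps above can be sketched roughly as follows. This is only an illustration: withTempFile and demo are hypothetical names (not part of grate), and the callback reads the file back with os.ReadFile where a real handler would pass the path to grate's filename-based open call.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// withTempFile is a hypothetical helper (not part of grate). It copies r into
// a temporary file, passes the file's path to fn, and removes the file once
// fn has returned.
func withTempFile(r io.Reader, pattern string, fn func(path string) error) error {
	f, err := os.CreateTemp("", pattern) // e.g. "upload-*.xlsx"
	if err != nil {
		return err
	}
	defer os.Remove(f.Name()) // step 3: clean up after processing

	// Step 1: store the request body in the temporary file.
	if _, err := io.Copy(f, r); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}

	// Step 2: hand the file name to the filename-based API.
	return fn(f.Name())
}

// demo pushes an 8-byte body through withTempFile. It reports how many bytes
// the callback saw and whether the temporary file still exists afterwards.
func demo() (n int, leftover bool, err error) {
	body := strings.NewReader("a,b\n1,2\n") // stands in for an HTTP request body
	var path string
	err = withTempFile(body, "upload-*.csv", func(p string) error {
		path = p
		data, readErr := os.ReadFile(p) // placeholder for opening the workbook
		n = len(data)
		return readErr
	})
	_, statErr := os.Stat(path)
	return n, statErr == nil, err
}

func main() {
	fmt.Println(demo())
}
```

Keeping the cleanup in a defer means the file is removed even when the callback fails, which is the part that is easy to get wrong when this boilerplate is written inline in a handler.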

Proposal

Short-term, I think it is reasonable to implement an OpenReader(r io.Reader) function which writes the contents of the reader to a temporary file. It seems that the file type used for registration only affects debug logging, not the behavior of the library, so ensuring that the file name's extension matches the true file type seems unnecessary for now.

Looking through the current providers, all of them initially open the file using os.Open and then operate on the file handle. In the future, I think it should be possible to decouple the core library from the file system. One possible idea is to define a new OpenFuncBytes type accepting an io.Reader and to modify srcOpenTab to contain both an OpenFunc and an OpenFuncBytes. When registering, you could register either with an OpenFunc (in which case the OpenFuncBytes can be generated automatically by writing the bytes to a temp file and calling OpenFunc) or with an OpenFuncBytes (in which case an OpenFunc can be generated automatically by reading the bytes from a file).
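A rough sketch of the two derived openers might look like the following. Everything here is an assumption made for illustration: the Source interface is a stand-in reduced to Close, both function signatures are guesses, and bytesFromFile, fileFromBytes, tempSource, and sizeSource are hypothetical names. The registration table itself is left out.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// Source stands in for the library's workbook handle; only Close matters here.
type Source interface{ Close() error }

// OpenFunc is the filename-based opener; OpenFuncBytes is the proposed
// reader-based one. Both signatures are assumptions for this sketch.
type OpenFunc func(filename string) (Source, error)
type OpenFuncBytes func(r io.Reader) (Source, error)

// tempSource deletes its backing temporary file when the source is closed.
type tempSource struct {
	Source
	path string
}

func (t tempSource) Close() error {
	err := t.Source.Close()
	if rmErr := os.Remove(t.path); err == nil {
		err = rmErr
	}
	return err
}

// bytesFromFile derives an OpenFuncBytes from an OpenFunc by spooling the
// reader to a temporary file and opening that file by name.
func bytesFromFile(open OpenFunc) OpenFuncBytes {
	return func(r io.Reader) (Source, error) {
		f, err := os.CreateTemp("", "grate-*")
		if err != nil {
			return nil, err
		}
		_, err = io.Copy(f, r)
		if cerr := f.Close(); err == nil {
			err = cerr
		}
		var src Source
		if err == nil {
			src, err = open(f.Name())
		}
		if err != nil {
			os.Remove(f.Name())
			return nil, err
		}
		return tempSource{Source: src, path: f.Name()}, nil
	}
}

// fileFromBytes derives an OpenFunc from an OpenFuncBytes by reading the
// whole file and passing its contents along as a reader.
func fileFromBytes(open OpenFuncBytes) OpenFunc {
	return func(filename string) (Source, error) {
		data, err := os.ReadFile(filename)
		if err != nil {
			return nil, err
		}
		return open(bytes.NewReader(data))
	}
}

// sizeSource is a toy Source that records how many bytes it received.
type sizeSource struct{ n int }

func (sizeSource) Close() error { return nil }

// demo sends 5 bytes through both adapters and reports how many arrived.
func demo() (int, error) {
	toy := OpenFuncBytes(func(r io.Reader) (Source, error) {
		data, err := io.ReadAll(r)
		return sizeSource{n: len(data)}, err
	})
	open := bytesFromFile(fileFromBytes(toy)) // reader -> temp file -> reader
	src, err := open(bytes.NewReader([]byte("hello")))
	if err != nil {
		return 0, err
	}
	defer src.Close() // also removes the temporary file
	return src.(tempSource).Source.(sizeSource).n, nil
}

func main() {
	fmt.Println(demo())
}
```

One wrinkle the sketch makes visible: the temporary file has to outlive the open call, so its removal is tied to Close on the returned source rather than to the adapter returning.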

Thank you for the detailed proposal, @mgyucht!

This is a valid synopsis of a problem that still bugs me, and your application sounds very similar to the initial application that grate was designed for. The solution you propose is exactly how we handled it on that project, but nothing about the approach is specific to grate as a larger package.

At one point I considered making the interface more generic, a la io.Reader, but I discovered that this approach would make it very difficult to handle auto-detection across all the potential file types I wanted to support. We would need at least an io.ReadSeeker to be able to rewind the stream for each attempt, but zip-based formats like xlsx require an io.ReaderAt interface, which is essentially incompatible with an efficient io.ReadSeeker.

An alternative solution would be to read the entire file into memory and read data from there. This is actually what the xls reader does under the hood, due to the mess of an encoding format that it uses. Even across the 400k files we had, this was fine because none of the xls files really approached "large" in modern terms. xls is an older format, and large files would have been unreadable on the hardware of the day.

With xlsx however, we have plenty of example files of 1GB or more. Storing this much in memory is difficult to justify in a web application (or even a stateless backend process), especially when you consider decompression overhead and the need to deserialize into usable data structures. But... these files are perfectly usable when read from disk due to the separate streams available within the zip format.

I'm not sure that any of this can be handled well with a small PR. I suspect that a viable approach might be to incorporate fs.FS into the Open interface (something like an OpenFile(FS, filename)), but this isn't really much better, other than removing the os package from the parser dependencies and avoiding some of the challenges of managing temporary files within the package.
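To make the fs.FS idea a little more concrete, here is a minimal sketch of what such an entry point could look like. The name openFile and its return value are hypothetical. To stay short it does not parse a workbook; it only sniffs the container format from the leading bytes (zip archives begin with "PK\x03\x04", OLE2 compound files with D0 CF 11 E0 A1 B1 1A E1), which is a simplification and not how grate's detection is described above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/fs"
	"testing/fstest"
)

// openFile is a hypothetical fs.FS-based entry point. It never touches the
// os package: the caller decides where the bytes come from.
func openFile(fsys fs.FS, name string) (string, error) {
	f, err := fsys.Open(name)
	if err != nil {
		return "", err
	}
	defer f.Close()

	// Read up to 8 header bytes; shorter files are not an error here.
	head := make([]byte, 8)
	n, err := io.ReadFull(f, head)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return "", err
	}
	head = head[:n]

	switch {
	case bytes.HasPrefix(head, []byte("PK\x03\x04")):
		return "zip", nil // possibly xlsx
	case bytes.HasPrefix(head, []byte("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1")):
		return "ole2", nil // possibly xls
	default:
		return "unknown", nil // possibly csv/tsv
	}
}

// demo runs openFile against an in-memory file system; os.DirFS(dir) would
// serve real files through the same call.
func demo() ([]string, error) {
	fsys := fstest.MapFS{
		"book.xlsx": {Data: []byte("PK\x03\x04...")},
		"old.xls":   {Data: []byte("\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1...")},
		"data.csv":  {Data: []byte("a,b\n1,2\n")},
	}
	var out []string
	for _, name := range []string{"book.xlsx", "old.xls", "data.csv"} {
		kind, err := openFile(fsys, name)
		if err != nil {
			return nil, err
		}
		out = append(out, name+"="+kind)
	}
	return out, nil
}

func main() {
	fmt.Println(demo())
}
```

Note that fs.File only guarantees Read, Stat, and Close, so a zip-based parser behind this interface would still need to check whether the opened file also provides ReadAt.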

Thank you so much for writing such a thorough response! I'm very new to Go (this is my first project in the language), so thanks for taking so much time to explain your reasoning behind your design decision here. If I understand, you mention two points:

  1. Zip-based formats cannot have both an efficient ReaderAt and ReadSeeker implementation. Would you mind elaborating a bit more on this? These interfaces seem pretty similar to me, and I don't understand why both can't have efficient implementations on the same data structure.
  2. Some files that this library is used to process could be large enough that it would be infeasible to store them in memory. Do the io.Reader/io.ReaderAt/io.ReadSeeker interfaces make guarantees about this data being stored in memory? Is it possible for the user to choose a suitable implementation which could, for example, swap sections of the underlying data structure to disk to limit memory usage?

  1. The zip package only requires ReaderAt, but auto-detection would require the ability to reset the read pointer to the beginning of the file before each format detector runs. A ReadSeeker has to maintain a read-head pointer, so to ALSO implement ReaderAt the implementation would have to save the read pointer, jump to the requested offset for ReadAt, then jump back to the saved pointer (and lock to a single thread). Especially with syscall overhead, this would be very slow!

https://pkg.go.dev/io#ReaderAt

If ReadAt is reading from an input source with a seek offset, ReadAt should not affect nor be affected by the underlying seek offset.

Clients of ReadAt can execute parallel ReadAt calls on the same input source.

  2. If you load everything in memory, bytes.Reader already implements both interfaces. I’m sure that someone has implemented an mmap-backed file interface that would work, but that would A) definitely be outside the scope of this project, and B) be highly system dependent. As of right now, grate runs on basically everything I’ve thrown it at!
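As a small illustration of the in-memory case, the sketch below uses one bytes.Reader both ways: sequentially (with a rewind) for a header check, and as an io.ReaderAt for archive/zip. The helper names buildZip and firstEntry are made up for the example, and the archive is a tiny stand-in, not a real xlsx file.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// buildZip returns a small in-memory zip archive with a single entry,
// standing in for an uploaded xlsx file.
func buildZip() ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("xl/workbook.xml")
	if err != nil {
		return nil, err
	}
	if _, err := w.Write([]byte("<workbook/>")); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// firstEntry checks the zip signature through the io.ReadSeeker side of
// bytes.Reader, rewinds, then opens the archive through its io.ReaderAt side.
func firstEntry(data []byte) (string, error) {
	r := bytes.NewReader(data)

	head := make([]byte, 4)
	if _, err := io.ReadFull(r, head); err != nil {
		return "", err
	}
	if !bytes.Equal(head, []byte("PK\x03\x04")) {
		return "", fmt.Errorf("not a zip archive")
	}

	// Rewind so that a following detector would start at byte 0 again.
	// ReadAt calls below do not depend on this offset.
	if _, err := r.Seek(0, io.SeekStart); err != nil {
		return "", err
	}

	zr, err := zip.NewReader(r, r.Size())
	if err != nil {
		return "", err
	}
	if len(zr.File) == 0 {
		return "", fmt.Errorf("empty archive")
	}
	return zr.File[0].Name, nil
}

func main() {
	data, err := buildZip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(firstEntry(data))
}
```

This only works because the whole upload is already in a byte slice, which is exactly the memory cost discussed above for large xlsx files.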

We could probably design an interface (something like a “ResetableReaderAt”) to support your use case. If we can make a lightweight wrapper on Files for that, it might work nicely.
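One possible shape for such an interface, purely as a guess at what is meant: the method set below (Read, ReadAt, Reset, Size) and the names fileReader and openResetable are assumptions, not an agreed design. The wrapper leans on io.SectionReader, which keeps its own read offset and serves sequential reads through ReadAt on the underlying file, so rewinding never touches the file's seek position.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// ResetableReaderAt is a hypothetical interface: random access for zip-based
// formats plus a way to rewind between format detectors.
type ResetableReaderAt interface {
	io.Reader
	io.ReaderAt
	Reset() error
	Size() int64
}

// fileReader wraps an *os.File behind an io.SectionReader.
type fileReader struct {
	*io.SectionReader
	f *os.File
}

// Compile-time check that the wrapper satisfies the interface.
var _ ResetableReaderAt = (*fileReader)(nil)

func openResetable(path string) (*fileReader, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	info, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, err
	}
	return &fileReader{SectionReader: io.NewSectionReader(f, 0, info.Size()), f: f}, nil
}

// Reset moves the sequential read position back to the start.
func (r *fileReader) Reset() error {
	_, err := r.Seek(0, io.SeekStart)
	return err
}

func (r *fileReader) Close() error { return r.f.Close() }

// demo reads sequentially, reads at an offset, resets, and reads again.
func demo() (string, error) {
	tmp, err := os.CreateTemp("", "resetable-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.WriteString("hello world"); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}

	r, err := openResetable(tmp.Name())
	if err != nil {
		return "", err
	}
	defer r.Close()

	buf := make([]byte, 5)
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	first := string(buf)
	if _, err := r.ReadAt(buf, 6); err != nil {
		return "", err
	}
	at := string(buf)
	if err := r.Reset(); err != nil {
		return "", err
	}
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return first + "|" + at + "|" + string(buf), nil
}

func main() {
	fmt.Println(demo())
}
```

This still requires the data to exist as a file, so on its own it would not remove the need for a temporary file when the workbook arrives in a request body.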