Make concurrent writer part of the API

Question

Make concurrent writer part of the API

ennerf opened this issue 6 years ago · comments

Below are some thoughts regarding making concurrent writes part of the supported API. Overall, there are 4 different cases when writing MAT files.

uncompressed: all sizes are known beforehand, and all entries can be streamed into the final output. This is mostly limited by memory, so multiple threads wouldn't help
single threaded compression with seek support (default): entries get compressed into the root sink and the size gets updated after the data is added.
single threaded compression without seek support: entries need to be compressed into a temporary buffer so that the size can be added without seeking. Does not support writing subsystem offsets.
concurrent compression: entries are compressed into temporary buffers and then combined into a single file. Requires seeking when writing subsystem offsets.

Currently, cases 1 & 2 are implemented by Mat5Writer, and 4 is implemented by the experimental ConcurrentMat5Writer. It'd be good to add official support for both.

Case 3 is currently not supported. However, I'm not sure whether writing to a non-seeking root sink is a common enough use case to add it. I expect the output in >99% of the cases to be memory (supported by ByteBuffer) or a file (supported by streaming and mapped files). The only other use case I can think of is streaming a mat file into e.g. a TCP stream, but for that we could probably write to memory or file first. Additionally, the same behavior can be achieved by resetting a buffer and writing individual entries into a single ByteBuffer sink.

Combining 4 with 1 & 2 gets a bit weird in case users enable/disable compression when doing incremental writes. Writing uncompressed data directly would change the order of the entries. This could be fixed by

set compression level in factory method and remove setter from the API: changing compression level is likely a very little used feature. This would also allow the deflating sink to be reused (single threaded case) by clearing the deflator state
write uncompressed into temporary buffer: wastes a buffer copy that isn't needed
write uncompressed in finalization step
write uncompressed immediately, and ignore that some entries may potentially be re-ordered (although the Subsystem still needs to be at the end)

If the compression level is fixed in the constructor, Mat5Writer could become an abstract class and uncompressed / compressed / concurrent could be different implementations. Would that make sense, and what should the API look like? The concurrent writer would need an additional step for combining the temporary buffers into a single file.

Would this make sense in the close() method?
Should users be required to always call close() even on single threaded writers? Would there be a case where the sink would need to be reused, and closing would be a bad idea?
Maybe follow the OutputStream convention and add a close() as well as a flush() method. The flush method could finalize the submitted writes and be a no-op for the single writer case. close() would also call flush().