Make concurrent writer part of the API
ennerf opened this issue · comments
Below are some thoughts regarding making concurrent writes part of the supported API. Overall, there are 4 different cases when writing MAT files.
- uncompressed: all sizes are known beforehand, and all entries can be streamed into the final output. This is mostly limited by memory, so multiple threads wouldn't help
- single threaded compression with seek support (default): entries get compressed into the root sink and the size gets updated after the data is added.
- single threaded compression without seek support: entries need to be compressed into a temporary buffer so that the size can be added without seeking. Does not support writing subsystem offsets.
- concurrent compression: entries are compressed into temporary buffers and then combined into a single file. Requires seeking when writing subsystem offsets.
Currently, cases 1 & 2 are implemented by Mat5Writer
, and 4 is implemented by the experimental ConcurrentMat5Writer
. It'd be good to add official support for both.
Case 3 is currently not supported. However, I'm not sure whether writing to a non-seeking root sink is a common enough use case to add it. I expect the output in >99% of the cases to be memory (supported by ByteBuffer) or a file (supported by streaming and mapped files). The only other use case I can think of is streaming a mat file into e.g. a TCP stream, but for that we could probably write to memory or file first. Additionally, the same behavior can be achieved by resetting a buffer and writing individual entries into a single ByteBuffer sink.
Combining 4 with 1 & 2 gets a bit weird in case users enable/disable compression when doing incremental writes. Writing uncompressed data directly would change the order of the entries. This could be fixed by
- set compression level in factory method and remove setter from the API: changing compression level is likely a very little used feature. This would also allow the deflating sink to be reused (single threaded case) by clearing the deflator state
- write uncompressed into temporary buffer: wastes a buffer copy that isn't needed
- write uncompressed in finalization step
- write uncompressed immediately, and ignore that some entries may potentially be re-ordered (although the Subsystem still needs to be at the end)
If the compression level is fixed in the constructor, Mat5Writer
could become an abstract class and uncompressed
/ compressed
/ concurrent
could be different implementations. Would that make sense, and what should the API look like? The concurrent writer would need an additional step for combining the temporary buffers into a single file.
- Would this make sense in the
close()
method? - Should users be required to always call
close()
even on single threaded writers? Would there be a case where the sink would need to be reused, and closing would be a bad idea? - Maybe follow the OutputStream convention and add a
close()
as well as aflush()
method. Theflush
method could finalize the submitted writes and be a no-op for the single writer case.close()
would also callflush()
.