Make compression filters fail when compressed size exceeds size of raw input
hernot opened this issue
Make compression filters, and filters with embedded compression enabled, fail when the compressed size exceeds the size of the raw input data. This would, for example, allow the bitshuffle filter with lz4 compression enabled to behave conformant to the description of the `H5Z_FLAG_OPTIONAL` flag in the [Defining and Querying the Filter Pipeline](https://docs.hdfgroup.org/hdf5/develop/_f_i_l_t_e_r.html) section of the libhdf5 manual:
> **Values for flags** | **Description**
>
> `H5Z_FLAG_OPTIONAL` | If this bit is set then the filter is optional. If the filter fails (see below) during an H5Dwrite() operation then the filter is just excluded from the pipeline for the chunk for which it failed; the filter will not participate in the pipeline during an H5Dread() of the chunk. This is commonly used for compression filters: if the compression result would be larger than the input then the compression filter returns failure and the uncompressed data is stored in the file. If this bit is clear and a filter fails then the H5Dwrite() or H5Dread() also fails.
At least for me it would be the more natural, and thus expected, behaviour that data is only stored compressed when compression actually yields a size benefit. Furthermore, I do not consider it the application's task to decide whether to compress a dataset based on a wild guess as to whether the data will be compressible and thus likely need fewer bytes in compressed form than in its uncompressed representation. This decision can only be made by actually compressing the data and checking whether the extra bytes necessary for headers, housekeeping, code tables and other required bits leave the resulting chunk smaller than the input or cause it to exceed the input size.
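The decision described above can be sketched in plain Python. This is a minimal model only, using zlib from the standard library in place of bitshuffle + lz4; `store_chunk` is a hypothetical helper, not part of any HDF5 API:

```python
import zlib

def store_chunk(raw: bytes):
    """Try to compress a chunk; fall back to the raw bytes when
    compression would expand it (the H5Z_FLAG_OPTIONAL behaviour).
    Returns (payload, was_compressed)."""
    compressed = zlib.compress(raw)
    if len(compressed) < len(raw):
        return compressed, True   # compression pays off
    return raw, False             # "filter failure": keep the raw data

# An 8-byte payload (two float32 values) cannot amortise the
# compression header, so the raw bytes are kept.
payload, was_compressed = store_chunk(b"\x00\x00\x00\x40\x00\x00\x40\x40")
```

Only after running the compressor do we know which branch applies: a 4 KiB block of zeros shrinks and is stored compressed, while the tiny 8-byte chunk is kept raw.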
An example: the array `a = np.array([2, 3], dtype=np.float32)` covers exactly 8 bytes in uncompressed form. When compressed with bitshuffle + lz4 it ends up in the HDF5 file with the following storage layout, as reported by h5dump:
```
STORAGE_LAYOUT {
   CHUNKED ( 2 )
   SIZE 20 (0.400:1 COMPRESSION)
}
```
If I read that correctly, this means the compressed array is expanded by a factor of 2.5 to 20 bytes, so one raw byte covers 2.5 bytes of compressed output. Even though this example is very artificial, it indicates that any data which is poorly compressible, even with a prepended bitshuffle filter, will increase the size of the HDF5 file instead of reducing it at least slightly, independent of its actual size.
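The numbers in the h5dump output can be checked with a line of arithmetic: two float32 values occupy 8 raw bytes, and a 20-byte stored chunk gives exactly the 0.400:1 ratio h5dump reports, i.e. a 2.5× expansion:

```python
raw_bytes = 2 * 4               # two float32 values
stored_bytes = 20               # SIZE from the h5dump STORAGE_LAYOUT
ratio = raw_bytes / stored_bytes        # h5dump's "0.400:1 COMPRESSION"
expansion = stored_bytes / raw_bytes    # 2.5x growth instead of shrinkage
```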
This seems to be managed outside of the compression filters when calling HDF5 to store compressed data.
When using h5py, the `H5Z_FLAG_OPTIONAL` flag is set by default when using compression: see the doc and here in the code.
In the example you provided, most of the difference in file size can be explained by whether the dataset is chunked or not:
```python
import hdf5plugin, h5py, numpy

data = numpy.array([2, 3], dtype=numpy.float32)

with h5py.File('chunked_uncompressed.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=(2,))
with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('data', data=data, compression=hdf5plugin.Bitshuffle())
```
File size:

- `chunked_uncompressed.h5`: 3504 bytes
- `compressed.h5`: 3516 bytes
and the remaining bytes can be explained by the additional filter information to store for the compressed data.
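For completeness, the remaining overhead is the difference between the two listed file sizes:

```python
chunked_uncompressed = 3504   # bytes, from the file listing above
compressed = 3516             # bytes
overhead = compressed - chunked_uncompressed   # filter metadata overhead
```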
For the gzip filter it has to be said that zlib supports compressing chunk by chunk, allowing to start with a small buffer and expand it when more space is needed to store the compressed data. This is used by libhdf5 to determine whether it is more efficient to store the gzip-compressed copy of the data or the raw data: it allocates for the output a buffer which has the same size as the input. The actual behaviour is implemented inside the gzip filter, handled by the `H5Z_filter_deflate` function, as can be seen in `H5Zdeflate.c` in the libhdf5 sources from line 155 downwards. There `nbytes` is allocated as the output buffer, and when zlib's `compress2` returns `Z_BUF_ERROR`, compression is aborted because `nbytes` would be exceeded.
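That write-path logic can be modelled in a few lines of Python. This is a simplified sketch, not the actual C implementation: the real filter hands `compress2` a buffer of `nbytes` and reacts to `Z_BUF_ERROR`, which is emulated here by a length comparison after compressing:

```python
import zlib

def deflate_filter(raw: bytes, level: int = 6):
    """Simplified model of H5Z_filter_deflate's write path: the output
    budget is len(raw) bytes, and the filter fails (returns None) when
    the compressed result would not fit, just as compress2 signals
    Z_BUF_ERROR on a too-small output buffer."""
    out = zlib.compress(raw, level)
    if len(out) >= len(raw):
        return None   # optional filter fails; HDF5 stores the raw chunk
    return out
```

A compressible 4 KiB chunk of zeros passes the filter, while a 2-byte chunk makes it "fail" and leaves the raw data to be stored.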
In this light, a few thoughts.
As far as I understand (please correct me if I'm wrong), bit-shuffling per se, as in

```python
with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('data', data=data, compression=hdf5plugin.Bitshuffle(lz4=False))
```

is just a method to preprocess data such that it is more likely to be compressed very efficiently by a compression filter, for example lz4. Thus the BitShuffle filter with `lz4=False` just bit-shuffles but does not compress the data. In that case it is perfectly fine when the output contains a few extra housekeeping bytes in addition to the input. But when
```python
with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('data', data=data, compression=hdf5plugin.Bitshuffle(lz4=True))
```

is used, which is the default, then I would expect the output to be smaller than the input. If I am not mixing things up, lz4 is a quite efficient and powerful compression algorithm, especially when it can benefit from data preprocessed by, for example, the BitShuffle filter. And when the data stored in the h5 file (excluding metadata and attributes) covers 20 bytes for 8 bytes worth of input, that is not quite what I would expect. At least on a classical disk or virtual filesystem, when I find that the gzipped copy of a file requires more space than the uncompressed file, I rather tend to keep the raw file and discard its gzipped copy.
So if you see any possibility to make, for example, `Bitshuffle(lz4=True)` fail with some "compressed output exceeds input" error when `noutbytes + 12 > size * elementsize`, then the result would be more like what is expected from the description of the `H5Z_FLAG_OPTIONAL` flag.
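The proposed condition could look like the following sketch. The helper name is hypothetical, and the 12-byte constant is the header overhead assumed in this issue, not a value taken from the bitshuffle sources:

```python
def compressed_output_acceptable(noutbytes: int, size: int, elementsize: int,
                                 header_bytes: int = 12) -> bool:
    """Return False when the compressed chunk (payload plus assumed
    header) would exceed the raw input size, i.e. when an optional
    filter should report failure so the raw chunk is stored instead."""
    return noutbytes + header_bytes <= size * elementsize
```

For the 8-byte example above (2 elements of 4 bytes, 8 bytes of lz4 payload plus a 12-byte header, i.e. 20 bytes stored) the check reports failure, while a chunk of 1000 float32 values compressed to 500 bytes passes.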
Hi,
First of all, hdf5plugin is a packaging project: it aims at providing already existing HDF5 compression filters readily available for h5py. We also aim at providing those filters unmodified in order to avoid incompatibilities.
So issues like the one you are raising are best discussed on those filters' own projects.
The list of embedded filters is available here, including the bitshuffle filter.
And when a compression filter makes a new release, we update it in hdf5plugin.
BTW, other filters like blosc look to care about this:
`hdf5plugin/src/hdf5-blosc/src/blosc_filter.c`, lines 197 to 203 at commit 505aad8.
Thank you very much, I will do that. I just wanted to cross-check with you first.