Make compression filters fail when compressed size exceeds size of raw input
hernot opened this issue
Make compression filters, and filters with embedded compression enabled, fail when the compressed size exceeds the size of the raw input data. This would, for example, allow the bitshuffle filter with lz4 compression enabled to behave conformant to the description of the `H5Z_FLAG_OPTIONAL` flag in the [Defining and Querying the Filter Pipeline](https://docs.hdfgroup.org/hdf5/develop/_f_i_l_t_e_r.html) section of the libhdf5 manual:
> **Values for flags** | **Description**
>
> `H5Z_FLAG_OPTIONAL` | If this bit is set then the filter is optional. If the filter fails (see below) during an H5Dwrite() operation then the filter is just excluded from the pipeline for the chunk for which it failed; the filter will not participate in the pipeline during an H5Dread() of the chunk. This is commonly used for compression filters: if the compression result would be larger than the input then the compression filter returns failure and the uncompressed data is stored in the file. If this bit is clear and a filter fails then the H5Dwrite() or H5Dread() also fails.
At least for me it would be the more natural, and thus expected, behaviour that data is only stored compressed when compression actually yields a size benefit. Furthermore, I do not consider it the application's task to decide whether to compress a dataset based on a wild guess as to whether the data will be compressible and thus likely need fewer bytes in compressed form than in its uncompressed representation. This decision can only be made by actually compressing the data and checking whether the extra bytes necessary for headers, housekeeping, code tables and other required bits leave the resulting chunk smaller than the input or cause it to exceed the input size.
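The decision described above can be sketched in plain Python. This is a minimal model only, using zlib from the standard library in place of bitshuffle + lz4; `store_chunk` is a hypothetical helper, not part of any HDF5 API:

```python
import zlib

def store_chunk(raw: bytes):
    """Try to compress a chunk; fall back to the raw bytes when
    compression would expand it (the H5Z_FLAG_OPTIONAL behaviour).
    Returns (payload, was_compressed)."""
    compressed = zlib.compress(raw)
    if len(compressed) < len(raw):
        return compressed, True   # compression pays off
    return raw, False             # "filter failure": keep the raw data

# An 8-byte payload (two float32 values) cannot amortise the
# compression header, so the raw bytes are kept.
payload, was_compressed = store_chunk(b"\x00\x00\x00\x40\x00\x00\x40\x40")
```

Only after running the compressor do we know which branch applies: a 4 KiB block of zeros shrinks and is stored compressed, while the tiny 8-byte chunk is kept raw.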
An example: the array `a = np.array([2, 3], dtype=np.float32)` covers exactly 8 bytes in uncompressed form. When compressed with bitshuffle + lz4 it ends up in the HDF5 file with the following storage layout, as reported by h5dump:
```
STORAGE_LAYOUT {
   CHUNKED ( 2 )
   SIZE 20 (0.400:1 COMPRESSION)
}
```
If I read that correctly, this means the compressed array is expanded by a factor of 2.5 to 20 bytes, so one raw byte covers 2.5 bytes of compressed output. Even though this example is very artificial, it indicates that any data which is poorly compressible, even with a prepended bitshuffle filter, will increase the size of the HDF5 file instead of reducing it at least slightly, independent of its actual size.
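The numbers in the h5dump output can be checked with a line of arithmetic: two float32 values occupy 8 raw bytes, and a 20-byte stored chunk gives exactly the 0.400:1 ratio h5dump reports, i.e. a 2.5× expansion:

```python
raw_bytes = 2 * 4               # two float32 values
stored_bytes = 20               # SIZE from the h5dump STORAGE_LAYOUT
ratio = raw_bytes / stored_bytes        # h5dump's "0.400:1 COMPRESSION"
expansion = stored_bytes / raw_bytes    # 2.5x growth instead of shrinkage
```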
This seems to be managed outside of the compression filters when calling HDF5 to store compressed data.
When using h5py, the `H5Z_FLAG_OPTIONAL` flag is set by default when using compression: see the doc and here in the code.
In the example you provided, most of the difference in file size can be explained by whether the dataset is chunked or not:
```python
import hdf5plugin, h5py, numpy

data = numpy.array([2, 3], dtype=numpy.float32)

with h5py.File('chunked_uncompressed.h5', 'w') as f:
    f.create_dataset('data', data=data, chunks=(2,))
with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('data', data=data, compression=hdf5plugin.Bitshuffle())
```
File size:

- `chunked_uncompressed.h5`: 3504 bytes
- `compressed.h5`: 3516 bytes
and the remaining bytes can be explained by the additional filter information to store for the compressed data.
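For completeness, the remaining overhead is the difference between the two listed file sizes:

```python
chunked_uncompressed = 3504   # bytes, from the file listing above
compressed = 3516             # bytes
overhead = compressed - chunked_uncompressed   # filter metadata overhead
```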
For the gzip filter it has to be said that zlib supports compressing chunk by chunk, allowing to start with a small buffer and expand it when more space is needed to store the compressed data. This is used by libhdf5 to determine whether it is more efficient to store the gzip-compressed copy of the data or the raw data: it allocates for the output a buffer which has the same size as the input. The actual behaviour is implemented inside the gzip filter, handled by the `H5Z_filter_deflate` function, as can be seen in `H5Zdeflate.c` in the libhdf5 sources from line 155 downwards. There `nbytes` is allocated as the output buffer, and when zlib's `compress2` returns `Z_BUF_ERROR`, compression is aborted because `nbytes` would be exceeded.
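That write-path logic can be modelled in a few lines of Python. This is a simplified sketch, not the actual C implementation: the real filter hands `compress2` a buffer of `nbytes` and reacts to `Z_BUF_ERROR`, which is emulated here by a length comparison after compressing:

```python
import zlib

def deflate_filter(raw: bytes, level: int = 6):
    """Simplified model of H5Z_filter_deflate's write path: the output
    budget is len(raw) bytes, and the filter fails (returns None) when
    the compressed result would not fit, just as compress2 signals
    Z_BUF_ERROR on a too-small output buffer."""
    out = zlib.compress(raw, level)
    if len(out) >= len(raw):
        return None   # optional filter fails; HDF5 stores the raw chunk
    return out
```

A compressible 4 KiB chunk of zeros passes the filter, while a 2-byte chunk makes it "fail" and leaves the raw data to be stored.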
In this light, a few thoughts.
As far as I understand (please correct me if I'm wrong), bit-shuffling per se, as in

```python
with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('data', data=data, compression=hdf5plugin.Bitshuffle(lz4=False))
```

is just a method to preprocess data such that it is more likely to be compressed very efficiently by a compression filter, for example lz4. Thus the BitShuffle filter with `lz4=False` just bit-shuffles but does not compress the data. In that case it is perfectly fine when the output contains a few extra housekeeping bytes in addition to the input. But when
```python
with h5py.File('compressed.h5', 'w') as f:
    f.create_dataset('data', data=data, compression=hdf5plugin.Bitshuffle(lz4=True))
```

is used, which is the default, then I would expect the output to be smaller than the input. If I am not mixing things up, lz4 is a quite efficient and powerful compression algorithm, especially when it can benefit from data preprocessed by, for example, the BitShuffle filter. And when the data stored in the h5 file (excluding metadata and attributes) covers 20 bytes for 8 bytes worth of input, that is not quite what I would expect. At least on a classical disk or virtual filesystem, when I find that the gzipped copy of a file requires more space than the uncompressed file, I rather tend to keep the raw file and discard its gzipped copy.
So if you see any possibility to make, for example, `Bitshuffle(lz4=True)` fail with some "compressed output exceeds input" error when `noutbytes + 12 > size * elementsize`, then the result would be more like what is expected from the description of the `H5Z_FLAG_OPTIONAL` flag.
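The proposed condition could look like the following sketch. The helper name is hypothetical, and the 12-byte constant is the header overhead assumed in this issue, not a value taken from the bitshuffle sources:

```python
def compressed_output_acceptable(noutbytes: int, size: int, elementsize: int,
                                 header_bytes: int = 12) -> bool:
    """Return False when the compressed chunk (payload plus assumed
    header) would exceed the raw input size, i.e. when an optional
    filter should report failure so the raw chunk is stored instead."""
    return noutbytes + header_bytes <= size * elementsize
```

For the 8-byte example above (2 elements of 4 bytes, 8 bytes of lz4 payload plus a 12-byte header, i.e. 20 bytes stored) the check reports failure, while a chunk of 1000 float32 values compressed to 500 bytes passes.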
Hi,
First of all, hdf5plugin is a packaging project: it aims at providing already existing HDF5 compression filters readily available for h5py. We also aim at providing those filters unmodified in order to avoid incompatibilities.
So issues like the one you are raising are best discussed on those filters' own projects.
The list of embedded filters is available here, including the bitshuffle filter.
And when a compression filter makes a new release, we update it in hdf5plugin.
BTW, other filters like blosc look to care about this:
`hdf5plugin/src/hdf5-blosc/src/blosc_filter.c`, lines 197 to 203 at commit 505aad8.
Thank you very much, I will do that. I just wanted to cross-check with you first.