silx-kit / hdf5plugin

Set of compression filters for h5py

Home Page: http://www.silx.org/doc/hdf5plugin/latest/


LZ4 from Intel's IPP?

jonwright opened this issue

For the bitshuffle (LZ4) filter on Intel hardware, it seems the Intel IPP decompressor can be faster than the LZ4 that is currently bundled.

The licensing situation is not entirely clear to me, but I think Intel makes this code available as a binary blob in oneAPI and also on PyPI. From https://software.intel.com/content/www/us/en/develop/articles/oneapi-commercial-faq.html: "Yes, all of the oneAPI Toolkits are available for free download and use for commercial and non-commercial purposes". Blosc also has an option to build against IPP, so presumably it is possible.

One option might be to build against a special LZ4 rather than the bundled one. There is a recipe from Intel for building a patched LZ4 library that calls their routines instead, in $(intel)/oneapi/ipp/latest/components/components/interfaces/ipp_lz4/readme.html:

$ unzip lz4-1.9.3.zip
$ cd lz4-1.9.3
$ patch -p1 < ../lz4-1.9.3.patch.bin

Another method would be to redirect the calls to LZ4_decompress_fast and LZ4_decompress_safe to DecodeLZ4 from IPP. This means locating the include and lib folders for IPP when compiling. Doing something like this in setup.py:
extra_objects = [ 'path_to/libippdc.a', 'path_to/libippcore.a' ]
does not appear to introduce a new runtime dependency.
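
A minimal sketch of locating those objects at build time (assuming IPPROOT is set, e.g. by oneAPI's setvars.sh; the helper name and the lib/intel64 layout are illustrative, not existing hdf5plugin code):

import os

def get_ipp_objects():
    # Return IPP static libraries to pass as extra_objects, or [] if absent.
    ipp_root = os.environ.get('IPPROOT')
    if not ipp_root:
        return []
    libdir = os.path.join(ipp_root, 'lib', 'intel64')
    objects = [os.path.join(libdir, name)
               for name in ('libippdc.a', 'libippcore.a')]
    return objects if all(os.path.isfile(path) for path in objects) else []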

Before going much further into this, I guess you want to see a benchmark to find out if it makes any difference, and to check that it does not make things worse for AMD, etc.? Or maybe the binary blob already creates other problems?

Looks interesting.

As you say, it would need some benchmark since it would add some complexity to the compilation machinery and a dependency.

BTW, IPP is supported by c-blosc v2 but apparently not by c-blosc v1, which is the one used by the HDF5 filter, so it's not an option for the blosc filter.

IPP licensing should not be an issue: we won't bundle it or redistribute binary wheels including IPP anyway, as long as it is a build-from-source option.

Is this relevant again now that Blosc2 has been added to the zoo?

Good question. I just tried with git main, and hdf5plugin seems to be using the LZ4 that is inside c-blosc (1.9.3) rather than c-blosc2 (1.9.4), presumably because of:

lz4_dir = glob('src/c-blosc/internal-complibs/lz4*')[0]

Is there a flag to get it to use blosc2 instead? There is an orphan lz4 in bitshuffle/lz4 as well.
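
For illustration, an untested guess at how the glob could prefer the blosc2 copy (the src/c-blosc2 path is an assumption about the source layout):

from glob import glob

# Prefer the LZ4 bundled with c-blosc2 (1.9.4), falling back to the
# c-blosc (1.9.3) copy if the c-blosc2 sources are not present.
candidates = (glob('src/c-blosc2/internal-complibs/lz4*')
              + glob('src/c-blosc/internal-complibs/lz4*'))
lz4_dir = candidates[0]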

It looks like it should be possible, but reading setup.py, I guess you are replacing the blosc2 LZ4 that has the IPP option with the one from blosc that does not? I didn't see how to configure it to use IPP either (-DDEACTIVATE_IPP=OFF). Or is it building with CMake or Meson now instead of setup.py?

The reason this IPP thing looks interesting is the benefit of runtime dispatch: https://www.intel.com/content/www/us/en/develop/documentation/dev-guide-ipp-for-oneapi/top/ipp-theory-of-operation/dispatching.html

This is what I am currently seeing from hdf5plugin (main branch):

Thread 1 "python3" hit Breakpoint 1, LZ4_decompress_safe (source=source@entry=0x1555387bc35e "\217\b", dest=dest@entry=0xf17260 "", compressedSize=compressedSize@entry=272, 
    maxDecompressedSize=maxDecompressedSize@entry=8192) at src/c-blosc/internal-complibs/lz4-1.9.3/lz4.c:2172
2172    {
(gdb) bt
#0  LZ4_decompress_safe (source=source@entry=0x1555387bc35e "\217\b", dest=dest@entry=0xf17260 "", compressedSize=compressedSize@entry=272, 
    maxDecompressedSize=maxDecompressedSize@entry=8192) at src/c-blosc/internal-complibs/lz4-1.9.3/lz4.c:2172
#1  0x0000155539ad4d99 in bshuf_decompress_lz4_block (C_ptr=<optimized out>, size=size@entry=2048, elem_size=elem_size@entry=4, option=option@entry=0)
    at src/bitshuffle/src/bitshuffle.c:99
#2  0x0000155539ad53ec in bshuf_blocked_wrap_fun._omp_fn.0 () at src/bitshuffle/src/bitshuffle_core.c:1696
#3  0x0000155539a9a8e6 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
#4  0x0000155539ad80e7 in bshuf_blocked_wrap_fun (fun=fun@entry=0x155539ad4ce0 <bshuf_decompress_lz4_block>, in=<optimized out>, out=<optimized out>, size=4471016, 
    elem_size=4, block_size=2048, option=0) at src/bitshuffle/src/bitshuffle_core.c:1689
#5  0x0000155539ad5207 in bshuf_decompress_lz4 (in=<optimized out>, out=<optimized out>, size=<optimized out>, elem_size=<optimized out>, block_size=<optimized out>)
    at src/bitshuffle/src/bitshuffle.c:240
#6  0x0000155539ad4995 in bshuf_h5_filter (flags=<optimized out>, cd_nelmts=5, cd_values=0xf09350, nbytes=510867, buf_size=0x7fffffffc738, buf=0x7fffffffc728)
    at src/bitshuffle/src/bshuf_h5filter.c:143
#7  0x000015555435b843 in H5Z_pipeline () from /home/esrf/wright/.local/lib/python3.8/site-packages/h5py/../h5py.libs/libhdf5-fc7245dc.so.200.2.0
#8  0x00001555540f29eb in H5D__chunk_lock () from /home/esrf/wright/.local/lib/python3.8/site-packages/h5py/../h5py.libs/libhdf5-fc7245dc.so.200.2.0
#9  0x00001555540f3b07 in H5D__chunk_read.part.20 () from /home/esrf/wright/.local/lib/python3.8/site-packages/h5py/../h5py.libs/libhdf5-fc7245dc.so.200.2.0
#10 0x00001555541118ff in H5D__read () from /home/esrf/wright/.local/lib/python3.8/site-packages/h5py/../h5py.libs/libhdf5-fc7245dc.so.200.2.0
#11 0x000015555434d609 in H5VL__native_dataset_read () from /home/esrf/wright/.local/lib/python3.8/site-packages/h5py/../h5py.libs/libhdf5-fc7245dc.so.200.2.0
#12 0x0000155554337fcd in H5VL_dataset_read () from /home/esrf/wright/.local/lib/python3.8/site-packages/h5py/../h5py.libs/libhdf5-fc7245dc.so.200.2.0
#13 0x00001555541109df in H5Dread () from /home/esrf/wright/.local/lib/python3.8/site-packages/h5py/../h5py.libs/libhdf5-fc7245dc.so.200.2.0
#14 0x000015553ca3fd17 in __pyx_f_4h5py_4defs_H5Dread () at /project/h5py/defs.c:8526
#15 0x0000155539be35c3 in __pyx_pf_4h5py_9_selector_6Reader_2read () at /project/h5py/_selector.c:7527

The idea we have in mind is to use the same LZ4 (built as a static library) to build all the plugins.
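
As a hedged sketch of that idea (mirroring the existing zstd static-library pattern visible in setup.py; the lz4_clib name and the chosen source directory are illustrative, not actual hdf5plugin code):

import os
from glob import glob

# Build LZ4 once as a static library (setuptools build_clib style) and
# link the same objects into every plugin. Illustrative only.
lz4_dir = glob('src/c-blosc2/internal-complibs/lz4*')[0]
lz4_clib = ('lz4', {
    'sources': glob(os.path.join(lz4_dir, '*.c')),
    'include_dirs': [lz4_dir],
})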

I guess we'll switch to the newest one.

Any comment from you @t20100? Adding an environment variable to configure IPP should not be a big problem.

Yes, we should switch to using the blosc2 internal-complibs, and thus the latest LZ4.
We should also consider using zlib-ng instead of zlib.
It's just not done yet.

Adding an environment variable to configure IPP should not be a big problem.

We already keep adding env vars for every instruction set, as there is a strong tendency to use more and more assembly code in compressors... so I don't see an issue with adding one more.
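
For illustration, the same convention could give an IPP switch (the name HDF5PLUGIN_IPP is hypothetical; it does not exist in hdf5plugin):

import os

def get_env_flag(name, default='False'):
    # Read a boolean build option from the environment, HDF5PLUGIN_* style.
    return os.environ.get(name, default) == 'True'

# Hypothetical new opt-in option; disabled unless explicitly set to "True".
USE_IPP = get_env_flag('HDF5PLUGIN_IPP')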

@jonwright

A quick-and-dirty try from your side would be to copy the LZ4 sources from blosc2 to blosc and add the missing define.

If you need help, I am fairly free this afternoon.

From a quick look at it, it's not the LZ4 lib that is IPP-enabled, it's blosc2.

Then it should be a matter of changing:

def get_blosc2_plugin():

by adding:

 define_macros.append(("HAVE_IPP", 1))

and:

libraries=[<IPP_LIBS>],

in here:

hdf5plugin/setup.py

Lines 827 to 838 in 1d93b5b

return HDF5PluginExtension(
    "hdf5plugin.plugins.libh5blosc2",
    sources=sources + \
        prefix(hdf5_blosc2_dir, ['blosc2_filter.c', 'blosc2_plugin.c']),
    extra_objects=get_zstd_clib('extra_objects'),
    include_dirs=include_dirs + [hdf5_blosc2_dir],
    define_macros=define_macros,
    extra_compile_args=extra_compile_args,
    extra_link_args=extra_link_args,
    sse2=sse2_kwargs,
    avx2=avx2_kwargs,
)
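
For instance, a rough sketch of the patched call (same names as in the snippet above; the libraries list is an assumption based on what c-blosc2's IPP build links against, and it assumes HDF5PluginExtension forwards libraries= to the underlying Extension):

define_macros.append(("HAVE_IPP", 1))

return HDF5PluginExtension(
    "hdf5plugin.plugins.libh5blosc2",
    sources=sources + \
        prefix(hdf5_blosc2_dir, ['blosc2_filter.c', 'blosc2_plugin.c']),
    extra_objects=get_zstd_clib('extra_objects'),
    include_dirs=include_dirs + [hdf5_blosc2_dir],
    define_macros=define_macros,
    extra_compile_args=extra_compile_args,
    extra_link_args=extra_link_args,
    libraries=['ippdc', 'ipps', 'ippcore'],  # assumed IPP link set
    sse2=sse2_kwargs,
    avx2=avx2_kwargs,
)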

I'm wondering if an alternative could be a thin wrapper around IPP exposing the very same API as LZ4.
Then it could be used for all LZ4-based filters, including blosc1 and bitshuffle.

For compatibility wrappers, Intel suggested patching LZ4 (recipe in the first comment of this issue). You could have folders lz4 and lz4_patched and then choose which one to compile/link against.
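
A sketch of how that choice could look in setup.py (the folder names and the HDF5PLUGIN_IPP variable are illustrative, not the current layout):

import os
from glob import glob

# Hypothetical layout: src/lz4 holds vanilla LZ4 sources, src/lz4_patched a
# copy with Intel's IPP patch applied; pick one at build time.
use_ipp = os.environ.get('HDF5PLUGIN_IPP', 'False') == 'True'
lz4_dir = 'src/lz4_patched' if use_ipp else 'src/lz4'
lz4_sources = glob(os.path.join(lz4_dir, '*.c'))
lz4_include_dirs = [lz4_dir]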

I'm looking at adding optional IPP support.

FYI, here is a benchmark of IPP vs LZ4 made with c-blosc2: Blosc/c-blosc2#313