areaDetector / ADViewers

ImageJ, Python, and IDL viewers for areaDetector detectors and cameras

Home Page: https://areadetector.github.io/master/ADViewers/ad_viewers.html

NTNDArray Blosc compression byte order

mrkraimer opened this issue

The NDPluginCodec supports scalar arrays of all numeric types: int8, uint8, ..., int64, uint64.
In all cases the compressed array has type byte (which is the same as int8).
For all types except int8 and uint8, if client and server have different byte order, then the byte order must be switched by either the client or the server.
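
As a minimal Java illustration (not taken from NDPluginCodec) of why byte order matters for multi-byte types: the same 16-bit value serializes to different byte sequences depending on endianness.

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class EndianDemo {
        public static void main(String[] args) {
            short value = 0x1234;
            ByteBuffer big = ByteBuffer.allocate(2).order(ByteOrder.BIG_ENDIAN);
            ByteBuffer little = ByteBuffer.allocate(2).order(ByteOrder.LITTLE_ENDIAN);
            big.putShort(value);
            little.putShort(value);
            // Prints "big:    12 34" and "little: 34 12" -- the same value,
            // two different byte sequences on the wire.
            System.out.printf("big:    %02x %02x%n", big.get(0), big.get(1));
            System.out.printf("little: %02x %02x%n", little.get(0), little.get(1));
        }
    }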

Let's assume the server compresses and the client decompresses.
Then it should be the client that switches byte order after decompression.
In order to do this the NTNDArray.codec.attribute structure must have fields like:

    bool serverByteOrderBigEndian
    bool clientByteOrderBigEndian

If these differ, then the client must switch byte order.
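
A sketch of how a Java client might use such fields (the field names above are a proposal, not part of NTNDArray today, and swapBytes stands in for whatever swap routine the client uses):

    import java.nio.ByteOrder;

    // serverByteOrderBigEndian would come from the proposed
    // codec.attribute fields; it does not exist in NTNDArray today.
    boolean clientByteOrderBigEndian =
            ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN;
    if (serverByteOrderBigEndian != clientByteOrderBigEndian) {
        swapBytes(decompressed, bytesPerElement); // hypothetical helper
    }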

In the C code for blosc there are methods:

int blosc_compress(int clevel, int doshuffle, size_t typesize,
                   size_t nbytes, const void *src, void *dest,
                   size_t destsize);

and

BLOSC_EXPORT int blosc_decompress(const void *src, void *dest, size_t destsize);

The doshuffle argument can be one of

 #define BLOSC_NOSHUFFLE   0  /* no shuffle */
 #define BLOSC_SHUFFLE     1  /* byte-wise shuffle */
 #define BLOSC_BITSHUFFLE  2  /* bit-wise shuffle */

I think that BLOSC_SHUFFLE just means switch byte order.

Only compress has an argument to switch byte order.

But we want the client to switch the byte order.

There is also a method:

void shuffle(const size_t bytesoftype, const size_t blocksize,
             const uint8_t* _src, const uint8_t* _dest);

The client can call this if byte order needs to be changed.

BUT the Java blosc code does not provide this method.

I think that BLOSC_SHUFFLE just means switch byte order.

No, that is not correct. BLOSC_SHUFFLE is an additional operation to improve compression. It is exposed in NDPluginCodec, and selecting BLOSC_SHUFFLE or BLOSC_BITSHUFFLE can greatly improve the compression.
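
For intuition, here is a minimal Java sketch (not the Blosc implementation) of what a byte-wise shuffle does: it groups the k-th byte of every element together so that similar bytes become adjacent and compress better. It does not reverse the byte order within an element.

    // Byte-wise shuffle: dest holds all first bytes of each element,
    // then all second bytes, and so on. This is a compression filter,
    // not a byte-order swap.
    static byte[] shuffle(byte[] src, int typeSize) {
        int numElements = src.length / typeSize;
        byte[] dest = new byte[src.length];
        for (int elem = 0; elem < numElements; elem++) {
            for (int b = 0; b < typeSize; b++) {
                dest[b * numElements + elem] = src[elem * typeSize + b];
            }
        }
        return dest;
    }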

NDPluginCodec passes the shuffle argument on compression:

    int compSize = blosc_compress_ctx(clevel, shuffle, info.bytesPerElement,
            info.totalBytes, input->pData, output->pData, output->dataSize,
            compname, blockSize, numThreads);

But it is transparent on decompression, because the shuffle is encoded in the byte stream:

    int ret = blosc_decompress_ctx(input->pData, output->pData,
            output->dataSize, numThreads);

I suspect the Blosc compressor assumes the input is in the native byte order of the host and compresses it into a well-defined stream of bytes that is identical whether the host is big-endian or little-endian. The decompressor knows the datatype of what it is decompressing and converts to the native endianness of the machine doing the decompressing. One reason I think this is that Blosc is widely used for compressing data in file formats like HDF5, which are commonly written and read on machines with different endianness; they need to make it transparent.

OK thanks for this information.
I will close this issue.

I'm not saying I am certain of this, so it is definitely worth testing.

I don't have a big-endian machine to test with, because mine all run vxWorks which does not support Blosc compression.

We could encode the byte order in the codec.params NTNDArray field. Currently NDPluginCodec puts a PVInt32 there to indicate the original datatype, but there's nothing preventing us from setting a more complex structure as params.

Does anyone have a big-endian machine we can test with? I am not sure there is really a problem.

Here are 2 notes from the Blosc release notes: https://github.com/Blosc/c-blosc/blob/master/RELEASE_NOTES.rst

Changes from 1.11.2 to 1.11.3
Fixed #181: bitshuffle filter for big endian machines.

Changes from 0.9.3 to 0.9.4
Support for cross-platform big/little endian compatibility in Blosc headers has been added.

But what does the following mean?

Support for cross-platform big/little endian compatibility in Blosc headers has been added.

I think that it only means that Blosc handles byte order in its private header fields.
I looked briefly at the source code and that appears to be what it is doing.
Sounds like a test is required.
Sorry, I do not have access to two systems with different byte orders.

Note that if we have to switch byte order, java.nio.ByteBuffer has these methods:

public final ByteOrder order()
Retrieves this buffer's byte order.
The byte order is used when reading or writing multibyte values, 
and when creating buffers that are views of this byte buffer.
The order of a newly-created byte buffer is always BIG_ENDIAN.

and

public final ByteBuffer order(ByteOrder bo)
Modifies this buffer's byte order.
Parameters:
bo - The new byte order, either BIG_ENDIAN or LITTLE_ENDIAN
Returns:
This buffer
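
For example, a client could wrap the decompressed bytes in a ByteBuffer set to the sender's byte order and let the buffer do any required swap. A sketch, assuming 16-bit pixel data:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class SwapDemo {
        // Interpret raw bytes as shorts using the sender's byte order;
        // ByteBuffer performs the swap when the orders differ.
        static short[] toShorts(byte[] raw, ByteOrder senderOrder) {
            ByteBuffer buf = ByteBuffer.wrap(raw).order(senderOrder);
            short[] out = new short[raw.length / 2];
            buf.asShortBuffer().get(out);
            return out;
        }

        public static void main(String[] args) {
            byte[] raw = {0x12, 0x34}; // as sent by a big-endian server
            System.out.printf("read big-endian:    0x%04x%n",
                    toShorts(raw, ByteOrder.BIG_ENDIAN)[0]);    // 0x1234
            System.out.printf("read little-endian: 0x%04x%n",
                    toShorts(raw, ByteOrder.LITTLE_ENDIAN)[0]); // 0x3412
        }
    }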