DFO-CHS-Dynamic-Hydrographic-Products / IWLS_pygeoapi

pygeoapi plugins to access and process water level and surface currents from the IWLS public API

Default behaviour of variable length strings

princessmittens opened this issue · comments

After more reading: by default, h5py reads variable-length strings in datasets as byte strings. This is something we can look into further, but I am not completely sure why byte strings are the default (compressibility maybe?). Given that this is the default, perhaps we should keep it as is and use dataset.asstr() (see the docs excerpt below) to parse it. I have tested this and it works for the compound types as well.

From the h5py docs on strings:

String data in HDF5 datasets is read as bytes by default: bytes objects for variable-length strings, or numpy bytes arrays ('S' dtypes) for fixed-length strings. Use Dataset.asstr() to retrieve str objects.
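To make the quoted behaviour concrete, here is a minimal sketch, assuming h5py 3.x is installed. The file name, the dataset name "names", and the helper roundtrip_strings are made up for illustration and are not taken from the IWLS_pygeoapi code. It writes variable-length strings to an in-memory file and reads them back both ways.

```python
import h5py


def roundtrip_strings(values):
    """Write variable-length strings to an in-memory HDF5 file and read them back.

    Returns (raw, decoded): the default read and the read through asstr().
    The helper and dataset names are hypothetical, for illustration only.
    """
    # driver="core" with backing_store=False keeps the file in memory,
    # so nothing is written to disk.
    with h5py.File("strings_demo.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset(
            "names",
            data=values,
            dtype=h5py.string_dtype(encoding="utf-8"),  # variable-length UTF-8
        )
        raw = list(dset[:])              # default read in h5py 3.x: bytes objects
        decoded = list(dset.asstr()[:])  # asstr() wrapper: str objects
    return raw, decoded


raw, decoded = roundtrip_strings(["Halifax", "Québec"])
print(raw)
print(decoded)
```

Under h5py 3.x the first list should hold bytes objects and the second str objects, so tests that compare values read from a file need either asstr() or an explicit decode.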

Their reasoning for switching to byte strings as the default in h5py 3.x is that they can't guarantee that strings have correct UTF-8 encoding, but I don't think it actually affects anything in the production code. We only modify or create datasets; we never need to read them. It is probably something to keep in mind for tests.

The default string type returned when reading HDF5 files depends on the version of the h5py library used. There are no changes required on our end.