jamesmudd / jhdf

A pure Java HDF5 library

Home Page: http://jhdf.io

Global heap does not support reuse of elements

JCzogalla opened this issue

Describe the bug
When a global heap index is referenced more than once in one dataset, the second reference resolves to the empty string. The global heap object keeps its data as a byte buffer, and after the buffer has been read once its position is at the limit, so the second read yields an empty string.
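The effect can be reproduced outside jhdf with a plain `ByteBuffer`. The following is a minimal sketch, assuming the string is decoded with `Charset.decode`, which consumes the buffer; the class and method names are hypothetical and not part of jhdf.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferReuseDemo {

    // Decodes everything between the buffer's position and its limit.
    // Side effect: the buffer's position is advanced to the limit.
    public static String decodeConsuming(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer).toString();
    }

    public static void main(String[] args) {
        // Stand-in for the data held by one global heap object (assumption).
        ByteBuffer heapObject = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));

        String first = decodeConsuming(heapObject);   // "hello"
        String second = decodeConsuming(heapObject);  // "" because nothing remains

        System.out.println("first=\"" + first + "\" second=\"" + second + "\"");
    }
}
```

The second call finds no remaining bytes, which matches the empty strings seen for reused heap references.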

To Reproduce
Use the attached file: var-length-strings-reused.zip
HDFView shows that values are reused multiple times and there are no empty strings. With jhdf, each value is present only once and the remaining values are empty.

Expected behaviour
The output from jhdf should match the output from HDFView and resolve all global heap references accordingly.

Please complete the following information:

  • jhdf version: 0.4.8
  • Java version: 1.8
  • Stack trace/problem site: VariableLengthDatasetReader, l. 59 ff

Additional context
We see four possible fixes:

  1. Reset the byte buffer after decoding. Simplest, but also slow (multiple decodings).
  2. Store the data as a byte array instead of a buffer. Still has the limitations of 1).
  3. Decode to a string directly when reading the global heap, since the type/charset does not change within one dataset.
  4. Lazy decoding: decode once and store the decoded value in the object, getting rid of the byte buffer in the process. Needs more logic in the VariableLengthDatasetReader, but keeps the decoding to a minimum.
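As an illustration of option 4, here is a rough sketch of a holder that decodes once and caches the result. It assumes a single charset per object, as stated in option 3; `LazyDecodedObject` and `asString` are hypothetical names, not jhdf classes or methods.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LazyDecodedObject {

    private ByteBuffer raw;   // dropped after the first decode
    private String decoded;   // cached result, null until first use

    public LazyDecodedObject(ByteBuffer raw) {
        this.raw = raw;
    }

    // Decodes on first call, then returns the cached string.
    // Assumes the same charset is passed on every call.
    public synchronized String asString(Charset charset) {
        if (decoded == null) {
            decoded = charset.decode(raw).toString();
            raw = null; // the bytes are no longer needed
        }
        return decoded;
    }

    public static void main(String[] args) {
        LazyDecodedObject obj = new LazyDecodedObject(
                ByteBuffer.wrap("value".getBytes(StandardCharsets.UTF_8)));
        System.out.println(obj.asString(StandardCharsets.UTF_8));
        System.out.println(obj.asString(StandardCharsets.UTF_8));
    }
}
```

Both calls return the same string, and the bytes are decoded only once regardless of how often the heap index is referenced.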

Reproduced and agree this is a bug. Interestingly, in HDFView 3 I can't open this dataset either. I get:
[image: HDFView screenshot]

I'm tempted to go with the easy solution for now, but very slightly different from your suggestion: every time an object is requested from a GlobalHeap, return a slice of the buffer. This will resolve the issue, and slicing the buffer is cheap. It does potentially result in multiple decodings; however, it also ensures concurrent access to a global heap is safe (at least for this case), which could be useful in the future.
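The slicing idea can be sketched as follows. This is not the merged jhdf code; `SlicedHeapObject` and `getData` are hypothetical names used only to show how `ByteBuffer.slice()` gives each caller its own position and limit over the same bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SlicedHeapObject {

    // Never read from directly, so its position stays at the start of the data.
    private final ByteBuffer data;

    public SlicedHeapObject(ByteBuffer data) {
        this.data = data;
    }

    // Returns a new view sharing the same bytes but with an independent
    // position and limit, so consuming it leaves the original untouched.
    public ByteBuffer getData() {
        return data.slice();
    }

    public static void main(String[] args) {
        SlicedHeapObject obj = new SlicedHeapObject(
                ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)));

        // Each request decodes its own slice; the second is no longer empty.
        String first = StandardCharsets.UTF_8.decode(obj.getData()).toString();
        String second = StandardCharsets.UTF_8.decode(obj.getData()).toString();

        System.out.println(first.equals(second));
    }
}
```

No bytes are copied, so the cost per request is a small buffer object, at the price of decoding again for every repeated reference.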

I have merged a fix. Hopefully I'll get a release done in the next few days. Thanks.

Just for completeness, we could read the dataset with HDFView 3.1.0. But the solution sounds good. Thanks!