labstreaminglayer / App-LabRecorder

An application for streaming one or more LSL streams to disk in XDF file format.

HDF5 Support

dioptre opened this issue · comments

Hi there,

Just wondering if you've thought about adding HDF5 support, since it's the more widely used format, and whether there are any issues arguing against it?

Thanks
Andrew

Hi Andrew,

I don't think hdf5 lends itself well to having multiple interleaved streams of data. I might be wrong, but I don't think it can have multiple datasets that are all able to grow arbitrarily in size after they are created. Can it? I know it can have 1, but I seem to recall facing a problem when I tried to have more than 1.

Even if it can have multiple resizable datasets, HDF5 is just a container. The organization of the metadata, datasets, timestamps, etc. doesn't follow any universal specification. The only thing that would improve (over xdf) from a user perspective is that they could use any widely available hdf5 importer to get the data into memory, but they would still have to write a custom layout-specific importer to get an intuitive representation of the data. This is arguably worse than using the existing xdf importers we have.

So then we would need to specify a data layout, and create (and maintain) xdfh5 tools.

There are some nice tools for hdf5 (e.g. Dask) that have some great features when loading large data... so I see the value in hdf5 support. However, I think this value can be more easily obtained with an xdf to hdf5 converter. You can already do this pretty easily with NeuroPype and not really lose any information, but you still encounter the problem of having to know the neuropype h5 layout when you want to load it with tools other than neuropype. I think NeuroPype can also write to nwb. If it doesn't yet then it should be able to soon.
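To be concrete about what a converter involves, a minimal sketch along these lines would do it (using pyxdf and h5py; the group names and attributes below are invented, which is exactly the problem - any real converter would need an agreed-upon layout specification):

```python
# Minimal post-hoc XDF -> HDF5 converter sketch. The layout (group names,
# attributes) is made up for illustration; it is not a standard.
import h5py
import pyxdf

def xdf_to_hdf5(xdf_path, h5_path):
    # load_xdf already applies the recorded clock offsets during import
    streams, _header = pyxdf.load_xdf(xdf_path)
    with h5py.File(h5_path, "w") as f:
        for i, stream in enumerate(streams):
            info = stream["info"]
            grp = f.create_group(f"stream_{i}_{info['name'][0]}")
            data = stream["time_series"]
            if not hasattr(data, "dtype"):
                continue  # string/marker streams need special handling, omitted here
            grp.create_dataset("time_series", data=data)
            grp.create_dataset("time_stamps", data=stream["time_stamps"])
            grp.attrs["type"] = info["type"][0]
            grp.attrs["nominal_srate"] = info["nominal_srate"][0]
```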

I also have the opinion that importing arbitrary h5 files in Matlab kind of sucks and I don't see it as an improvement over the xdf-Matlab importer.

This is an interesting argument about declarative vs. imperative approaches to storing data. I see what you're saying regarding not being explicit enough and losing the meaning of data - that said, we can use metadata - so I'm not sure you'd lose any specification info. I don't think there's an issue with multiple resizable datasets.

On the other hand, you'd be able to use the format nearly everywhere and in whatever language you'd like... which would open us up to more sophisticated tooling than Matlab? Perhaps ubiquity might be a better option than specialization? But it's an old, ongoing argument in the tech world.
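(For what it's worth, this is roughly what I mean by multiple resizable datasets, assuming h5py; the dataset names, shapes, and chunk sizes are just made up for illustration:)

```python
import numpy as np
import h5py

with h5py.File("test_resizable.h5", "w") as f:
    # two datasets, both created with an unlimited first dimension
    eeg = f.create_dataset("eeg", shape=(0, 8), maxshape=(None, 8),
                           chunks=(1024, 8), dtype="f4")
    markers = f.create_dataset("markers", shape=(0,), maxshape=(None,),
                               chunks=(256,), dtype=h5py.string_dtype())

    # each one can be grown independently, in any order, after creation
    block = np.random.randn(512, 8).astype("f4")
    eeg.resize(eeg.shape[0] + len(block), axis=0)
    eeg[-len(block):] = block

    markers.resize(markers.shape[0] + 1, axis=0)
    markers[-1] = "stimulus_onset"
```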

that said we can use metadata - so I'm not sure if you'd lose any specification info

We'd still have to agree on the specifications. What will the field names be exactly? What is the type and layout of the information they hold? etc. But actually maybe it's just better to create & use a specification that is an extension of NWB? e.g. NWB:X? (I would suggest NWB:N, but I don't think that holds all of the information we need for different modalities. LSL ecosystem records more than just EEG.)

I would be interested to see a pull request wherein LabRecorder users have the option of saving either to XDF or to NWB.

But that seems like a lot of work, and the value added isn't a whole lot more than what you would get from the comparatively easier job of writing a lightweight pyxdf --> pynwb converter, if for some reason you aren't happy with the conversion options available in NeuroPype or MNE-Python. This tool could also come with an anonymization feature and a BIDS validator. I think there's a lot of value here.
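Roughly what I have in mind for that converter, with placeholder field mappings (session description, units, and the handling of marker streams would all need to be specified properly):

```python
# Lightweight pyxdf -> pynwb converter sketch; the NWB field values here
# are placeholders, not a proposed specification.
from datetime import datetime, timezone
import pyxdf
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

def xdf_to_nwb(xdf_path, nwb_path):
    streams, _ = pyxdf.load_xdf(xdf_path)  # synchronization happens here
    nwbfile = NWBFile(session_description="converted from XDF",
                      identifier=xdf_path,
                      session_start_time=datetime.now(timezone.utc))
    for i, stream in enumerate(streams):
        data = stream["time_series"]
        if not hasattr(data, "dtype"):
            continue  # marker/string streams would map to a different NWB type
        nwbfile.add_acquisition(TimeSeries(
            name=f"{stream['info']['name'][0]}_{i}",  # suffix keeps names unique
            data=data,
            unit="unknown",
            timestamps=stream["time_stamps"]))
    with NWBHDF5IO(nwb_path, "w") as io:
        io.write(nwbfile)
```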

There's a much bigger problem that I forgot to mention before.
XDF stores clock offsets between the recording computer and every stream it records. Then, upon loading, the xdf importers (pyxdf and xdf-Matlab) use those clock offsets to synchronize the streams. While LSL provides functionality to do the clock adjustments in real time, in practice it is better and more accurate to do this synchronization offline after all the data and clock offsets have been recorded.
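To illustrate what the importers do with those offsets: XDF stores periodic (time, offset) measurements per stream, and the loader uses them to correct the raw timestamps after the fact. A plain linear fit like the one below captures the idea; the real pyxdf implementation is more careful (it also de-jitters timestamps and handles outliers):

```python
import numpy as np

def apply_clock_offsets(time_stamps, offset_times, offset_values):
    # model the stream's clock offset as a linear function of time,
    # then shift every raw timestamp onto the recording computer's clock
    slope, intercept = np.polyfit(offset_times, offset_values, 1)
    return time_stamps + (slope * time_stamps + intercept)
```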

So, if we were to write to another format, we would have to either:

  1. Do the synchronization online; this gives a worse experience than XDF.
  2. Store the clock offsets in the HDF file somewhere, and then when people load the HDF file using something other than the official importer we provide and inevitably ignore the offsets, they will complain about how terrible the synchronization is and/or bombard us with questions about how to synchronize.

So once again, a converter is a much better option.

I agree fully with Chadwick. HDF is not suitable for continuous recording. So we would need a converter first. But a format like xdf will be more compact, and we already have xdf loaders for many languages. So where is the benefit of hdf?

My use-case is that I'm interested in far more than neuro data (just to fill in the background). It's interesting that the clock offsets are handled in post-processing. I totally understand why you like xdf now. Perhaps that's something we could improve in the underlying LSL protocol? I wonder if we could include sync and buffering while writing in real time. We use our own format in neuromore.com studio, but I am pushing the team more towards open source - so I'm interested to see what best practices we can develop with the community. NWB looks like what I might be after - super interesting, thanks @cboulay

Not sure what you are on about @agricolab

In my limited experience, manipulating or extending an hdf5 file usually results in bloated files. See https://support.hdfgroup.org/HDF5/doc/H5.user/Performance.html
That's why I feel they are not as suitable for online recording of data with variable chunk size, and why it seemed to me that the approach which will result in the most compact files is a post-hoc conversion from xdf to hdf.
But I might just not be clever enough to understand the intricacies of how to set up extendible hdfs.
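(A toy version of the scenario I have in mind, assuming h5py: many small, variable-sized appends to a chunked dataset. Picking the chunk size well is the hard part - too small inflates the metadata overhead, too large wastes space in the partially filled tail:)

```python
import numpy as np
import h5py

with h5py.File("online_recording.h5", "w") as f:
    dset = f.create_dataset("eeg", shape=(0, 32), maxshape=(None, 32),
                            chunks=(4096, 32), dtype="f4")  # chunk size is a guess
    for _ in range(1000):
        # append whatever the acquisition loop happened to deliver
        block = np.random.randn(np.random.randint(1, 64), 32).astype("f4")
        dset.resize(dset.shape[0] + len(block), axis=0)
        dset[-len(block):] = block
```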

Interesting @agricolab, I haven't come across that issue myself, as I don't remove chunks from the data I'm capturing. From what I read, though, if you do get bloat you can rewrite the file and it will free up the space. Astronomers use h5, so I'm sure it can capture streaming sufficiently at massive bitrates. I'm a fan of using standards - that way you benefit from so much extra tooling and platform support.