Implementation of HDF binary output

Question

martijnende opened this issue 4 years ago · comments

Martijn van den Ende commented 4 years ago

Introduction

Regardless of when the HDF binary output is scheduled to be implemented, I think it would be useful to explore it at some point (as discussed in #42). Particularly for large-scale (3D) simulations, the data volume can limit post-processing (memory, disk space, IO time). Luckily for us, HDF is designed much like an OS file system, which lets us exploit particular structures in the data.

Proposed data structure

With reference to the figure below, it is possible to store the various data types (time series, snapshots) in groups and subgroups, starting with the "root" group, and metadata can be assigned to each (sub)group header. The data contained in a given group can be shared with other groups, which enables us to share/recycle invariant data (similar to symbolic links in file systems, or relational databases in SQL environments). Lastly, in addition to garden-variety data (integers, floats), files (e.g. the qdyn.in file) can also be directly attached to a group.

Concretely, for snapshot output, the OX mesh locations can be stored once and shared with each snapshot (1, 2, ..., N). The time of each snapshot is simply a meta tag. Together, this eliminates 4 out of the 10 OX quantities that would otherwise be stored for each snapshot. An identical scheme applies to dynamic OX output (sampled at a different rate). Similarly, for the time series output the time vector (but also v_max, tau_max, etc.; not indicated in the figure) is shared between OT and IOT, each sampled at a different location (1, 2, ..., K). OT and IOT output are consequently generalised to a common data structure.
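To make the snapshot layout a bit more tangible, here is a minimal sketch in Python using h5py (chosen only because it is quick to prototype with; the production writer would live in the Fortran code). The group names ("mesh", "snapshots"), the attribute name "t", the zero-padded snapshot IDs and the function name write_snapshots are all hypothetical choices for illustration, not an agreed format.

```python
import numpy as np
import h5py


def write_snapshots(path, mesh, snapshots):
    """Write snapshots that all share a single copy of the mesh.

    mesh:      dict of name -> 1D array (e.g. "x", "y", "z"), stored once
    snapshots: list of (time, dict of name -> 1D array) tuples
    All names used in the file layout are assumptions for this sketch.
    """
    with h5py.File(path, "w") as f:
        # Invariant mesh coordinates are written only once.
        mesh_grp = f.create_group("mesh")
        for name, values in mesh.items():
            mesh_grp.create_dataset(name, data=np.asarray(values))

        snap_root = f.create_group("snapshots")
        for i, (t, fields) in enumerate(snapshots, start=1):
            grp = snap_root.create_group(f"{i:06d}")
            # The snapshot time is a metadata tag, not a full column.
            grp.attrs["t"] = t
            # Hard link to the shared mesh group: no data is copied.
            grp["mesh"] = mesh_grp
            for name, values in fields.items():
                grp.create_dataset(name, data=np.asarray(values))
```

Each snapshot group then exposes its own fields plus a "mesh" entry that resolves to the one shared copy, so a reader can treat a snapshot as self-contained without the coordinates being duplicated on disk.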

For reading the simulation output, one would no longer need to read the entire output file into memory. Instead, one selects only the relevant data columns of a given group. So if I require the slip of snapshots from t = 10 to t = 25, I only need to check the metadata of the snapshot groups, and select the slip column for each group that satisfies the selection criteria. The mesh locations need to be extracted only once. In addition, I believe it is also possible to slice columns: if I need time series v_max data from t = 100 to t = 500, I could select only a portion of the v_max column.
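As a rough illustration of this kind of selective reading, the sketch below uses h5py in Python. It assumes a hypothetical layout with the mesh under "mesh/x", one group per snapshot under "snapshots" carrying a "t" attribute and a "slip" dataset, and time series under "time_series/t" and "time_series/v_max". None of these names are fixed; they only serve the example, as do the function names.

```python
import numpy as np
import h5py


def read_slip_between(path, t_min, t_max):
    """Return mesh x, snapshot times and slip for t_min <= t <= t_max."""
    times, slips = [], []
    with h5py.File(path, "r") as f:
        # The shared mesh is read a single time.
        x = f["mesh/x"][...]
        for name in sorted(f["snapshots"]):
            grp = f["snapshots"][name]
            # Only the small time attribute is inspected at this point.
            t = float(grp.attrs["t"])
            if t_min <= t <= t_max:
                times.append(t)
                # Only the slip dataset of a matching snapshot is loaded.
                slips.append(grp["slip"][...])
    return x, np.array(times), slips


def read_v_max_window(path, t_min, t_max):
    """Return (t, v_max) restricted to t_min <= t <= t_max.

    Assumes the time vector is sorted in ascending order.
    """
    with h5py.File(path, "r") as f:
        t = f["time_series/t"][...]
        i0 = int(np.searchsorted(t, t_min, side="left"))
        i1 = int(np.searchsorted(t, t_max, side="right"))
        # Slicing the dataset reads only the requested portion from disk.
        return t[i0:i1], f["time_series/v_max"][i0:i1]
```

In this sketch the full time vector is still read in order to locate the window. For very long time series one could store coarse index information as metadata instead, but that is a design choice beyond this example.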

Implementation challenges

The data structure proposed above (column-based) is very different from the current output structure (row-based). While this is not necessarily a problem, it requires a bit more thought to implement. The procedure for creating and modifying an HDF5 file in Fortran is a little convoluted (see e.g. this SO answer for a "simple" implementation example). While playing around, I already lost 2 hours just trying to get a tutorial code compiled, so I expect that it will take some time to set up and debug everything.

Even though parallel HDF5 is layered on top of MPI-IO, I am not sure whether the data structure proposed above is suitable for parallel IO. Since IO likely does not incur a large overhead for most simulations, it might be better to stick with a conventional MPI gather and do the IO in serial mode.

My biggest fear is that the data will likely be corrupted if the HDF file (and its subspaces) is not closed properly. So when a simulation crashes or is manually terminated, all of the simulation data may be lost (which does not happen with ASCII file formats). Since we all work very hard to eliminate all bugs and instabilities, I'm not so much worried about crashing simulations, but I often terminate a simulation at an early stage (e.g. after the post-seismic phase) to inspect the data. In C++ and Python it is possible to intercept a keyboard interrupt and terminate things safely, but in Fortran this seems to be a bit tricky (from quickly browsing through this PDF). We'll have to see in practice how we could best implement a destructor triggered by various exceptions.
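For reference, this is roughly what the "intercept and terminate safely" pattern looks like in Python, using only the standard library. It is a sketch of the idea rather than a proposal for the Fortran side: a plain text file stands in for the HDF5 handle, and the names StopFlag, run and interrupt_at are made up for the example. The handler only sets a flag, so the current write always completes before the loop exits and the file is closed.

```python
import signal


class StopFlag:
    """Signal handler that records a stop request instead of aborting."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def run(path, n_steps, interrupt_at=None):
    """Write one line per step; on SIGINT, stop early and close cleanly.

    interrupt_at is a hypothetical test hook that simulates Ctrl+C
    at the given step. Returns the number of steps written.
    """
    stop = StopFlag()
    previous = signal.signal(signal.SIGINT, stop)
    steps_done = 0
    try:
        # Plain text file used as a stand-in for the HDF5 file handle.
        with open(path, "w") as out:
            for step in range(n_steps):
                out.write(f"{step}\n")
                steps_done += 1
                if step == interrupt_at:
                    # Simulated keyboard interrupt (Python 3.8+).
                    signal.raise_signal(signal.SIGINT)
                if stop.requested:
                    # Leave the loop; the with-block closes the file.
                    break
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGINT, previous)
    return steps_done
```

The point of deferring the exit to a known place in the time loop is that the output file is never abandoned halfway through a write. Whether an equivalent hook can be set up reliably in Fortran is exactly the open question.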