Unidata / netcdf-fortran

Official GitHub repository for netCDF-Fortran libraries, which depend on the netCDF C library. Install the netCDF C library first.

Failure with latest version with parallel, but not sequential writes

edwardhartnett opened this issue · comments

I am still gathering info on this issue. Here's what we know so far:

Kate Friedman:

So the parallel netcdf issue on Hera is a known issue (a Lustre issue, we think), and as such we run low resolutions with serial netcdf there. We used to force all resolutions to serial netcdf on Hera, but recent updates in our develop branch changed that (see below).

We also see this issue on WCOSS2 with C384, which I believe GDIT continues to investigate. It didn't track with me at first that folks were hitting this issue on Hera; I thought we'd walled that issue off with resolution recommendations and such. A true cause has yet to be determined, but in my experience it happens on Lustre filesystems; we never saw this on the prior WCOSS GPFS filesystem. I ran a LOT of tests on WCOSS2 when it cropped up there during the transition. One solution was related to chunking (set it to zero and let the model define it at runtime instead of defining it beforehand). While this remedied the issue for the GFS, it slowed down downstream models that read the output with different chunking, so we were forced to back that solution out of the C384 enkf forecast jobs in operational GFSv16. For now, we run those C384 enkf forecast jobs with serial netcdf and have thrown more resources at them to speed them up in ops.

Here are the lines in our configs that define parallel vs serial netcdf by resolution (CASE) in our develop branch:

https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/config.fv3#L170:L180

Parallel netcdf is used only at high resolution, and we recommend high resolution only on WCOSS2 (where there are decent resources for it), so users are advised not to run it elsewhere, both because of resources and because of the occasional failure of the type described above and by Judy.
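
To make the resolution-based selection above concrete, here is a minimal Python sketch of the idea. It is not the actual contents of config.fv3, which is not reproduced in this thread. The function name `output_file_type` and the cutoff value `PARALLEL_MIN_RESOLUTION` are hypothetical, chosen only for illustration; the only things taken from this thread are that CASE values look like "C384" and that the two output types are "netcdf-parallel" and "netcdf".

```python
import re

# Hypothetical cutoff: the thread only says that high resolutions use
# parallel netcdf. The exact value here is an assumption for illustration.
PARALLEL_MIN_RESOLUTION = 768


def output_file_type(case: str, parallel_min: int = PARALLEL_MIN_RESOLUTION) -> str:
    """Pick the output file type for a resolution string such as "C384".

    Returns "netcdf-parallel" at or above the cutoff, "netcdf" (serial) below it.
    """
    match = re.fullmatch(r"C(\d+)", case)
    if match is None:
        raise ValueError(f"unrecognized CASE value: {case!r}")
    resolution = int(match.group(1))
    return "netcdf-parallel" if resolution >= parallel_min else "netcdf"


if __name__ == "__main__":
    # Show which output type each example resolution would get.
    for case in ("C96", "C384", "C768"):
        print(case, output_file_type(case))
```

Keeping the cutoff in one place means a site that hits the failure described here could fall back to serial writes by raising a single value, rather than editing each resolution branch.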

I am not sure if this information is useful or adequate, but on at least one of our systems the problems described above seem to have started after we upgraded our Lustre client from:

lustre-client-2.12.6_ddn66-1.el7.x86_64

to

lustre-client-2.12.8_ddn18-1.el7.x86_64

The user has confirmed that it was working fine prior to the upgrade. When she repeated the same case afterwards, it failed with the following error:

921: file: module_write_netcdf.F90 line: 643 NetCDF: HDF error

She then experimented by changing the output file type from "netcdf-parallel" to "netcdf", and it started working fine again.

Thank you!

OK, so this looks a lot like a problem with Lustre and not with netCDF.

Can the upgrade of Lustre be rolled back?

Were the HDF5 tests run with the new version of Lustre? If HDF5 doesn't work, then netCDF-4 is not going to work.