Unidata / netcdf-fortran

Official GitHub repository for netCDF-Fortran libraries, which depend on the netCDF C library. Install the netCDF C library first.

Failure w/ mpicc in `nf03_test4/run_f90_par_test.sh`

WardF opened this issue

Observing this in the Docker image I'm using for testing. Potentially a last-minute blocker for the release.

See the attached files for the error and installation details. The issue is happening with configure-based builds; I haven't tested cmake-based builds yet. The libnetcdf and libnetcdff libraries are being built on a system without the zstd and libzstd-dev packages installed, and the libnetcdf[f].settings files reflect that both are built without zstandard support. run_f90_par_test.sh is failing when testing compressed data writes with parallel I/O. My immediate guess is that it is attempting to perform zstandard tests anyway, maybe? Installing zstd and libzstd-dev corrects the issue. Currently working on nailing down what's going on.
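
For reference, the failing stage exercises compressed (deflate) variable writes under parallel I/O through the F90 API. Below is a minimal sketch of that pattern, not the actual test source; the file name, sizes, and STOP code are illustrative.

    program par_deflate_sketch
      use mpi
      use netcdf
      implicit none
      integer, parameter :: nx = 16
      integer :: ierr, ncid, dimid, varid, my_rank, nprocs
      integer :: values(nx)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Create a netCDF-4 file for parallel access.
      ierr = nf90_create('tst_sketch.nc', ior(NF90_NETCDF4, NF90_MPIIO), ncid, &
           comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
      ierr = nf90_def_dim(ncid, 'x', nx * nprocs, dimid)
      ierr = nf90_def_var(ncid, 'v', NF90_INT, dimid, varid)

      ! Compression plus parallel I/O requires collective access to the variable.
      ierr = nf90_def_var_deflate(ncid, varid, shuffle=0, deflate=1, deflate_level=1)
      ierr = nf90_var_par_access(ncid, varid, NF90_COLLECTIVE)
      ierr = nf90_enddef(ncid)

      ! Each rank writes its own contiguous slice.
      values = my_rank
      ierr = nf90_put_var(ncid, varid, values, start=(/ my_rank*nx + 1 /), &
           count=(/ nx /))
      if (ierr /= NF90_NOERR) stop 170   ! the tests abort with STOP codes on error

      ierr = nf90_close(ncid)
      call MPI_Finalize(ierr)
    end program par_deflate_sketch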

libnetcdf.settings.txt
libnetcdff.settings.txt
test-suite.log.txt

Installing zstd and libzstd-dev on the system, without recompiling libnetcdf but recompiling libnetcdff, results in everything passing. The corresponding libnetcdff.settings file still shows Zstandard support as 'no'.

So you are saying netcdf-fortran is not correctly detecting that zstandard is not built in the C library?

@edwardhartnett No, that's the weird part: it is properly detecting the absence of zstandard in the C library and reflects that (as shown in the libnetcdff.settings.txt file attached above). I'm at a loss as to why the failure happens, or why installing zstd and libzstd-dev fixes it, but I'm still sorting through it. I'm going to move to a Linux VM instead of the Docker container and see if I can recreate it there (both are running the latest Ubuntu, however).

It is peculiar in part because I don't recompile netcdf-c after installing libzstd-dev/zstd.

After installing those packages and recompiling netcdf-fortran (without re-running configure, either, just make clean), I see the following when I run $ ldd fortran/.libs/libnetcdff.so.7:

(...)
libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1 (0x0000ffff8ba40000)
(...)

Relevant, I feel, but I need to determine why this dependency exists.

@edwardhartnett it's up in the first comment, see the attached test-suite.log.txt file.

It's possible/probable that zstd is a red herring here, as I may not have been passing --enable-parallel-tests to configure when running by hand. Regardless, the failure reported above remains to be debugged.

Recreating now in a clean VM (Ubuntu 22.04), stand by.

Neither zstd nor libzstd-dev was installed when the following commands were run and the attached files generated.

HDF5 1.12.2

$ CC=mpicc ./configure --disable-static --enable-shared --prefix=/usr --enable-parallel --with-szlib && make -j 12 && sudo make install

NetCDF-C v4.9.0

$ CC=mpicc ./configure --disable-static --enable-shared --prefix=/usr && make -j 12 && sudo make install 

NetCDF-Fortran v4.6.0-wellspring.wif

$ autoreconf -if 
$ FC=mpifort CC=mpicc ./configure --disable-static --enable-shared --enable-parallel-tests && make check -j 12 TESTS="" 
$ make check

This last make check invocation results in the failure seen above, and attached below (with the accompanying [package].settings files).

I've confirmed that this happens regardless of whether zstd and libzstd-dev are installed, so it's likely I forgot to include --enable-parallel-tests when I was initially debugging this, and attributed the subsequent success to zstd when in fact it was unrelated. It's just a more straightforward STOP 170 failure.

Seeing additional output on a different machine:


wardf@dev:~/Desktop/gitprojects/netcdf-fortran/nf03_test4$ ./run_f90_par_test.sh
Testing netCDF parallel I/O through the F90 API.

 *** Testing netCDF-4 parallel I/O from Fortran 90.
 *** SUCCESS!

 *** Testing netCDF-4 parallel I/O with strided access.
 *** SUCCESS!

 *** Testing netCDF-4 parallel I/O with fill values.
 *** SUCCESS!
           0           1           1           1           3           1           1   1.00000000       2.00000000       3.00000000
           3           4           2           1           3           1           1   10.0000000       11.0000000       12.0000000
           4           1           3           1           3           1           1   13.0000000       14.0000000       15.0000000
           5           4           3           1           3           1           1   16.0000000       17.0000000       18.0000000
           6           1           4           1           3           1           1   19.0000000       20.0000000       21.0000000
           1           4           1           1           3           1           1   4.00000000       5.00000000       6.00000000
           2           1           2           1           3           1           1   7.00000000       8.00000000       9.00000000
           7           4           4           1           3           1           1   22.0000000       23.0000000       24.0000000
 *** Testing fill values with parallel I/O.
 *** SUCCESS!
 *** Testing compressed data writes with parallel I/O.
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
STOP 170
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
STOP 170
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
STOP 170
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
STOP 170

Huh.

This 'failure' does not appear when running natively on macOS (on an ARM/M1-based system) using mpich installed via Homebrew. The 'failure' appears in a Docker container and in a virtual machine hosted on the same machine, each running the ARM version of Ubuntu 22.04. The 'failure' appears, with additional output referencing IEEE_DENORMAL, in a WSL-based Ubuntu 22.04 environment hosted on an x86_64-based Windows 11 machine.

Initial research suggests the failure may be a 'failure' in name only, i.e. it does not indicate a show-stopper or loss of data. I need to convince myself whether that's true, but I'm still looking into it.
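
For what it's worth, the IEEE_DENORMAL lines and the STOP 170 are separate things: gfortran prints a summary of any signalling floating-point exception flags whenever a program executes STOP, so the note is informational and the non-zero STOP code is the actual failure. A toy illustration of the mechanism as I understand it (gfortran, no optimization; exactly which flag gets raised is platform-dependent):

    program denormal_note
      implicit none
      real :: x
      x = tiny(x)
      x = x / 2.0   ! underflows to a subnormal value
      x = x + x     ! consuming a subnormal operand can raise the denormal flag on x86
      print *, x
      stop 170      ! gfortran appends the 'exceptions are signalling' note here
    end program denormal_note

If the note itself is the only concern, gfortran's -ffpe-summary=none should suppress it; it has no bearing on the STOP code.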

Passes on the host M1 machine using mpifort version 11.3.0.
Passes on a Debian 11-based VM on an M1 host using mpifort version 9.4.0.
'Fails' on an Ubuntu 22.04-based VM on an M1 host using mpifort version 11.2.0.

I'm starting to think this is simply an environmental issue that should be treated as more akin to a warning, unless it can easily be mitigated/dismissed/suppressed. If so, it may not be a blocking issue for the release, and I'll try to get the release out tomorrow. Out of curiosity, @edwardhartnett, what Fortran compiler/version are you using that you (presumably) aren't seeing this with?

Ok, I think I've got it sorted; if I'm reading this correctly, a variable is never populated in this test, but on some platforms the fresh array happens to be zero-filled and on others it isn't, resulting in behavior that is consistent per platform but undefined. Getting the fix in, will update our GitHub Actions, and get this release out.
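
A hypothetical reduction of that failure mode (not the actual test code; names and STOP code are illustrative): an expected-values array is read without ever being assigned, so whether the comparison passes depends on whatever happens to be in that memory on a given platform.

    program uninit_sketch
      implicit none
      integer, parameter :: n = 8
      real :: expected(n), actual(n)
      integer :: i

      actual = (/ (real(i), i = 1, n) /)

      ! BUG: 'expected' was read without ever being assigned. Some platforms
      ! hand back zeroed memory, others garbage, making the check below
      ! undefined but consistent per platform.
      ! FIX: populate it explicitly before comparing:
      do i = 1, n
         expected(i) = real(i)
      end do

      if (any(actual /= expected)) stop 170
      print *, '*** SUCCESS!'
    end program uninit_sketch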

Actually, GitHub Actions don't need to be updated. So I'm propagating this into the release branch, and it will also go back up into main.