Failure w/ mpicc in `nf03_test4/run_f90_par_test.sh`
WardF opened this issue
Observing in the docker image I'm using for testing. Potentially a last-minute blocker for the release. See attached files for the error and installation details. The issue is happening with `configure`-based builds; I haven't tested `cmake`-based builds yet. The libraries `libnetcdf` and `libnetcdff` are being built on a system without the `zstd` and `libzstd-dev` packages installed. The `libnetcdf[f].settings` files reflect that both are being built without zstandard support. `run_f90_par_test.sh` is failing when testing compressed data writes with parallel I/O. My immediate guess is that it is attempting to perform zstandard tests, maybe? Installing `zstd` and `libzstd-dev` corrects the issue. Currently working on nailing down what's going on.
libnetcdf.settings.txt
libnetcdff.settings.txt
test-suite.log.txt
Installing `zstd` and `libzstd-dev` on the system, without recompiling `libnetcdf` but recompiling `libnetcdff`, results in everything passing. The corresponding `libnetcdff.settings` still shows Zstandard support as `no`.
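A quick way to confirm what a given build actually recorded is to grep its settings file. This is only a sketch: the real file is generated by `configure` into the build/install tree, so a stand-in file with a mimicked line is created here for illustration.

```shell
# Stand-in mimicking the relevant line of libnetcdff.settings
# (the real file is generated by configure; exact wording may differ).
printf 'Zstandard Support:\tno\n' > /tmp/libnetcdff.settings

# Check whether the library reports zstandard support.
grep -i 'zstandard' /tmp/libnetcdff.settings
```

This makes it easy to compare what the C and Fortran builds each believe about zstandard, independent of what is actually linked.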
So you are saying netcdf-fortran is not correctly detecting that zstandard is not built in the C library?
@edwardhartnett No, that's the weird part: it is properly detecting the absence of zstandard in the C library, and reflects it (as shown in the `libnetcdff.settings.txt` file attached above). I'm at a loss as to why it's happening, or why installing `zstd` and `libzstd-dev` fixes it, but am still sorting through it. I'm going to move to a Linux VM instead of the docker container and see if I can recreate it there (both are running the latest Ubuntu, however).

It is peculiar in part because I don't recompile netcdf-c after installing `libzstd-dev`/`zstd`.
After installing and compiling netcdf-fortran (without re-running `configure`, either; just `make clean`), I see the following when I run `ldd fortran/.libs/libnetcdff.so.7`:

```
(...)
libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1 (0x0000ffff8ba40000)
(...)
```

Relevant, I feel, but I need to determine why this dependency exists.
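That `ldd` probe can be scripted to compare builds on other machines. A sketch, assuming the library path from the build tree above (it won't exist elsewhere, hence the fallback message):

```shell
# Probe the freshly built Fortran library for a transitive libzstd
# dependency; print a fallback message if the library is missing or
# no such dependency is recorded.
ldd fortran/.libs/libnetcdff.so.7 2>/dev/null | grep -i 'libzstd' \
  || echo "no zstd dependency found"
```

Running this before and after installing `libzstd-dev` would show whether the dependency is introduced at netcdf-fortran link time rather than coming from netcdf-c.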
@edwardhartnett it's up in the first comment; see the attached `test-suite.log.txt` file.
It's possible/probable that `zstd` is a red herring here, as I may not have been passing `--enable-parallel-tests` to `configure` when running by hand. Regardless, the failure reported above remains to be debugged.
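One way to rule that out is to check which flags the last `configure` run actually recorded. A sketch using a stand-in `config.log`, since autoconf writes the real one into the build directory:

```shell
# Stand-in for the header of an autoconf-generated config.log, which
# records the exact configure invocation that was used.
cat > /tmp/config.log <<'EOF'
  $ ./configure --disable-static --enable-shared --enable-parallel-tests
EOF

# Verify the flag was actually passed to configure.
grep -o 'enable-parallel-tests' /tmp/config.log
```

In a real build tree the same `grep` against `config.log` (or `./config.status --config`) settles whether the flag was present for a given run.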
Recreating now in a clean VM (Ubuntu 22.04), stand by.
Neither `zstd` nor `libzstd-dev` were installed when the following were run/generated.

HDF5 1.12.2:

```
$ CC=mpicc ./configure --disable-static --enable-shared --prefix=/usr --enable-parallel --with-szlib && make -j 12 && sudo make install
```

NetCDF-C v4.9.0:

```
$ CC=mpicc ./configure --disable-static --enable-shared --prefix=/usr && make -j 12 && sudo make install
```

NetCDF-Fortran v4.6.0-wellspring.wif:

```
$ autoreconf -if
$ FC=mpifort CC=mpicc ./configure --disable-static --enable-shared --enable-parallel-tests && make check -j 12 TESTS=""
$ make check
```

This last `make check` invocation results in the failure as seen above and attached below (with accompanying `[package].settings` files).
I've confirmed that this happens regardless of whether `zstd` and `libzstd-dev` are installed, so it's likely I forgot to include `--enable-parallel-tests` when I was initially debugging this, and concluded that the subsequent success was related when in fact it wasn't. It's just a more straightforward `STOP 170` failure.
Seeing additional output on a different machine:

```
wardf@dev:~/Desktop/gitprojects/netcdf-fortran/nf03_test4$ ./run_f90_par_test.sh
Testing netCDF parallel I/O through the F90 API.
*** Testing netCDF-4 parallel I/O from Fortran 90.
*** SUCCESS!
*** Testing netCDF-4 parallel I/O with strided access.
*** SUCCESS!
*** Testing netCDF-4 parallel I/O with fill values.
*** SUCCESS!
0 1 1 1 3 1 1 1.00000000 2.00000000 3.00000000
3 4 2 1 3 1 1 10.0000000 11.0000000 12.0000000
4 1 3 1 3 1 1 13.0000000 14.0000000 15.0000000
5 4 3 1 3 1 1 16.0000000 17.0000000 18.0000000
6 1 4 1 3 1 1 19.0000000 20.0000000 21.0000000
1 4 1 1 3 1 1 4.00000000 5.00000000 6.00000000
2 1 2 1 3 1 1 7.00000000 8.00000000 9.00000000
7 4 4 1 3 1 1 22.0000000 23.0000000 24.0000000
*** Testing fill values with parallel I/O.
*** SUCCESS!
*** Testing compressed data writes with parallel I/O.
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
STOP 170
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
STOP 170
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
STOP 170
Note: The following floating-point exceptions are signalling: IEEE_DENORMAL
STOP 170
```
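For context on why the harness treats this as a failure: a Fortran `STOP n` with a nonzero integer code becomes the process exit status, which the test script then reports up through `make check`. A minimal shell-only illustration, with a subshell standing in for the test binary:

```shell
# A subshell exiting with 170 stands in for a Fortran program
# executing STOP 170; the parent shell sees it as the exit status.
( exit 170 )
echo "exit status: $?"
# → exit status: 170
```

So any nonzero STOP code in any MPI rank is enough to fail the test, independent of whether data was actually lost.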
Huh.
This 'failure' does not appear when running natively on macOS (on an ARM/M1-based system) using mpich installed via homebrew. The 'failure' appears in a docker container and a virtual machine hosted on the same machine, each running the ARM version of Ubuntu 22.04. The 'failure' appears with additional info referencing `IEEE_DENORMAL` in a WSL-based Ubuntu 22.04 environment hosted on an x86_64-based Windows 11 machine.
Initial research suggests the failure may be a 'failure' in the sense that it does not indicate a show-stopper or loss of data. I need to convince myself whether that's true or not, but I'm still looking into it.
Passes on host M1 machine using mpifort version 11.3.0.
Passes on a Debian 11-based VM on an M1 host using mpifort version 9.4.0.
'Fails' on the Ubuntu 22.04-based VM on an M1 host using mpifort version 11.2.0.
I'm starting to think this issue is simply an environmental issue that should be treated more akin to a warning, unless it can be easily mitigated/dismissed/suppressed. So, it may not necessarily be a blocking issue for the release. If that's the case, I'll try to get the release out tomorrow. Out of curiosity, @edwardhartnett, what Fortran compiler/version are you using that you (presumably) aren't seeing this?
Ok, I think I've got it sorted. If I'm reading this correctly, a variable is not being populated in this test; on some platforms the array gets filled with '0' anyway, and on others it doesn't, resulting in this platform-consistent, but undefined, behavior. Got the fix going in, will update our GitHub actions, and get this release out.
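For this class of bug, gfortran can make the behavior deterministic instead of platform-dependent: `-finit-real=snan` poisons uninitialized reals and `-ffpe-trap=invalid` aborts on their first use. A sketch that only prints the hypothetical command, since no compiler or test source is assumed here, and the `.f90` filename is a placeholder for the failing test source:

```shell
# Hypothetical invocation: the filename is a placeholder, and the
# flags are gfortran-specific (other compilers need different spelling).
cmd='gfortran -g -finit-real=snan -ffpe-trap=invalid f90tst_parallel_compressed.f90'
echo "$cmd"
```

Built this way, a read of the unpopulated variable traps immediately on every platform rather than silently succeeding wherever memory happens to be zero-filled.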
Actually, GitHub actions don't need to be updated. So, propagating this into the release branch, and it will also go back up into main.