9.2.2 testsuite failures on MIPS architectures

Question

mbanck opened this issue 3 years ago

Version 9.2.2 has introduced a testsuite regression on MIPS (32-bit or 64-bit little-endian) architectures, see https://buildd.debian.org/status/package.php?p=abinit

With version 8.10.3 the testsuite still passed, see https://buildd.debian.org/status/logs.php?pkg=abinit&arch=mips64el and https://buildd.debian.org/status/fetch.php?pkg=abinit&arch=mips64el&ver=8.10.3-3&stamp=1601210217&raw=0

Debian runs "runtests.py fast" as its test suite. The fldiff report for t24, for example, looks like this:

# YAML support is available, but is disabled for this test.
# Start legacy fldiff comparison report
2
< .Version 9.0.0 of ABINIT
> .Version 9.2.2 of ABINIT
3
< .(MPI version, prepared for a x86_64_linux_gnu9.2 computer)
> .(MPI version, prepared for a mipsel_linux_gnu10.2 computer)
17
< .Starting date : Mon 24 Feb 2020.
> .Starting date : Mon 22 Feb 2021.
208
<  ETOT  1  -32.145013381823    -3.215E+01 2.148E-01 5.370E+02
>  ETOT  1  -32.033288641759    -3.203E+01 2.219E-01 5.278E+02
211
<  Fermi (or HOMO) energy (hartree) =  -0.09171   Average Vxc (hartree)=  -0.32386
>  Fermi (or HOMO) energy (hartree) =   0.04698   Average Vxc (hartree)=  -0.32597
214
<   -0.81296   -0.47940   -0.40280   -0.38621   -0.33010   -0.29020   -0.13508
>   -0.70867   -0.31862    0.00000    0.00000    0.00000    0.00042    0.01726
216
<   -0.81834   -0.48453   -0.45154   -0.37558   -0.35487   -0.31470   -0.09171
>   -0.55458   -0.21241    0.00000    0.00000    0.00000    0.00172    0.04698
217
<  Fermi (or HOMO) energy (eV) =  -2.49552   Average Vxc (eV)=  -8.81280
>  Fermi (or HOMO) energy (eV) =   1.27840   Average Vxc (eV)=  -8.87007
220
<  -22.12165  -13.04502  -10.96066  -10.50918   -8.98258   -7.89688   -3.67561
>  -19.28399   -8.67023    0.00000    0.00000    0.00000    0.01153    0.46962
222
<  -22.26807  -13.18474  -12.28698  -10.22016   -9.65640   -8.56333   -2.49552
>  -15.09085   -5.77999    0.00000    0.00000    0.00000    0.04674    1.27840
224
<  ETOT  2  -33.753967030558    -1.609E+00 2.489E-02 4.539E+01
>  ETOT  2  -33.742599336313    -1.709E+00 4.567E-02 5.285E+01
227
<  Fermi (or HOMO) energy (hartree) =  -0.31394   Average Vxc (hartree)=  -0.28713
>  Fermi (or HOMO) energy (hartree) =   0.03747   Average Vxc (hartree)=  -0.28607
230
<   -0.76285   -0.70359   -0.70304   -0.70000   -0.32761   -0.31692   -0.31394
>   -0.74231   -0.49367    0.00000    0.00000    0.00000    0.00000    0.03747
232
<   -0.76125   -0.70569   -0.70382   -0.70077   -0.33349   -0.32170   -0.31985
>   -0.61086   -0.49028    0.00000    0.00000    0.00000    0.00841    0.00945
233
<  Fermi (or HOMO) energy (eV) =  -8.54262   Average Vxc (eV)=  -7.81322
>  Fermi (or HOMO) energy (eV) =   1.01964   Average Vxc (eV)=  -7.78430
236
<  -20.75817  -19.14553  -19.13072  -19.04794   -8.91472   -8.62394   -8.54262
>  -20.19934  -13.43336    0.00000    0.00000    0.00000    0.00000    1.01964
238
<  -20.71471  -19.20279  -19.15196  -19.06899   -9.07466   -8.75393   -8.70365
>  -16.62244  -13.34133    0.00000    0.00000    0.00000    0.22890    0.25711
240
[...]
635
<  band_energy         : -6.32824180479714E+00
>  band_energy         : -1.92065546421853E+00
657
<   sigma(1 1)=  3.86327862E-04  sigma(3 2)=  0.00000000E+00
>   sigma(1 1)=  3.86333051E-04  sigma(3 2)=  0.00000000E+00
658
<   sigma(2 2)=  3.86327862E-04  sigma(3 1)=  0.00000000E+00
>   sigma(2 2)=  3.86333051E-04  sigma(3 1)=  0.00000000E+00
659
<   sigma(3 3)=  3.86327862E-04  sigma(2 1)=  0.00000000E+00
>   sigma(3 3)=  3.86333051E-04  sigma(2 1)=  0.00000000E+00
702
<            strten      3.8632786150E-04  3.8632786150E-04  3.8632786150E-04
>            strten      3.8633305104E-04  3.8633305104E-04  3.8633305104E-04
1091
< +Overall time at end (sec) : cpu=          2.1  wall=          2.7
> +Overall time at end (sec) : cpu=         12.6  wall=         12.6
Summary t24.out: different lines=172, max abs_diff=1.915e+01 (l.238), max rel_diff=1.000e+00 (l.211).
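
To make the summary line easier to read, here is a minimal Python sketch that recomputes the two figures for a single pair of lines. It is not the fldiff implementation. The helper name line_diffs is made up for this illustration, and the relative difference is assumed to be |a-b| / (|a|+|b|), which is consistent with the reported max rel_diff of 1.000e+00 wherever an eigenvalue collapses to zero or changes sign.

```python
import re

# Matches fixed-point and exponent notation such as -19.15196 or 3.86E-04.
FLOAT = re.compile(r"[-+]?\d+\.\d*(?:[eE][-+]?\d+)?")

def line_diffs(ref_line, new_line):
    """Return (max_abs_diff, max_rel_diff) over the numbers found in two lines.

    Assumption: relative difference is |a-b| / (|a|+|b|); this is a guess at
    the convention behind the summary line, not taken from the fldiff source.
    """
    ref = [float(x) for x in FLOAT.findall(ref_line)]
    new = [float(x) for x in FLOAT.findall(new_line)]
    max_abs = max_rel = 0.0
    for a, b in zip(ref, new):
        abs_diff = abs(a - b)
        denom = abs(a) + abs(b)
        rel_diff = abs_diff / denom if denom > 0.0 else 0.0
        max_abs = max(max_abs, abs_diff)
        max_rel = max(max_rel, rel_diff)
    return max_abs, max_rel

# The two versions of output line 238 from the report above.
ref = "  -20.71471  -19.20279  -19.15196  -19.06899   -9.07466   -8.75393   -8.70365"
new = "  -16.62244  -13.34133    0.00000    0.00000    0.00000    0.22890    0.25711"
print(line_diffs(ref, new))
```

For line 238 the largest absolute difference comes from the third eigenvalue (-19.15196 eV in the reference, 0.00000 eV on MIPS), which matches the max abs_diff=1.915e+01 in the summary.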

Is this a known issue, and is there a workaround? Debian does not build the package any differently on MIPS than on, say, ARM or Intel/AMD.

Michael Banck · Answer 1 · Mon Feb 22 2021 22:12:08 GMT+0800 (China Standard Time)

For t11 there is no convergence. The fftalg is also different; could that be relevant?

--- fast/Refs/t11.out   2020-11-10 12:21:53.000000000 +0000
+++ Test_suite/fast_t03-t05-t06-t07-t08-t09-t11-t12-t14-t16/t11.out     2021-02-22 12:42:16.524181381 +0000
[...]
--          fftalg         312
+-          fftalg         112
[...]
@@ -150,37 +150,50 @@
 ================================================================================
  prteigrs : about to open file t11o_EIG
  Non-SCF case, kpt    1 (  0.00000  0.00000  0.00000), residuals and eigenvalues=
-  1.32E-13  2.40E-13  2.23E-13  2.32E-13  1.20E-13  1.48E-13  1.23E-13  6.81E-13
- -2.2607E-01  2.1504E-01  2.1504E-01  2.1504E-01  3.0745E-01  3.0745E-01
-  3.0745E-01  3.2917E-01
+  2.69E-09  6.65E-04  2.13E-03  7.52E-04  5.52E-04  2.20E-03  4.56E-04  1.17E-04
+ -2.2607E-01 -2.7122E-03 -2.8771E-04  0.0000E+00  0.0000E+00  2.2553E-01
+  2.4413E-01  3.2012E-01
+  prteigrs : nnsclo,ikpt=   20    1 max resid (incl. the buffer)=  2.19810E-03
  Non-SCF case, kpt    2 (  0.00000  0.50000  0.50000), residuals and eigenvalues=
-  6.05E-14  3.60E-13  4.78E-14  5.51E-14  5.11E-13  5.09E-13  7.46E-13  1.70E-13
- -7.3665E-02 -7.3665E-02  1.0775E-01  1.0775E-01  2.3725E-01  2.3725E-01
-  5.7958E-01  5.7958E-01
+  3.99E-21  1.76E-03  6.60E-03  3.34E-03  7.73E-03  3.44E-03  1.58E-02  9.46E-07
+ -7.3665E-02 -7.0495E-02 -1.0501E-06 -4.0302E-17  1.4324E-01  2.4859E-01
+  5.4141E-01  5.7958E-01
+  prteigrs : nnsclo,ikpt=   20    2 max resid (incl. the buffer)=  1.58493E-02
[...]
  Non-SCF case, kpt    8 (  0.50000  0.00000  0.00000), residuals and eigenvalues=
-  3.25E-13  5.23E-13  1.79E-13  4.55E-13  7.88E-14  3.97E-13  4.23E-13  7.41E-13
- -1.3943E-01 -4.4132E-02  1.6952E-01  1.6952E-01  2.6640E-01  3.3740E-01
-  3.3740E-01  4.8989E-01
+  1.93E-11  1.22E-09  3.47E-03  1.68E-03  1.60E-03  5.80E-04  6.20E-04  2.71E-18
+ -1.3943E-01 -4.4132E-02 -1.3892E-17 -1.0491E-18  1.2548E-19  1.1470E-17
+  3.2797E-01  4.8989E-01
+  prteigrs : nnsclo,ikpt=   20    8 max resid (incl. the buffer)=  3.47131E-03
+
+
+ scprqt:  WARNING -
+  nstep=   20 was not enough non-SCF iterations to converge;
+  maximum residual=  6.151E-02 exceeds tolwfr=  1.000E-12
 
 
 --- !ResultsGS
@@ -193,7 +206,7 @@
 lattice_lengths: [   7.25711,    7.25711,    7.25711, ]
 lattice_angles: [ 60.000,  60.000,  60.000, ] # degrees, (23, 13, 12)
 lattice_volume:   2.7025701E+02
-convergence: {deltae:  0.000E+00, res2:  0.000E+00, residm:  9.768E-13, diffor:  0.000E+00, }
+convergence: {deltae:  0.000E+00, res2:  0.000E+00, residm:  6.151E-02, diffor:  0.000E+00, }
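
The convergence check in the diff above can be scripted when going through many test outputs. This is a minimal Python sketch, assuming the "convergence: {...}" line keeps the layout shown here. The helper name parse_convergence is made up for this illustration, and the tolwfr value is copied from the scprqt warning.

```python
import re

def parse_convergence(line):
    """Return the key/value pairs of a 'convergence: {...}' line as floats."""
    pairs = re.findall(r"(\w+):\s*([-+]?\d+\.\d*(?:[eE][-+]?\d+)?)", line)
    return {key: float(value) for key, value in pairs}

TOLWFR = 1.0e-12  # tolerance quoted in the scprqt warning above

reference = "convergence: {deltae:  0.000E+00, res2:  0.000E+00, residm:  9.768E-13, diffor:  0.000E+00, }"
mips = "convergence: {deltae:  0.000E+00, res2:  0.000E+00, residm:  6.151E-02, diffor:  0.000E+00, }"

for label, line in (("reference", reference), ("mips", mips)):
    residm = parse_convergence(line)["residm"]
    status = "converged" if residm <= TOLWFR else "NOT converged"
    print(f"{label}: residm={residm:.3e} -> {status}")
```

The reference residual is below tolwfr, while the MIPS residual is about ten orders of magnitude above it.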

Matteo Giantomassi · Answer 2 · Tue Feb 23 2021 09:22:15 GMT+0800 (China Standard Time)

For t11 there is no convergence; also the fftalg is different, could that be relevant?

fftalg 312 corresponds to an external FFTW3 library while fftalg 112 is the internal FFT Fortran library shipped with Abinit. The reference files have been produced using our reference machine and an external FFTW3 compiled with gcc/gfortran.
In principle, the test results should not depend on the FFT library unless there is a serious portability problem in either the external library or the internal version.
(On our test farm we systematically test FFTW3, MKL-DFTI and the internal FFT Fortran routines.)
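
As a quick way to tell which FFT backend a given output file used, here is a small Python sketch. It assumes that the leading digit of fftalg selects the library, and it only knows the two values discussed in this thread (312 and 112). The names FFT_LIBRARY and fft_library are made up for this illustration.

```python
# Leading digit of fftalg -> FFT library, limited to the two cases in this thread.
FFT_LIBRARY = {
    1: "internal FFT Fortran library shipped with Abinit",
    3: "external FFTW3 library",
}

def fft_library(fftalg):
    """Describe the FFT library for an fftalg value such as 112 or 312."""
    leading_digit = fftalg // 100
    return FFT_LIBRARY.get(leading_digit, "unknown (not covered in this thread)")

for value in (312, 112):
    print(value, "->", fft_library(value))
```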

Can you run:

runtests.py unitary

to execute the unit tests for the different FFT interfaces?

If these tests are OK, I would say that fftalg 112 works as expected and the origin of the problem should be found somewhere else.

Note that Abinit v9 differs significantly from v8, both in the build system and in the internal implementation.
I had a look at the log file for mips64el and didn't spot any significant configuration problem. We also have a buildbot worker that tests gnu 10.2.

@jmbeuken

Michael Banck · Answer 3 · Tue Feb 23 2021 17:27:16 GMT+0800 (China Standard Time)

I have to backtrack here: I've tried abinit-8.10.3 with the same toolchain (Debian unstable), and the tests already fail there on MIPS. (They were fine back in September, see https://buildd.debian.org/status/fetch.php?pkg=abinit&arch=mips64el&ver=8.10.3-3&stamp=1601210217&raw=0)

So the cause seems to be something in the newer toolchain (gfortran-10.2?) rather than in Abinit 9.2.2 itself. I will have to dig deeper.