DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Test failures with newer versions of clang/gcc using O1 and O2

anthony-linaro opened this issue · comments

After having a little more of a dig in relation to #633 - it appears that tests are failing on all platforms, including linux, when using newer versions of clang (tested against LLVM 14 through to LLVM 18), and using higher optimization levels.

GCC also has a subset of these failures.

This is reproducible on Linux (WSL - below) and Windows (#633).

A log (thanks to @pbo-linaro) is as follows:

$ clang++ --version
Debian clang version 14.0.6
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ clang++ tests/*.cpp -I . -O0 -o test
$ ./test  |& grep -i failed
Failed: 0

$ clang++ tests/*.cpp -I . -O1 -o test
$ ./test  |& grep -i failed
Test mm_set_rounding_mode           failed
Test mm_setcsr                      failed
Test mm_storeu_si64                 failed
Test mm_set_denormals_zero_mode     failed
Failed:  4

$ clang++ tests/*.cpp -I . -O2 -o test
$ ./test  |& grep -i failed
Test mm_set_rounding_mode           failed
Test mm_setcsr                      failed
Test mm_storeu_si64                 failed
Test mm_cvtpd_epi32                 failed
Test mm_cvtpd_pi32                  failed
Test mm_cvttpd_epi32                failed
Test mm_cvttpd_pi32                 failed
Test mm_set_denormals_zero_mode     failed
Failed:  8

Nothing suspect is returned by fsanitize=address or fsanitize=undefined.

At this point it's a little beyond my level of expertise with SIMD/NEON, so it would be appreciated if someone a little more versed could have a look.

After looking a bit more into this, it seems like it happens when non initialized m128 variables are used.
The way the tests are written is breaking the strict aliasing rule. In general, you are not supposed to access a vector word, using integer pointers.

I don't know why MSVC has no issue with it, but it seems like clang and gcc are both taking advantage and optimize away some parts.

g++ -Wstrict-aliasing=1 -O3 tests/*.cpp -I . , will return a list of the problems.

Hi @anthony-linaro @pbo-linaro I am trying to set up CI to cover more compilers. Whether do you guys know some repos I can refer how to run CI with multiple c compiler on ARM machine?

It's not necessarily needed to have access to an ARM machine.
It's possible to cross compile the tests and run the binary using qemu (on linux).
In reality, it's already what is done for current CI.

What is missing today is to run CI with optimizations enabled (-O2 or -O3). This can be easily done by adding this flag in project Makefile, and it will expose the issues we presented here.

I will add the optimizations enabled tests, but in case of other potential issues, I am thinking may be testing with more versions of compilers is a good idea

It's a good idea indeed! LLVM offer convenient debian/ubuntu repositories to grab latest version.

Hi @anthony-linaro @pbo-linaro sorry to answer late. I am curious the way you identify the errors are caused by strict aliasing rule?
My ability to identify the errors here is weak. However, I think the common function that all the error tests call is _sse2neon_get_fpcr() and _sse2neon_set_fpcr(). In these two functions something like

__asm__ __volatile__("vmrs %0, FPSCR" : "=r"(r.value)); /* read */

is called. I think maybe the optimization option change the order or ignore this part?

In general, errors appearing with optimizations enabled are due to undefined behavior (especially when different compilers trigger errors - clang and gcc in our case).

When I took a look at the code, I noticed that 128bits integers were stored under an int* pointer, and I know this is undefined in C/C++. Why does it trigger an issue only in some functions, and with uninitialized values? I don't know. But as it is undefined behavior, your compiler is free to do what it wants, so investigating the why is not really worth: it's faster to fix the code instead.

gcc strict aliasing warning can have false positives, but it confirmed that something could be wrong: g++ -Wstrict-aliasing=1 -O3 tests/*.cpp -I ..
Another tip I used is to declare some variables as volatile, which, in general, prevent the compiler to optimize or reorder things concerning them. But it should not be used as a solution (people too often rely on this for multithreaded code, and optimized code with issues, and it's just a workaround).

About the error left, I don't have bandwidth to investigate now, but I suggest that you look at generated assembly code for concerned tests (using llvm-objdump -d test.exe), and compare between optimized and non optimized version. You should be able to see if any reordering was done, or if something is unexpected.
In general, this is a good exercise, and you should definitely learn something about this!

Hope this helps!

I still can't identify the error. Here is the minimal test that produce the same error

#include <arm_neon.h>

#define _MM_ROUND_NEAREST 0x0000
#define _MM_ROUND_DOWN 0x2000
#define _MM_ROUND_UP 0x4000
#define _MM_ROUND_TOWARD_ZERO 0x6000

#define _MM_SET_ROUNDING_MODE(rounding)                                        \
  do {                                                                         \
    volatile union {                                                           \
      fpcr_bitfield field;                                                     \
      uint64_t value;                                                          \
    } r;                                                                       \
                                                                               \
    __asm__ __volatile__("mrs %0, FPCR" : "=r"(r.value));                      \
                                                                               \
    switch (rounding) {                                                        \
    case _MM_ROUND_TOWARD_ZERO:                                                \
      r.field.bit22 = 1;                                                       \
      r.field.bit23 = 1;                                                       \
      break;                                                                   \
    case _MM_ROUND_DOWN:                                                       \
      r.field.bit22 = 0;                                                       \
      r.field.bit23 = 1;                                                       \
      break;                                                                   \
    case _MM_ROUND_UP:                                                         \
      r.field.bit22 = 1;                                                       \
      r.field.bit23 = 0;                                                       \
      break;                                                                   \
    default: /* _MM_ROUND_NEAREST */                                           \
      r.field.bit22 = 0;                                                       \
      r.field.bit23 = 0;                                                       \
    }                                                                          \
                                                                               \
    __asm__ __volatile__("msr FPCR, %0" ::"r"(r.value)); /* write */           \
  } while (0)

typedef float32x4_t __m128;

typedef struct {
  uint16_t res0;
  uint8_t res1 : 6;
  uint8_t bit22 : 1;
  uint8_t bit23 : 1;
  uint8_t bit24 : 1;
  uint8_t res2 : 7;
#if defined(__aarch64__) || defined(_M_ARM64)
  uint32_t res3;
#endif
} fpcr_bitfield;

#define ASSERT_RETURN(x)                                                       \
  if (!(x))                                                                    \
    return 1;

int _validate128(__m128 c, __m128 b) {
  uint32x4_t result = vceqq_f32(c, b);
  return (vminvq_u32(result) == UINT32_MAX) ? 0 : 1;
}

int target() {
  int res_to_neg_inf = 0, res_nearest = 0;

  float _a[4] = {1.3, 4.7, 100.1, 23123.667};
  __m128 a = vld1q_f32(_a);
  __m128 b, c;

  _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
  b = vrndiq_f32(a);
  c = vrndmq_f32(a);
  res_to_neg_inf = validate128(b, c);
  if (res_to_neg_inf) {
    return 3;
  }

  _MM_SET_ROUNDING_MODE(_MM_ROUND_NEAREST);
  b = vrndiq_f32(a);
  c = vrndnq_f32(a);
  res_nearest = validate128(b, c);
  if (res_nearest) {
    return 5;
  }
  return 0;
}

int main() {
  int t = target();

  return t;
}

The compile command is

$ gcc -Wall -Wcast-qual -Wstrict-aliasing=1 -O3 -g -I. -std=gnu11 -march=armv8-a+fp+simd main.c -lm -o main

If there only one rounding method is used (in the above one is two), the test can pass.
One thing I noticed is, in both failed and success one, I can't find the rounding instructions. All three rounding instructions disappeared (FRINTI, FRINTM, FRINTN). All these instructions exist in the test where optimization is disabled.

I guess this could be the reason.

Thanks for reducing the test case.

I investigated this using clang. At -O0, it works, and at -O1 it fails.
First, I would like to say the round mode setting code is correct, after checking arm documentation and printing values set for fpcr register, so there is no bug on this side.

I noticed that deactivating optimization on variable __m128 a (either make it volatile, or declare it globally) solved the problem at -O1.

--- a/main.c
+++ b/main.c
@@ -62,7 +62,7 @@ int target() {
   int res_to_neg_inf = 0, res_nearest = 0;

   float _a[4] = {1.3, 4.7, 100.1, 23123.667};
-  __m128 a = vld1q_f32(_a);
+  volatile __m128 a = vld1q_f32(_a);
   __m128 b, c;

   _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);

By comparing the assembly generated (llvm-objdump -d main), you'll notice that the version failing is more optimized, and b, c values are completely computed at compile time (see attached picture).

Screenshot_2024-06-20_13-03-36

What happens is that the compiler has no idea that you changed the default rounding mode (by setting fpcr register), and thus, optimize happily with the default IEEE rounding mode (to nearest).
To extend this, in the original sse2neon test case, adding inline optimizations will amplify the problem: if you change several time rounding mode in the same generated function, the compiler won't notice it and optimize as much as it can.

I didn't look as extensively gcc generated code, but I'm pretty sure we just see a compiler optimizing code without knowing the semantic of rounding mode.
As long as a constant floating point value will be visible, and we use builtin to operate on that, the compiler will optimize all this aggressively.

I tried setting fp rounding using standard function fesetround from fenv.h, but it didn't solve the issue.
gcc has an option (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-frounding-math) to explicitly disable those optimizations, but when compiling the test, it does not change the output (and clang does not seem to support it on arm64).

I don't see which portable solution can be used to guarantee a compiler won't do those optimizations (playing with volatile may work, but it's not what we want). I would avoid to change the rounding mode in the middle of a function, and perform computations associated to a given rounding mode in a separate function, marked non inline.

Another very expensive solution would be to replace all use of compilers builtin in sse2neon with inline assembly code, to prevent the compiler from optimizing those instructions away. But I'm really not sure it's the right solution to solve this. Program generally set rounding mode once, and don't play with it all the time, it's really a special case.

At this point, I would find interesting to raise this issue on llvm project, or gcc mailing list if you want to dig further.

@pbo-linaro Thank you for the super informative explanation. It is a valuable opportunity to broaden the field I didn't experience, and I think that would be cool to report to llvm and gcc.
And for sse2neon part. I think I can add volatile modifier to the __m128 variables in the relating tests and add the optimization options to CI. In the meantime, descriptions will be appended to the functions which mention the potential error during changing the rounding mode.

@jserv what do you think?

You can definitely try that and check tests now work on all optimization levels with gcc and clang.

Another solution is to disable optimization explicitly for functions that manipulate the rounding mode (see this). It is probably more reliable than playing with volatile.

You can definitely try that and check tests now work on all optimization levels with gcc and clang.

Another solution is to disable optimization explicitly for functions that manipulate the rounding mode (see this). It is probably more reliable than playing with volatile.

I agree with the option to disable optimizations for certain rounding-specific functions. It can be applied with some GCC attributes.

#640 is considered as a feasible solution. @anthony-linaro, please check if it fits.

Good work!

I can still observe some failures:

$ g++ --version
g++ (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ clang++ --version
Debian clang version 14.0.6
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
$ g++ -O0 tests/*.cpp -I . && ./a.out |& grep failed
$ g++ -O1 tests/*.cpp -I . && ./a.out |& grep failed
$ g++ -O2 tests/*.cpp -I . && ./a.out |& grep failed
Test mm_dp_pd                       failed
$ g++ -O3 tests/*.cpp -I . && ./a.out |& grep failed
Test mm_dp_pd                       failed
$ clang++ -O0 tests/*.cpp -I . && ./a.out |& grep failed
$ clang++ -O1 tests/*.cpp -I . && ./a.out |& grep failed
$ clang++ -O2 tests/*.cpp -I . && ./a.out |& grep failed
Test mm_cvttpd_epi32                failed
Test mm_cvttpd_pi32                 failed
$ clang++ -O3 tests/*.cpp -I . && ./a.out |& grep failed
Test mm_cvttpd_epi32                failed
Test mm_cvttpd_pi32                 failed

Thank you for testing. I will fix them together with
#638

I think some of them may cause by breaking the strict-aliasing rules