Clang optimizing SSE2NEON_PRECISE_MINMAX incorrectly

Question

markreidvfx opened this issue a year ago

This might be a bug in Clang, but I figured I'd report it here first.

I have a technique I use to clamp NaN values to zero.
It's pretty simple: you exploit the fact that any ordered comparison with NaN is false, e.g. nan > 0.0f == false

#define MIN(a,b) ((a) > (b) ? (b) : (a))
#define MAX(a,b) ((a) > (b) ? (a) : (b))
MIN(amax, MAX(a, amin));

The MAX is done first on purpose: a NaN in a fails the a > amin comparison, so MAX returns amin before MIN ever sees the NaN.
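As a minimal sketch of the scalar technique above, the two macros can be written out as one function. The name clamp_nan_to_min is hypothetical and not part of sse2neon; the selects are the same ones the MIN/MAX macros expand to.

```c
#include <math.h>   /* NAN, for callers that want to try a NaN input */

/* Hypothetical helper: clamps a into [amin, amax] and maps NaN to amin.
 * It mirrors MIN(amax, MAX(a, amin)) from the macros above. */
static float clamp_nan_to_min(float a, float amin, float amax)
{
    /* MAX(a, amin): when a is NaN, (a > amin) is false, so amin is selected. */
    float lo = (a > amin) ? a : amin;

    /* MIN(amax, lo): lo is never NaN here as long as amin is not NaN. */
    return (amax > lo) ? lo : amax;
}
```

With amin = 0.0f and amax = 1.0f, a NaN input comes out as 0.0f, while ordinary values are clamped as usual. Note that operand order matters: the value that may be NaN has to be the first operand of the inner select.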

The SSE2 code is this

_mm_min_ps(amax, _mm_max_ps(a, amin));

I'm having issues with Clang's optimizer breaking this behaviour, so NaNs still propagate.

The NEON min/max instructions propagate NaNs and the SSE2 ones don't (ish), so I've been defining SSE2NEON_PRECISE_MINMAX to 1.
With that, the _mm_max_ps intrinsic becomes

vbslq_f32(vcgtq_f32(a, b), a, b);

This looks perfectly correct to me, but Clang is optimizing this to the fmaxnm instruction. The fmaxnm instruction only handles quiet NaNs; signalling NaNs still propagate. :(

NaNs are handled according to the IEEE 754-2008 standard. If one vector element is numeric and the other is a quiet NaN, the result placed in the vector is the numerical value, otherwise the result is identical to FMAX (scalar).
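Since the quoted behaviour depends on whether a NaN is quiet or signalling, a small bit-level check may help when inspecting failing inputs. This is a sketch that assumes the IEEE 754-2008 binary32 encoding used on ARM and x86, where a set most-significant mantissa bit marks a quiet NaN; the function names are hypothetical.

```c
#include <stdint.h>

/* binary32 layout: 1 sign bit, 8 exponent bits, 23 mantissa bits. */
#define F32_EXP_MASK   UINT32_C(0x7F800000)
#define F32_MANT_MASK  UINT32_C(0x007FFFFF)
#define F32_QUIET_BIT  UINT32_C(0x00400000)  /* assumed: set means quiet NaN */

/* NaN: exponent all ones and a non-zero mantissa. */
static int is_nan_bits(uint32_t bits)
{
    return (bits & F32_EXP_MASK) == F32_EXP_MASK && (bits & F32_MANT_MASK) != 0;
}

/* Quiet NaN: a NaN with the top mantissa bit set. */
static int is_quiet_nan_bits(uint32_t bits)
{
    return is_nan_bits(bits) && (bits & F32_QUIET_BIT) != 0;
}

/* Signalling NaN: a NaN with the top mantissa bit clear. */
static int is_signaling_nan_bits(uint32_t bits)
{
    return is_nan_bits(bits) && (bits & F32_QUIET_BIT) == 0;
}
```

For example, the bit pattern 0x7FC00000 classifies as a quiet NaN and 0x7FA00000 as a signalling NaN, while 0x7F800000 (infinity) is not a NaN at all. Working on raw bits avoids passing a signalling NaN through floating-point operations that might quiet it.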

https://developer.arm.com/documentation/ddi0596/2021-12/SIMD-FP-Instructions/FMAXNM--vector---Floating-point-Maximum-Number--vector--

Here is a small program illustrating this happening
https://godbolt.org/z/eE1G3Gcov

I'm currently working around this by using inline assembly.

Cuda Chen · Answer 1 · Mon Aug 14 2023 21:01:37 GMT+0800 (China Standard Time)

Hi @markreidvfx ,

From my personal point of view, I think this may be an issue with Clang.
GCC with the -O3 flag emits fcmgt, and, and bsl instead.
Here is a small program (modified from your example) for illustration: https://godbolt.org/z/sfrKbx1e8

One more thing: kindly leave a link to the discussion on the Clang forum if possible.

Mark Reid · Answer 2 · Tue Aug 15 2023 00:16:12 GMT+0800 (China Standard Time)

Yes, that's my opinion too, especially since the code works if you compile in debug.
I'll report it to Clang and see what they say.

The same thing can also happen with scalar code.
https://godbolt.org/z/d4j9418Kx

I can trick the compiler by subtly changing the clamp function, but who knows how long that will last...
https://godbolt.org/z/rq36Trb4d