DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

Problem with _mm_alignr_epi8 and constants

bigianb opened this issue

I'm seeing the following error in my code:

GSVector4i.h:676:21: error: argument value -8 is outside the valid range [0, 15]
                return GSVector4i(_mm_alignr_epi8(v.m, m, i));
                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~
sse2neon.h:6552:44: note: expanded from macro '_mm_alignr_epi8'
                    vreinterpretq_m128i_u8(vextq_u8(tmp_low, tmp_high, idx)); \
                                           ^                           ~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.5/include/arm_neon.h:6812:24: note: expanded from macro 'vextq_u8'
  __ret = (uint8x16_t) __builtin_neon_vextq_v((int8x16_t)__s0, (int8x16_t)__s1, __p2, 48); \
                       ^                                                        ~~~~
sse2neon.h:222:56: note: expanded from macro 'vreinterpretq_m128i_u8'
#define vreinterpretq_m128i_u8(x) vreinterpretq_s64_u8(x)
                                                       ^
GSBlock.h:1280:12: note: in instantiation of function template specialization 'GSVector4i::srl<8>' requested here
                        v4 = v5.srl<8>(v6);
                                ^
6 errors generated.

The root cause is instantiating the following template with i = 8:

	template <int i>
	__forceinline GSVector4i srl(const GSVector4i& v)
	{
		return GSVector4i(_mm_alignr_epi8(v.m, m, i));
	}

That should be fine, since 8 is well within the valid range, but the compiler rejects the last line of the snippet below:

#define _mm_alignr_epi8(a, b, imm)                                            \
    __extension__({                                                           \
        __m128i ret;                                                          \
        if (_sse2neon_unlikely((imm) >= 32)) {                                \
            ret = _mm_setzero_si128();                                        \
        } else {                                                              \
            uint8x16_t tmp_low, tmp_high;                                     \
            if (imm >= 16) {                                                  \
                const int idx = imm - 16;                                     \
                tmp_low = vreinterpretq_u8_m128i(a);                          \
                tmp_high = vdupq_n_u8(0);                                     \
                ret =                                                         \
                    vreinterpretq_m128i_u8(vextq_u8(tmp_low, tmp_high, idx)); \

It looks like the macro expansion of vextq_u8 is triggering the compiler's bounds check on its immediate argument. With imm = 8 the imm >= 16 branch can never execute, but the code is still generated, so idx = imm - 16 = -8 is constant-propagated into vextq_u8 and rejected as outside [0, 15]. Whether the last line gets expanded and checked is, I guess, down to the whim of the optimiser.
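To make the failure mode concrete, here is a minimal portable sketch of one possible workaround: clamp the index in each branch so that even the branch that cannot run still receives a value in [0, 15]. This is not sse2neon's code. `Bytes`, `iota_bytes`, `ext_u8` and `alignr_model` are hypothetical names, and `ext_u8` is a scalar stand-in for `vextq_u8` that uses a `static_assert` to imitate the compiler's range check on the immediate.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes = std::array<std::uint8_t, 16>;

// Hypothetical helper: 16 consecutive byte values starting at `start`.
inline Bytes iota_bytes(std::uint8_t start)
{
    Bytes r{};
    for (int k = 0; k < 16; ++k)
        r[k] = static_cast<std::uint8_t>(start + k);
    return r;
}

// Hypothetical scalar stand-in for vextq_u8(lo, hi, Idx): bytes Idx..15 of lo
// followed by bytes 0..Idx-1 of hi. Like the intrinsic, it rejects an
// out-of-range index at compile time, even in code that can never run.
template <int Idx>
Bytes ext_u8(const Bytes& lo, const Bytes& hi)
{
    static_assert(Idx >= 0 && Idx <= 15, "index outside the valid range [0, 15]");
    Bytes r{};
    for (int k = 0; k < 16; ++k)
        r[k] = (k + Idx < 16) ? lo[k + Idx] : hi[k + Idx - 16];
    return r;
}

// Hypothetical model of _mm_alignr_epi8(a, b, Imm): shift the 32-byte value
// a:b (a in the high half) right by Imm bytes and keep the low 16 bytes.
template <int Imm>
Bytes alignr_model(const Bytes& a, const Bytes& b)
{
    static_assert(Imm >= 0, "negative shift");
    if (Imm >= 32) {
        return Bytes{};
    }
    if (Imm >= 16) {
        // Writing ext_u8<Imm - 16> here would fail for Imm = 8 (index -8),
        // because both branches are instantiated. The clamp yields 0 instead.
        return ext_u8<(Imm >= 16 && Imm < 32 ? Imm - 16 : 0)>(a, Bytes{});
    }
    // Same clamp for the low branch, which is dead when Imm >= 16.
    return ext_u8<(Imm < 16 ? Imm : 0)>(b, a);
}

// Small self-check: a holds bytes 16..31, b holds bytes 0..15.
inline void alignr_self_check()
{
    const Bytes a = iota_bytes(16), b = iota_bytes(0);
    assert(alignr_model<8>(a, b)[0] == 8);    // low bytes come from b
    assert(alignr_model<8>(a, b)[15] == 23);  // high bytes come from a
    assert(alignr_model<20>(a, b)[0] == 20);  // only a remains
    assert(alignr_model<20>(a, b)[12] == 0);  // zero-filled tail
}
```

The clamped value is only ever used in the branch where it equals the real index, so the result is unchanged; the clamp exists purely to keep the dead branch well-formed. With C++17, `if constexpr` would be another way to avoid instantiating the unused branch, but that is not available to a C macro.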

This is compiled on an M1 Mac mini with the latest Xcode:

% gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 12.0.5 (clang-1205.0.22.11)
Target: x86_64-apple-darwin20.6.0
Thread model: posix