DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

Support for ARMv8 in 32-bit execution mode

jonathanhue opened this issue · comments

I just started using sse2neon last week to migrate some code, and found that I needed some changes to support ARMv8 in 32-bit execution mode. The code seems to assume all 32-bit targets are ARMv7 and ARMv8 is always in 64-bit execution mode. My target is 32-bit, but running on ARMv8 (Cortex-A72/A73), and I want to take advantage of new ARMv8 instructions available in 32-bit execution mode, for example, directed rounding on vectors. An example command line is

$ clang++-10 -O2 -mcpu=cortex-a73 -mfpu=neon-fp-armv8 -dM -E - < /dev/null |egrep -i 'arm_arch |_state |directed'
#define __ARM_32BIT_STATE 1
#define __ARM_ARCH 8
#define __ARM_FEATURE_DIRECTED_ROUNDING 1

$ clang-10 -v 2>&1 | grep -i target
Target: armv7l-unknown-linux-gnueabihf

Right around line 125 the compiler will bail out because my target will define __ARM_ARCH 8, but __aarch64__ is not defined because my target is 32-bit. So I had to add something like this:

#elif __ARM_ARCH == 8
#if !defined(__ARM_NEON) || !defined(__ARM_NEON__)
#error "You must enable NEON instructions (e.g. -mfpu=neon-fp-armv8) to use SSE2NEON."
#endif
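To make the gap concrete, here is a minimal sketch of how the three cases could be told apart at compile time. The helper name sketch_target_class is hypothetical and not part of sse2neon; the macros are the ones shown in the clang output above, and the branch order is an assumption about how such a check might be arranged.

```c
/* Hypothetical helper (not part of sse2neon): names the target class that
 * the preprocessor sees, so the 32-bit ARMv8 case is visible on its own. */
static const char *sketch_target_class(void)
{
#if defined(__aarch64__)
    /* 64-bit execution state. */
    return "aarch64";
#elif defined(__ARM_ARCH) && (__ARM_ARCH >= 8) && defined(__ARM_32BIT_STATE)
    /* ARMv8 (or later) running in 32-bit execution state. */
    return "armv8-a32";
#elif defined(__ARM_ARCH) && (__ARM_ARCH == 7)
    return "armv7";
#else
    /* Not an Arm target this sketch knows about. */
    return "other";
#endif
}
```

With the command line from the report, this would be expected to take the second branch, which is the case the existing check rejects.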

Then there are some features that are conditional on __aarch64__, but which are available on ARMv8 in 32-bit mode. For example, around line 4000, there is this:

// https://msdn.microsoft.com/en-us/library/vstudio/xdc42k5e(v=vs.100).aspx
// *NOTE*. The default rounding mode on SSE is 'round to even', which ARMv7-A
// does not support! It is supported on ARMv8-A however.
FORCE_INLINE __m128i _mm_cvtps_epi32(__m128 a)
{
#if defined(__aarch64__)
    switch (_MM_GET_ROUNDING_MODE()) {
    case _MM_ROUND_NEAREST:

I believe this would be better if conditional on __ARM_FEATURE_DIRECTED_ROUNDING instead of __aarch64__, since it is available in 32-bit execution mode on ARMv8. There are probably other examples of things like this. The directed rounding intrinsics are not available for 32-bit ARM targets on any version of GCC as far as I can tell, but are available in clang back to version 8 at least (tested on Godbolt Compiler Explorer), and on GCC for aarch64 targets (and __ARM_FEATURE_DIRECTED_ROUNDING is properly set to indicate support).
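As a rough sketch of what gating on the feature macro could look like: the function below uses vcvtnq_s32_f32 (round to nearest, ties to even) when __ARM_FEATURE_DIRECTED_ROUNDING is defined, and a scalar fallback otherwise. The names sketch_cvt_nearest and sketch_round_half_even are hypothetical, the fallback is only for illustration and is not the code sse2neon uses, and it assumes inputs that fit in an int32_t.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_DIRECTED_ROUNDING) && defined(__ARM_NEON)
#include <arm_neon.h>

/* Hypothetical helper: convert four floats to int32 using the ARMv8
 * directed-rounding intrinsic (nearest, ties to even). */
static void sketch_cvt_nearest(const float in[4], int32_t out[4])
{
    vst1q_s32(out, vcvtnq_s32_f32(vld1q_f32(in)));
}
#else
/* Illustrative scalar fallback: round to nearest, ties to even.
 * Assumes x is within the range of int32_t. */
static int32_t sketch_round_half_even(float x)
{
    int32_t t = (int32_t) x;      /* truncates toward zero */
    float frac = x - (float) t;   /* signed fractional part */
    if (frac > 0.5f || (frac == 0.5f && (t & 1)))
        return t + 1;
    if (frac < -0.5f || (frac == -0.5f && (t & 1)))
        return t - 1;
    return t;
}

static void sketch_cvt_nearest(const float in[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = sketch_round_half_even(in[i]);
}
#endif
```

Because the gate is the feature macro rather than __aarch64__, a 32-bit clang build for ARMv8 would take the intrinsic path, while a 32-bit GCC build, which does not define the macro, would take the fallback.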

Thanks, @jonathanhue, for pointing this out. Can you send pull requests that enable 32-bit support for ARMv8-A based cores? Confirming it will definitely involve a variety of tests.

I am looking into whether the existing GitHub Actions workflow can be extended to cover this scenario.

Sure, expect something in the next day or two.

To identify the new intrinsics available on ARMv8 while in 32-bit execution mode, I went to this page: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html and then did a grep for anything that was listed as "A32/A64" (i.e. not available for v7). That gave me 582 intrinsics, of which 578 were unique.

Because there are so many new intrinsics, the approach I would take is to find everything that is currently conditional on __aarch64__, and see if the NEON intrinsics used are available on A32. Those would be the candidates that could be changed from being conditional on __aarch64__ to being conditional on __ARM_ARCH == 8, or some other preprocessor macro such as __ARM_FEATURE_DIRECTED_ROUNDING instead. In theory, this wouldn't break anything, since currently __ARM_ARCH is going to be 8 if __aarch64__ is defined, and 32-bit on ARMv8 wasn't supported before.
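The "wouldn't break anything" argument can be written down as a small check. This is only a sketch with hypothetical helper names: one function mirrors the existing __aarch64__ gate, the other mirrors a gate on __ARM_ARCH. It uses >= 8 rather than == 8 on the assumption that some toolchains report a larger value for later architecture revisions.

```c
/* Mirrors the existing gate: true only when building for 64-bit Arm. */
static int sketch_gate_aarch64(void)
{
#if defined(__aarch64__)
    return 1;
#else
    return 0;
#endif
}

/* Mirrors the proposed gate: true for any ARMv8-or-later target,
 * whether it is in 32-bit or 64-bit execution state. */
static int sketch_gate_armv8(void)
{
#if defined(__ARM_ARCH) && (__ARM_ARCH >= 8)
    return 1;
#else
    return 0;
#endif
}
```

If every target that passes the first gate also passes the second, switching a block from one to the other cannot disable it on a platform that was already supported. Note that __ARM_ARCH alone says nothing about NEON, so a real gate would probably also need __ARM_NEON or a specific __ARM_FEATURE_XXX macro.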

a32new_uniq.txt

Attached file is the list of 578 new intrinsics available for A32. Any feedback on this approach is welcome.

Edit: Also, this is made more complicated by the fact that GCC doesn't support many of the new intrinsics when compiling for 32-bit, while clang does.

I think a low-risk first step would be to enable use of the new intrinsics whose support is indicated by one of the __ARM_FEATURE_XXX macros. Compared to compiling for ARMv7, compiling for 32-bit ARMv8 adds the following with clang:

#define __ARM_FEATURE_CRC32 1
#define __ARM_FEATURE_DIRECTED_ROUNDING 1
#define __ARM_FEATURE_IDIV 1
#define __ARM_FEATURE_NUMERIC_MAXMIN 1

GCC adds:

#define __ARM_FEATURE_IDIV 1
#define __ARM_FEATURE_NUMERIC_MAXMIN 1

and only defines __ARM_FEATURE_CRC32 if the -march includes it, i.e. -march=armv8-a+crc. CRC is enabled by default for armv8 on clang, and requires -mnocrc to disable it.

For both gcc and clang, specifying "-mfpu=crypto-neon-fp-armv8" enables __ARM_FEATURE_CRYPTO.
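One way to see which of these macros a given compiler and flag combination actually defines, without reading the -dM output by hand, is to fold them into a bitmask. This is a sketch only; the enum and function names are hypothetical, and the set of macros is limited to the ones mentioned above.

```c
/* Hypothetical flag bits, one per feature macro discussed above. */
enum {
    SKETCH_FEAT_CRC32 = 1u << 0,
    SKETCH_FEAT_DIRECTED_ROUNDING = 1u << 1,
    SKETCH_FEAT_IDIV = 1u << 2,
    SKETCH_FEAT_NUMERIC_MAXMIN = 1u << 3,
    SKETCH_FEAT_CRYPTO = 1u << 4
};

/* Returns a mask of the __ARM_FEATURE_XXX macros the compiler defined
 * for this translation unit. */
static unsigned sketch_feature_mask(void)
{
    unsigned mask = 0;
#if defined(__ARM_FEATURE_CRC32)
    mask |= SKETCH_FEAT_CRC32;
#endif
#if defined(__ARM_FEATURE_DIRECTED_ROUNDING)
    mask |= SKETCH_FEAT_DIRECTED_ROUNDING;
#endif
#if defined(__ARM_FEATURE_IDIV)
    mask |= SKETCH_FEAT_IDIV;
#endif
#if defined(__ARM_FEATURE_NUMERIC_MAXMIN)
    mask |= SKETCH_FEAT_NUMERIC_MAXMIN;
#endif
#if defined(__ARM_FEATURE_CRYPTO)
    mask |= SKETCH_FEAT_CRYPTO;
#endif
    return mask;
}
```

Printing the mask from a tiny test program built with each compiler and -mfpu/-march setting would show the clang and GCC differences described above side by side.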

I've made the changes to utilize the new rounding and CRC intrinsics available for 32-bit targets on ARMv8. The directed rounding intrinsics help quite a bit; I have a function that now processes samples at a 35% higher rate. I did run into a problem with older versions of GCC and an include file that GCC itself supplies.

The CRC intrinsics are declared in a separate include file (arm_acle.h), and I want to include it in sse2neon.h so the SSE CRC intrinsics such as _mm_crc32_u16 can call them. Due to this bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81497, some older versions of GCC will fail to compile without the -fpermissive flag. It is fixed in 8.2 and later versions, but broken in 8.1 and 7.X. So I'm not sure what to do. I'm inclined to leave it as-is, rather than disable their use when a broken compiler is detected, so that the user can take advantage of them by using the -fpermissive flag.

@jonathanhue, regarding gcc prior to version 8.2, what fix would you propose?

I had to go back and review the problem. The problem is isolated to 32-bit ARMv8 builds. The bug doesn't exist when compiling for 64-bit, nor on ARMv7 since ARMv7 doesn't have the CRC instructions. So previously supported platforms aren't affected by this.

I don't have a great solution, but there is probably a less bad one than the current behavior. Currently, on versions of GCC prior to 8.2, the #include of <arm_acle.h> results in a compiler error unless "-fpermissive" is used. This prevents users who either 1) are not using the _mm_crc32_u*() intrinsics, or 2) are using them but are not performance sensitive, from compiling without -fpermissive. Compiling with -fpermissive may be undesirable for several reasons, so ideally it shouldn't be forced upon users.

What I think would be better, for the older compilers that cannot #include <arm_acle.h> without error, is to not include the header by default and to fall back to the C implementation. That way the code compiles without -fpermissive; users just don't get the improved performance. There should then be a preprocessor macro that users of an old compiler can define to include the header and get the accelerated implementation. They would need to know that they must also pass -fpermissive when compiling code that #includes <arm_acle.h> (and they can isolate it to a small file if desired).
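A minimal sketch of that fallback scheme, for the single-byte CRC32C step only. Everything prefixed with SKETCH_ or sketch_ is hypothetical, including the opt-in macro SKETCH_FORCE_ARM_ACLE; the version test assumes the affected releases are exactly "GCC before 8.2" as described above. The software path is the usual bitwise CRC32C update with the reflected Castagnoli polynomial.

```c
#include <stdint.h>

/* GCC releases before 8.2 (clang also defines __GNUC__, so exclude it). */
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ < 8 || (__GNUC__ == 8 && __GNUC_MINOR__ < 2))
#define SKETCH_GCC_ACLE_BROKEN 1
#else
#define SKETCH_GCC_ACLE_BROKEN 0
#endif

/* Include arm_acle.h only when CRC is available and either the target is
 * 64-bit, the compiler is not affected, or the user explicitly opted in. */
#if defined(__ARM_FEATURE_CRC32) &&                      \
    (defined(__aarch64__) || !SKETCH_GCC_ACLE_BROKEN || \
     defined(SKETCH_FORCE_ARM_ACLE))
#include <arm_acle.h>
#define SKETCH_USE_HW_CRC 1
#else
#define SKETCH_USE_HW_CRC 0
#endif

/* One CRC32C step over a single byte, hardware or software. */
static uint32_t sketch_crc32c_u8(uint32_t crc, uint8_t v)
{
#if SKETCH_USE_HW_CRC
    return __crc32cb(crc, v);
#else
    crc ^= v;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0x82F63B78) : (crc >> 1);
    return crc;
#endif
}
```

A user on an affected compiler who wants the hardware path would define SKETCH_FORCE_ARM_ACLE and add -fpermissive for the file that includes this code; everyone else compiles cleanly and gets the software loop.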

By the way, should we support __ARM_FEATURE_SIMD32?
The ARM C Language Extensions specification describes the __ARM_FEATURE_SIMD32 macro as follows:

__ARM_FEATURE_SIMD32 is defined to 1 if the 32-bit SIMD instructions are supported and the intrinsics defined in 9.5 are available. This also implies support for the GE global flags which indicate byte-by-byte comparison results.

That is, #if defined(__ARM_NEON) would become #if defined(__ARM_FEATURE_SIMD32) || defined(__ARM_NEON).

Hi @jserv and @jonathanhue,

For ARMv8-A 32-bit CI build, maybe we can use Cortex-A32 as this processor only supports AArch32 (the thought is inspired by this simde issue).

You can look at this commit in my fork. I would like to use it as the starting point for solving this issue, if both of you consider the approach feasible.

I am not sure if ARMV8_A_32BIT is a good name to refer to the 32-bit ISA in ARMv8-A. Would it be better to specify A32 instead?

Wikipedia defines AArch32 as a 32-bit architecture, while A32 is commonly associated with an instruction set.
See also: Distinguishing between 32-bit and 64-bit A64 instructions

Hi @jserv,

Thanks for the information. I think the so-called "ARMv8 in 32-bit execution mode" should refer to the A32 instruction set.

Then you can add A32-specific items to the CI pipeline, so that we can validate.

Got it. I will create a PR for this.

Closing this, as it was completed in #620.