Support for ARMv8 in 32-bit execution mode

jonathanhue opened this issue 2 years ago · comments

I just started using sse2neon last week to migrate some code, and found that I needed some changes to support ARMv8 in 32-bit execution mode. The code seems to assume all 32-bit targets are ARMv7 and ARMv8 is always in 64-bit execution mode. My target is 32-bit, but running on ARMv8 (Cortex-A72/A73), and I want to take advantage of new ARMv8 instructions available in 32-bit execution mode, for example, directed rounding on vectors. An example command line is

$ clang++-10 -O2 -mcpu=cortex-a73 -mfpu=neon-fp-armv8 -dM -E - < /dev/null |egrep -i 'arm_arch |_state |directed'
#define __ARM_32BIT_STATE 1
#define __ARM_ARCH 8
#define __ARM_FEATURE_DIRECTED_ROUNDING 1

$ clang-10 -v 2>&1 | grep -i target
Target: armv7l-unknown-linux-gnueabihf

Right around line 125 the compiler will bail out because my target will define __ARM_ARCH 8, but __aarch64__ is not defined because my target is 32-bit. So I had to add something like this:

#elif __ARM_ARCH == 8
#if !defined(__ARM_NEON) || !defined(__ARM_NEON__)
#error "You must enable NEON instructions (e.g. -mfpu=neon-fp-armv8) to use SSE2NEON."
#endif
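As a rough illustration of the distinction being drawn here, the sketch below classifies the compilation target from the ACLE predefined macros. The enum and function names (`arm_target`, `classify_target`, `target_name`) are hypothetical and not part of sse2neon; the only assumption taken from the issue is that a 32-bit ARMv8 build defines `__ARM_ARCH` as 8 without defining `__aarch64__`.

```c
#include <assert.h>

/* Hypothetical classification of the compilation target (not sse2neon API). */
enum arm_target {
    TARGET_NOT_ARM = 0,
    TARGET_ARM_PRE_V7,
    TARGET_ARMV7,
    TARGET_ARMV8_AARCH32, /* ARMv8 core, 32-bit execution state */
    TARGET_AARCH64,
    TARGET_COUNT
};

/* Decide the target purely from compiler-predefined macros. */
enum arm_target classify_target(void)
{
#if defined(__aarch64__)
    return TARGET_AARCH64;
#elif defined(__ARM_ARCH) && __ARM_ARCH >= 8
    /* __ARM_ARCH is 8 but __aarch64__ is absent: the case from this issue. */
    return TARGET_ARMV8_AARCH32;
#elif defined(__ARM_ARCH) && __ARM_ARCH == 7
    return TARGET_ARMV7;
#elif defined(__arm__) || defined(__ARM_ARCH)
    return TARGET_ARM_PRE_V7;
#else
    return TARGET_NOT_ARM;
#endif
}

/* Human-readable name for a classification result. */
const char *target_name(enum arm_target t)
{
    static const char *const names[TARGET_COUNT] = {
        "not ARM", "ARM (pre-v7)", "ARMv7", "ARMv8 (AArch32)", "AArch64"
    };
    assert((int) t >= 0 && (int) t < TARGET_COUNT);
    return names[t];
}
```

Calling `target_name(classify_target())` in a small test program is a quick way to confirm which branch a given toolchain and flag combination selects.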

Then there are some features that are conditional on __aarch64__, but which are available on ARMv8 in 32-bit mode. For example, around line 4000, there is this:

// https://msdn.microsoft.com/en-us/library/vstudio/xdc42k5e(v=vs.100).aspx
// *NOTE*. The default rounding mode on SSE is 'round to even', which ARMv7-A
// does not support! It is supported on ARMv8-A however.
FORCE_INLINE __m128i _mm_cvtps_epi32(__m128 a)
{
#if defined(__aarch64__)
    switch (_MM_GET_ROUNDING_MODE()) {
    case _MM_ROUND_NEAREST:

I believe this would be better if conditional on __ARM_FEATURE_DIRECTED_ROUNDING instead of __aarch64__, since it is available in 32-bit execution mode on ARMv8. There are probably other examples of things like this. The directed rounding intrinsics are not available for 32-bit ARM targets on any version of GCC as far as I can tell, but are available in clang back to version 8 at least (tested on Godbolt Compiler Explorer), and on GCC for aarch64 targets (and __ARM_FEATURE_DIRECTED_ROUNDING is properly set to indicate support).
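A minimal sketch of that idea follows, assuming the compiler exposes `vrndnq_f32` whenever `__ARM_FEATURE_DIRECTED_ROUNDING` is defined (as ACLE describes). The function names and the `HAVE_DIRECTED_ROUNDING` macro are hypothetical; this is not the sse2neon implementation of `_mm_cvtps_epi32`, and it only covers the round-to-nearest-even case. The scalar fallback assumes inputs that fit in an `int32_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gate on the feature macro rather than on __aarch64__. */
#if defined(__ARM_NEON) && defined(__ARM_FEATURE_DIRECTED_ROUNDING)
#include <arm_neon.h>
#define HAVE_DIRECTED_ROUNDING 1
#else
#define HAVE_DIRECTED_ROUNDING 0
#endif

/* Portable fallback: round to nearest, ties to even (in-range inputs only). */
int32_t round_half_even_scalar(float x)
{
    int32_t t = (int32_t) x;    /* truncates toward zero */
    float frac = x - (float) t; /* exact for in-range inputs */
    if (frac > 0.5f || (frac == 0.5f && (t & 1)))
        return t + 1;
    if (frac < -0.5f || (frac == -0.5f && (t & 1)))
        return t - 1;
    return t;
}

/* Convert four floats to int32 using round-to-nearest-even. */
void cvt4_round_nearest_even(const float in[4], int32_t out[4])
{
    assert(in != NULL && out != NULL);
#if HAVE_DIRECTED_ROUNDING
    /* vrndnq_f32 rounds to nearest even; the result is already integral,
       so the truncating conversion does not change the value. */
    float32x4_t r = vrndnq_f32(vld1q_f32(in));
    vst1q_s32(out, vcvtq_s32_f32(r));
#else
    for (int i = 0; i < 4; i++)
        out[i] = round_half_even_scalar(in[i]);
#endif
}
```

Keying the fast path on the feature macro means the same source picks it up on AArch64 and on 32-bit ARMv8 builds where the compiler advertises support, and silently uses the fallback elsewhere.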

Jim Huang · Answer 1 · Sun Mar 13 2022 04:37:44 GMT+0800 (China Standard Time)

Thanks, @jonathanhue, for pointing this out. Can you send some pull requests that enable 32-bit support for ARMv8-A based cores? Confirming the changes will definitely require a diverse set of tests.

I am considering whether it is feasible to extend the existing GitHub Actions workflows to cover this scenario.

jonathanhue · Answer 2 · Sun Mar 13 2022 06:02:16 GMT+0800 (China Standard Time)

Sure, expect something in the next day or two.

jonathanhue · Answer 3 · Tue Mar 15 2022 01:19:32 GMT+0800 (China Standard Time)

To identify the new intrinsics available on ARMv8 while in 32-bit execution mode, I went to this page: https://arm-software.github.io/acle/neon_intrinsics/advsimd.html and then did a grep for anything that was listed as "A32/A64" (i.e. not available for v7). That gave me 582 intrinsics, of which 578 were unique.

Because there are so many new intrinsics, the approach I would take is to find everything that is currently conditional on __aarch64__, and see if the NEON intrinsics used are available on A32. Those would be the candidates that could be changed from being conditional on __aarch64__ to being conditional on __ARM_ARCH == 8, or some other preprocessor macro such as __ARM_FEATURE_DIRECTED_ROUNDING instead. In theory, this wouldn't break anything, since currently __ARM_ARCH is going to be 8 if __aarch64__ is defined, and 32-bit on ARMv8 wasn't supported before.

a32new_uniq.txt

Attached file is the list of 578 new intrinsics available for A32. Any feedback on this approach is welcome.

Edit: Also, this is made more complicated by the fact that GCC doesn't support many of the new intrinsics when compiling for 32-bit, while clang does.

jonathanhue · Answer 4 · Tue Mar 15 2022 09:10:02 GMT+0800 (China Standard Time)

I think a low-risk first step would be to enable use of the new intrinsics whose support is indicated by one of the __ARM_FEATURE_XXX macros. Compared to compiling for ARMv7, compiling for 32-bit ARMv8 adds the following with clang:

#define __ARM_FEATURE_CRC32 1
#define __ARM_FEATURE_DIRECTED_ROUNDING 1
#define __ARM_FEATURE_IDIV 1
#define __ARM_FEATURE_NUMERIC_MAXMIN 1

GCC adds:

#define __ARM_FEATURE_IDIV 1
#define __ARM_FEATURE_NUMERIC_MAXMIN 1

and only defines __ARM_FEATURE_CRC32 if the -march includes it, i.e. -march=armv8-a+crc. CRC is enabled by default for armv8 on clang, and requires -mnocrc to disable it.

For both gcc and clang, specifying "-mfpu=crypto-neon-fp-armv8" enables __ARM_FEATURE_CRYPTO.
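To check what a particular compiler and flag combination actually reports, a small probe along these lines can be compiled for the target. It is only a sketch: the `FEAT_*` constants and function names are hypothetical, and it tests just the macros listed in this comment.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical bit assignments for the feature macros discussed above. */
enum {
    FEAT_CRC32 = 1u << 0,
    FEAT_DIRECTED_ROUNDING = 1u << 1,
    FEAT_IDIV = 1u << 2,
    FEAT_NUMERIC_MAXMIN = 1u << 3,
    FEAT_CRYPTO = 1u << 4,
    FEAT_ALL = (1u << 5) - 1u
};

/* Collect the ACLE feature macros that the compiler defined for this build. */
unsigned arm_feature_mask(void)
{
    unsigned mask = 0u;
#if defined(__ARM_FEATURE_CRC32)
    mask |= FEAT_CRC32;
#endif
#if defined(__ARM_FEATURE_DIRECTED_ROUNDING)
    mask |= FEAT_DIRECTED_ROUNDING;
#endif
#if defined(__ARM_FEATURE_IDIV)
    mask |= FEAT_IDIV;
#endif
#if defined(__ARM_FEATURE_NUMERIC_MAXMIN)
    mask |= FEAT_NUMERIC_MAXMIN;
#endif
#if defined(__ARM_FEATURE_CRYPTO)
    mask |= FEAT_CRYPTO;
#endif
    return mask;
}

/* Print one line per feature so builds can be compared side by side. */
void print_arm_features(FILE *out)
{
    unsigned mask = arm_feature_mask();
    assert(out != NULL);
    fprintf(out, "CRC32:             %d\n", (mask & FEAT_CRC32) != 0);
    fprintf(out, "DIRECTED_ROUNDING: %d\n", (mask & FEAT_DIRECTED_ROUNDING) != 0);
    fprintf(out, "IDIV:              %d\n", (mask & FEAT_IDIV) != 0);
    fprintf(out, "NUMERIC_MAXMIN:    %d\n", (mask & FEAT_NUMERIC_MAXMIN) != 0);
    fprintf(out, "CRYPTO:            %d\n", (mask & FEAT_CRYPTO) != 0);
}
```

Calling `print_arm_features(stdout)` from builds made with different `-march` and `-mfpu` settings gives the same information as the `-dM -E` dump, in a form that can be logged from a test binary.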

jonathanhue · Answer 5 · Wed Mar 16 2022 06:19:35 GMT+0800 (China Standard Time)

I've made the changes to utilize the new rounding and CRC intrinsics available for 32-bit targets on ARMv8. The directed rounding intrinsics help quite a bit; I have a function that now processes samples at a 35% higher rate. I did run into a problem with older versions of GCC and an include file that GCC itself supplies.

The CRC intrinsics are declared in a separate include file (arm_acle.h), and I want to include it in sse2neon.h so the SSE CRC intrinsics such as _mm_crc32_u16 can call them. Due to this bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81497, some older versions of GCC will fail to compile without the -fpermissive flag. It is fixed in 8.2 and later versions, but broken in 8.1 and 7.X. So I'm not sure what to do. I'm inclined to leave it as-is, rather than disable their use when a broken compiler is detected, so that the user can take advantage of them by using the -fpermissive flag.

Jim Huang · Answer 6 · Thu Jun 02 2022 11:37:07 GMT+0800 (China Standard Time)

@jonathanhue, regarding GCC versions prior to 8.2, what fix would you propose?

jonathanhue · Answer 7 · Sun Aug 21 2022 02:17:36 GMT+0800 (China Standard Time)

I had to go back and review the problem. The problem is isolated to 32-bit ARMv8 builds. The bug doesn't exist when compiling for 64-bit, nor on ARMv7 since ARMv7 doesn't have the CRC instructions. So previously supported platforms aren't affected by this.

I don't have a great solution, but there is probably a less bad one than the current behavior. Currently, on versions of GCC prior to 8.2, the #include of <arm_acle.h> results in a compiler error unless "-fpermissive" is used. This prevents users who either 1) are not using the _mm_crc32_u*() intrinsics, or 2) are using them but are not performance sensitive, from compiling without -fpermissive. Compiling with -fpermissive may be undesirable for several reasons, so ideally it shouldn't be forced upon users.

I think a better behavior for the older compilers that cannot #include <arm_acle.h> without error would be to not include the header by default and to fall back to the C implementation. That way the code compiles without -fpermissive; users just don't get the improved performance. There should then be a preprocessor macro that users of the old compilers can define to include the header and get the accelerated implementation, knowing that they must also use -fpermissive when compiling code that #includes <arm_acle.h> (which can be isolated to a small file if desired).
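The proposal above might look roughly like the following sketch. All `EXAMPLE_*` macro names and both function names are hypothetical, not what sse2neon uses; the version check encodes only what this thread states (broken in GCC 7.x and 8.1 for 32-bit ARMv8, fixed in 8.2). The software path is a plain bitwise CRC-32C, the polynomial that `_mm_crc32_u8` and `__crc32cb` compute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC (not clang) older than 8.2 on a 32-bit target: assume <arm_acle.h>
   cannot be included without -fpermissive. */
#if defined(__GNUC__) && !defined(__clang__) && !defined(__aarch64__) && \
    (__GNUC__ < 8 || (__GNUC__ == 8 && __GNUC_MINOR__ < 2))
#define EXAMPLE_ACLE_HEADER_BROKEN 1
#else
#define EXAMPLE_ACLE_HEADER_BROKEN 0
#endif

/* Use the hardware path only when it is safe, or when the user opts in by
   defining the hypothetical EXAMPLE_FORCE_ARM_ACLE (and adds -fpermissive). */
#if defined(__ARM_FEATURE_CRC32) && \
    (!EXAMPLE_ACLE_HEADER_BROKEN || defined(EXAMPLE_FORCE_ARM_ACLE))
#include <arm_acle.h>
#define EXAMPLE_USE_HW_CRC 1
#else
#define EXAMPLE_USE_HW_CRC 0
#endif

/* One CRC-32C step over a single byte, with no init or final XOR. */
uint32_t example_crc32c_u8(uint32_t crc, uint8_t v)
{
#if EXAMPLE_USE_HW_CRC
    return __crc32cb(crc, v);
#else
    crc ^= v;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc & 1u) ? (crc >> 1) ^ UINT32_C(0x82F63B78) : (crc >> 1);
    return crc;
#endif
}

/* Standard CRC-32C of a buffer (init and final XOR of 0xFFFFFFFF). */
uint32_t example_crc32c(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *) data;
    uint32_t crc = UINT32_C(0xFFFFFFFF);
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        crc = example_crc32c_u8(crc, p[i]);
    return crc ^ UINT32_C(0xFFFFFFFF);
}
```

With this layout, users on an affected compiler get working code by default, and only the translation unit that defines the opt-in macro needs -fpermissive.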

Jim Huang · Answer 8 · Tue Oct 11 2022 04:58:46 GMT+0800 (China Standard Time)

By the way, shall we support __ARM_FEATURE_SIMD32?
The ARM C Language Extensions specification describes the __ARM_FEATURE_SIMD32 macro as follows:

__ARM_FEATURE_SIMD32 is defined to 1 if the 32-bit SIMD instructions are supported and the intrinsics defined in 9.5 are available. This also implies support for the GE global flags which indicate byte-by-byte comparison results.

That is, #if defined(__ARM_NEON) would become #if defined(__ARM_FEATURE_SIMD32) || defined(__ARM_NEON).

Cuda Chen · Answer 9 · Sat Nov 18 2023 12:52:22 GMT+0800 (China Standard Time)

Hi @jserv and @jonathanhue,

For the ARMv8-A 32-bit CI build, maybe we can use Cortex-A32, as this processor only supports AArch32 (the idea is inspired by this simde issue).

You may read this commit in my fork. I would like to use it as an entry point for solving this issue if both of you consider the solution feasible.

Jim Huang · Answer 10 · Sat Nov 18 2023 13:02:06 GMT+0800 (China Standard Time)

I am not sure if ARMV8_A_32BIT is a good name to refer to the 32-bit ISA in ARMv8-A. Would it be better to specify A32 instead?

Wikipedia defines AArch32 as a 32-bit architecture, while A32 is commonly associated with an instruction set.
See also: Distinguishing between 32-bit and 64-bit A64 instructions

Cuda Chen · Answer 11 · Sat Nov 18 2023 15:44:34 GMT+0800 (China Standard Time)

Hi @jserv,

Thanks for the information. I think the so-called "ARMv8 in 32-bit execution mode" should refer to the A32 instruction set.

Jim Huang · Answer 12 · Sat Nov 18 2023 21:39:34 GMT+0800 (China Standard Time)

Then you can add A32-specific items to the CI pipeline so that we can validate this.

Cuda Chen · Answer 13 · Sat Nov 18 2023 21:45:24 GMT+0800 (China Standard Time)

Got it. I will start working on a PR for this.

Cuda Chen · Answer 14 · Sun Dec 03 2023 11:57:23 GMT+0800 (China Standard Time)

Closing this, as it was completed in #620.