Compiling arm run time libraries based on newlib selects inferior memcpy-stub.c instead of memcpy-armv7a.S

Question

klaus1212 opened this issue 2 months ago

Hi,

I am using the 18.1.3 release of LLVM and LLVM-embedded-toolchain-for-Arm, together with newlib-4.3.0.
I am building the following library variants for a bare-metal 32-bit Arm Cortex-A9 target on Ubuntu.
For reference, I have attached the git patch with the changes needed for this setup:
LLVM-embedded-toolchain-for-Arm.patch

add_library_variants_for_cpu(
    armv7a
    SUFFIX hard_neon
    COMPILE_FLAGS "-mfloat-abi=hard -march=armv7a -mfpu=neon"
    MULTILIB_FLAGS "--target=armv7-none-unknown-eabihf -mfpu=neon"
    QEMU_MACHINE "none"
    QEMU_CPU "cortex-a8"
    QEMU_PARAMS "-m 1G"
    BOOT_FLASH_ADDRESS 0x00000000
    BOOT_FLASH_SIZE 0x1000
    FLASH_ADDRESS 0x20000000
    FLASH_SIZE 0x1000000
    RAM_ADDRESS 0x21000000
    RAM_SIZE 0x1000000
    STACK_SIZE 4K
)
add_library_variants_for_cpu(
    armv7a
    SUFFIX thumb_hard_neon
    COMPILE_FLAGS "-mfloat-abi=hard -march=armv7a -mfpu=neon"
    MULTILIB_FLAGS "--target=thumbv7-none-unknown-eabihf -mfpu=neon"
    QEMU_MACHINE "none"
    QEMU_CPU "cortex-a8"
    QEMU_PARAMS "-m 1G"
    BOOT_FLASH_ADDRESS 0x00000000
    BOOT_FLASH_SIZE 0x1000
    FLASH_ADDRESS 0x20000000
    FLASH_SIZE 0x1000000
    RAM_ADDRESS 0x21000000
    RAM_SIZE 0x1000000
    STACK_SIZE 4K
)

My problem is that the run-time libraries seem to select an inferior memcpy implementation from newlib:
memcpy-stub.c.

I determined that it is inferior by testing memcpy.
With our current setup (gcc and newlib-based run-time libraries), memcpy is almost 6x faster than in an identical test with clang and newlib-based run-time libraries.
This appears to be because our gcc-built run-time libraries select #include "memcpy-armv7a.S" instead of the memcpy-stub.c selected in our clang-built run-time libraries.
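
A comparison like the one described can be sketched as a small timing harness. This is a minimal illustration, not the test actually used here: the function name time_memcpy is hypothetical, and it assumes a working clock(), malloc() and free(), which a bare-metal target may not provide (a hardware timer or cycle counter would be needed instead).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Call through a volatile function pointer so the compiler cannot inline
   or drop the copy; the library's memcpy is what gets measured. */
static void *(*volatile copy_fn)(void *, const void *, size_t) = memcpy;

/* Hypothetical helper: copies `size` bytes `iterations` times and returns
   the elapsed processor time in seconds. Returns -1.0 if allocation fails,
   the clock is unavailable, or the destination does not match the source. */
double time_memcpy(size_t size, unsigned iterations)
{
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);
    memset(dst, 0x00, size);

    clock_t start = clock();
    for (unsigned i = 0; i < iterations; ++i)
        copy_fn(dst, src, size);
    clock_t end = clock();

    /* Sanity check that the copy really happened. */
    int copied = (memcmp(dst, src, size) == 0);
    free(src);
    free(dst);

    if (!copied || start == (clock_t)-1 || end == (clock_t)-1)
        return -1.0;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running the same call, for example time_memcpy(1 << 20, 1000), against both the gcc-built and the clang-built libraries gives two times whose ratio can be compared with the figure above.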

I have identified why memcpy-stub.c is selected, as follows:
memcpy-stub.c is selected by
repos\newlib\newlib\libc\machine\arm\memcpy.S

Basically because
repos\newlib\newlib\include\arm-acle-compat.h is included with __ARM_ARCH already defined.

When arm-acle-compat.h is included with __ARM_ARCH already defined,
repos\newlib\newlib\include\arm-acle-compat.h does not define __ARM_FEATURE_UNALIGNED.

When __ARM_FEATURE_UNALIGNED is not defined
repos\newlib\newlib\libc\machine\arm\memcpy.S does not
#include "memcpy-armv7a.S"

but instead includes
memcpy-stub.c
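
The chain above can be sketched in C. This is a simplified illustration, not a copy of newlib's memcpy.S: the enum and function names are hypothetical, and the __ARM_ARCH >= 7 check is an assumption about the real condition. Only the dependence on __ARM_FEATURE_UNALIGNED is taken from the analysis above.

```c
/* Hypothetical names for the two outcomes discussed in this thread. */
enum memcpy_variant {
    MEMCPY_STUB,    /* generic C implementation, memcpy-stub.c */
    MEMCPY_ARMV7A   /* hand-written assembly, memcpy-armv7a.S */
};

/* Reports which implementation the predefined macros would lead to.
   Assumption: the optimized version additionally requires __ARM_ARCH >= 7. */
enum memcpy_variant selected_memcpy(void)
{
#if defined(__ARM_ARCH) && (__ARM_ARCH >= 7) && defined(__ARM_FEATURE_UNALIGNED)
    return MEMCPY_ARMV7A;
#else
    /* __ARM_FEATURE_UNALIGNED is missing: fall back to the stub. */
    return MEMCPY_STUB;
#endif
}

/* Maps the result to the file name used in this thread. */
const char *memcpy_variant_name(enum memcpy_variant v)
{
    return v == MEMCPY_ARMV7A ? "memcpy-armv7a.S" : "memcpy-stub.c";
}
```

Compiling this with the same COMPILE_FLAGS as the library variant and printing memcpy_variant_name(selected_memcpy()) shows which branch that build would take.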

I have not been able to determine for sure what defines __ARM_ARCH, which is key to all of this.
I am therefore hoping someone here knows whether the above behavior is intentional, or how I can set up my run-time libraries to use memcpy-armv7a.S.

Peter Smith · Answer 1 · Mon Jun 03 2024 21:21:43 GMT+0800 (China Standard Time)

The __ARM_ARCH macro is defined by the compiler in https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/ARM.cpp#L740
The __ARM_FEATURE_UNALIGNED macro is defined by the compiler in https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/ARM.cpp#L787 but only when -munaligned-access is selected.

So it looks like you may need to add -munaligned-access to the compilation flags for the library. I'm not sure at the moment whether adding it to the multilib flags will work, as I'm not sure unaligned access can currently be used as a parameter for multilib selection.
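
One way to confirm that the flag reaches the library build is a small probe compiled with exactly the same flags. This is a sketch only; the struct and function names are hypothetical.

```c
/* Hypothetical probe: records what the compiler predefines for the
   current target and flags. */
struct arm_macro_probe {
    int arm_arch;           /* value of __ARM_ARCH, or 0 if it is not defined */
    int feature_unaligned;  /* 1 if __ARM_FEATURE_UNALIGNED is defined, else 0 */
};

struct arm_macro_probe probe_arm_macros(void)
{
    struct arm_macro_probe p = {0, 0};
#ifdef __ARM_ARCH
    p.arm_arch = __ARM_ARCH;
#endif
#ifdef __ARM_FEATURE_UNALIGNED
    p.feature_unaligned = 1;
#endif
    return p;
}
```

If feature_unaligned is still 0 after adding -munaligned-access to COMPILE_FLAGS, the flag is not reaching the compiler for that variant.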

For a general toolchain, if we only have one library variant for unaligned access, I think we'd want to compile without unaligned access for maximum compatibility (many systems disable unaligned access).

klaus1212 · Answer 2 · Tue Jun 04 2024 22:31:35 GMT+0800 (China Standard Time)

I have tried setting -munaligned-access. It improves things, since we are then no longer running the implementation in memcpy-stub.c.

However, libc.a now contains both

libc_a-aeabi_memcpy-armv7a.o, which DOES NOT use the Arm vectorization instructions, and
libc_a-memcpy.o, which DOES use the Arm vectorization instructions.

When compiling our source code against the above run-time library (libc) using clang, clang somehow maps our memcpy calls to __aeabi_memcpy (libc_a-aeabi_memcpy-armv7a.o). This is a problem because __aeabi_memcpy DOES NOT use the Arm vectorization instructions and is therefore slower than it could be.

I have raised the issue on the LLVM forum; see https://discourse.llvm.org/t/setting-mcpu-cortex-a9-mfpu-neon-for-arm-target-does-not-make-clang-pick-memcpy-optimized-for-the-co-processor/79336/5