DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

_mm_movemask_epi8 regression

jacksonrnewhouse opened this issue

The aarch64 code path for _mm_movemask_epi8 introduced in #50 looks to be a regression when you actually compile it. The default behavior compiles to 7 instructions with no constants, while the "fast path" is 14 instructions plus a constant. Should it be reverted?

fast path: https://godbolt.org/z/41s54d
default: https://godbolt.org/z/xsYfz8
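
For readers comparing the two listings, it may help to recall what both code paths have to compute: bit i of the result is the most significant bit of byte i of the input. The following is a minimal portable sketch of that behavior in scalar C. The name movemask_epi8_ref is hypothetical and is not part of sse2neon; it is only a reference to check either path against, not the library's implementation.

```c
#include <stdint.h>

/* Scalar reference for the semantics of _mm_movemask_epi8:
 * bit i of the result is the most significant bit of byte i.
 * movemask_epi8_ref is a hypothetical name used for illustration. */
static int movemask_epi8_ref(const uint8_t bytes[16])
{
    int mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (bytes[i] >> 7) << i;  /* top bit of byte i goes to bit i */
    return mask;
}
```

Feeding the same 16 bytes to this function and to either NEON path should give identical results, which is a cheap sanity check before timing anything.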

I'll run a simple timing experiment on _mm_movemask_epi8.

I ran the experiment on an ARMv8-A (64-bit Arm) CPU, with the code compiled at optimization level 0.
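
The exact benchmark code is not shown in this thread, so the following is only a sketch of how such a timing experiment could be set up, assuming a simple loop timed with the standard clock() function. The names movemask_fn, movemask_placeholder and time_movemask are hypothetical; the placeholder routine exists only so the sketch compiles on any target and would be replaced by a wrapper around each code path being compared.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by the code paths being compared. */
typedef int (*movemask_fn)(const uint8_t bytes[16]);

/* Placeholder routine so the sketch compiles anywhere; swap in a
 * wrapper around each real code path when measuring. */
static int movemask_placeholder(const uint8_t bytes[16])
{
    int mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (bytes[i] >> 7) << i;
    return mask;
}

/* Average CPU time per call in nanoseconds, measured with clock(). */
static double time_movemask(movemask_fn fn, long iters)
{
    uint8_t bytes[16] = {0};
    volatile int sink = 0;  /* keeps the calls from being optimized away */

    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        bytes[i & 15] ^= 0x80;  /* vary the input between calls */
        sink ^= fn(bytes);
    }
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Calling time_movemask once per code path with the same iteration count, and building the program at each optimization level of interest, would give comparable per-call figures. The loop overhead is included in the result, so only the difference between paths is meaningful.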

[Graph: _mm_movemask_epi8 timing at optimization level 0]

It turns out that the aarch64 code path is indeed slower.
We should revert it for performance reasons.

I also measured the performance at optimization level 3.

[Graph: _mm_movemask_epi8 timing at optimization level 3]

At this level, the performance of the two code paths does not differ much.
@jserv I think we need to decide which optimization level to focus on for future performance improvements.