Jackey-Huo / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

sse2neon

A C/C++ header file that converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics.

Introduction

sse2neon is a translator of Intel SSE (Streaming SIMD Extensions) intrinsics to Arm NEON. It shortens the time needed to get a working program on Arm, which can then be used to extract profiles and identify hot paths in the code. The header file sse2neon.h contains several of the functions provided by Intel intrinsic headers such as <xmmintrin.h>, implemented with NEON-based counterparts to produce the exact semantics of the intrinsics.

Mapping and Coverage

Header file      Extension
<mmintrin.h>     MMX
<xmmintrin.h>    SSE
<emmintrin.h>    SSE2
<pmmintrin.h>    SSE3
<tmmintrin.h>    SSSE3
<smmintrin.h>    SSE4.1
<nmmintrin.h>    SSE4.2
<wmmintrin.h>    AES

sse2neon aims to support the SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AES extensions.

sse2neon aims to provide NEON equivalents for all widely used SSE intrinsics. Be aware that some SSE intrinsics map directly to a single NEON intrinsic, while others lack a 1-to-1 mapping, which means their equivalents are implemented with several NEON intrinsics.

For example, SSE intrinsic _mm_loadu_si128 has a direct NEON mapping (vld1q_s32), but SSE intrinsic _mm_maddubs_epi16 has to be implemented with 13+ NEON instructions.
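
To make the second case concrete, here is a scalar sketch of what _mm_maddubs_epi16 computes per the Intel Intrinsics Guide: unsigned bytes from the first operand are multiplied by signed bytes from the second, and adjacent products are added with signed saturation. The function names maddubs_epi16_ref and saturate_i16 are hypothetical and not part of sse2neon; the point is only to show how many distinct steps (widening, mixed-sign multiply, pairwise add, saturate) a NEON version has to reproduce.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the int16_t range (signed saturation). */
static int16_t saturate_i16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t) v;
}

/* Scalar model of _mm_maddubs_epi16 (hypothetical helper, not sse2neon code).
 * a: 16 unsigned bytes, b: 16 signed bytes, out: 8 signed 16-bit results. */
static void maddubs_epi16_ref(const uint8_t a[16],
                              const int8_t b[16],
                              int16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        /* Widen before multiplying so the products cannot overflow. */
        int32_t even = (int32_t) a[2 * i] * (int32_t) b[2 * i];
        int32_t odd = (int32_t) a[2 * i + 1] * (int32_t) b[2 * i + 1];
        /* Adjacent products are summed, then saturated to 16 bits. */
        out[i] = saturate_i16(even + odd);
    }
}
```

A reference like this is also a convenient oracle when checking a NEON implementation lane by lane.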

Usage

  • Put the file sse2neon.h into your source code directory.

  • Locate the following SSE header files included in the code:

#include <xmmintrin.h>
#include <emmintrin.h>

{p,t,s,n,w}mmintrin.h can be replaced as well, but coverage of these extensions may be limited.

  • Replace them with:
#include "sse2neon.h"
  • Explicitly specify platform-specific options to gcc/clang compilers.
    • On ARMv8-A targets, you should specify the following compiler option: (Remove crypto and/or crc if your architecture does not support cryptographic and/or CRC32 extensions)
    -march=armv8-a+fp+simd+crypto+crc
    • On ARMv7-A targets, you need to append the following compiler option:
    -mfpu=neon
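
If the same source must still build on x86, one possible arrangement is to select the header at compile time. The sketch below assumes that sse2neon.h sits next to the source file and that the compiler defines __ARM_NEON (or the older __ARM_NEON__) on NEON-capable Arm targets; add4 is a hypothetical helper used only for illustration.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include "sse2neon.h" /* Arm: SSE intrinsics implemented with NEON */
#else
#include <xmmintrin.h> /* x86: the original SSE header */
#include <emmintrin.h> /* x86: the original SSE2 header */
#endif

/* Add four floats at once using SSE intrinsics (hypothetical example).
 * The unaligned load/store variants are used, so a, b and out need no
 * particular alignment. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

The body of add4 is identical on both architectures; only the include differs.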

Compile-time Configurations

To balance correctness and performance, sse2neon recognizes the following compile-time configurations:

  • SSE2NEON_PRECISE_MINMAX: Enable the precise implementation of _mm_min_ps and _mm_max_ps. Turned off by default. If you need consistent results in special cases such as NaN, define the macro as 1 before including sse2neon.h.
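
The NaN case is easiest to see with a one-lane scalar model. On x86, MINPS returns its second operand whenever the comparison is false, which includes every comparison involving NaN. A NaN-propagating minimum, which is how I understand NEON's floating-point minimum to behave, returns NaN instead. Whether the default sse2neon path matches the second function exactly is an assumption here, and both function names are hypothetical.

```c
#include <math.h>

/* To opt in to the precise variants, define the macro before the include:
 *     #define SSE2NEON_PRECISE_MINMAX 1
 *     #include "sse2neon.h"
 */

/* One lane of x86 MINPS: if a < b is false (including when either
 * input is NaN), the second operand is returned. */
static float x86_min_lane(float a, float b)
{
    return a < b ? a : b;
}

/* One lane of a NaN-propagating minimum (assumed model of a plain
 * NEON minimum): any NaN input yields NaN. */
static float nan_propagating_min_lane(float a, float b)
{
    if (isnan(a) || isnan(b))
        return NAN;
    return a < b ? a : b;
}
```

The two models agree on ordinary inputs and differ only when exactly one operand is NaN and it is the first one, which is the kind of special case the option exists for.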

Run Built-in Test Suite

sse2neon provides a unified interface for developing test cases. The test cases are located in the tests directory, and the input data is specified at runtime. Use the following command to run the test cases:

$ make check

You can also specify a GNU toolchain for cross compilation. QEMU must be installed in advance.

$ make CROSS_COMPILE=aarch64-linux-gnu- check # ARMv8-A

or

$ make CROSS_COMPILE=arm-linux-gnueabihf- check # ARMv7-A

⚠️ Warning: The test suite assumes a little-endian architecture.

Add More Test Items

Once the conversion is implemented, the test can be added with the following steps:

  • File tests/impl.h

    Add the intrinsic to enum InstructionTest. The naming convention is IT_MM_XXX. Place it in the correct category, in alphabetical order. The categories follow the Intel Intrinsics Guide.

  • File tests/impl.cpp

    • For the test name generation:

      Add the corresponding switch-case in getInstructionTestString().

      case IT_MM_XXX:
          ret = "MM_XXX";
          break;
    • For running the test:

      Add the corresponding switch-case in runSingleTest().

      case IT_MM_XXX:
          ret = test_mm_xxx();
          break;
    • The test implementation:

      bool test_mm_xxx()
      {
          // The C implementation
          ...
      
          // The Neon implementation
          ret = _mm_xxx();
      
          // Compare the result of two implementations and return it
          ...
      }
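
As a concrete illustration of the three pieces above, here is a reduced, self-contained sketch for _mm_add_ps. It is not the actual content of tests/impl.h or tests/impl.cpp: the function signatures are simplified assumptions, the enum holds a single entry, the input data is fixed rather than supplied at runtime, and the header selection assumes sse2neon.h is available when building on Arm.

```c
#include <stdbool.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include "sse2neon.h"
#else
#include <xmmintrin.h>
#endif

/* Reduced stand-in for the enum in tests/impl.h (one entry only). */
enum InstructionTest { IT_MM_ADD_PS, IT_LAST };

/* Test name generation: one switch-case per intrinsic. */
static const char *getInstructionTestString(enum InstructionTest test)
{
    const char *ret = "";
    switch (test) {
    case IT_MM_ADD_PS:
        ret = "MM_ADD_PS";
        break;
    case IT_LAST:
        break;
    }
    return ret;
}

/* The test implementation: compare a plain C result with the intrinsic. */
static bool test_mm_add_ps(void)
{
    const float a[4] = {1.5f, -2.0f, 3.25f, 0.0f};
    const float b[4] = {2.25f, 4.0f, -3.25f, 8.0f};
    float expected[4], actual[4];

    /* The C implementation */
    for (int i = 0; i < 4; i++)
        expected[i] = a[i] + b[i];

    /* The intrinsic under test (NEON-backed when built with sse2neon) */
    _mm_storeu_ps(actual, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));

    /* Compare the results of the two implementations */
    return memcmp(expected, actual, sizeof(expected)) == 0;
}

/* Running the test: one switch-case per intrinsic. */
static bool runSingleTest(enum InstructionTest test)
{
    bool ret = false;
    switch (test) {
    case IT_MM_ADD_PS:
        ret = test_mm_add_ps();
        break;
    case IT_LAST:
        break;
    }
    return ret;
}
```

The inputs are chosen so that every sum is exactly representable, which makes a bitwise comparison safe; tests for intrinsics with rounding or NaN behavior would need a more careful comparison.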

Coding Convention

Use the command $ make indent to follow the coding convention.

Adoptions

Here is a partial list of open source projects that have adopted sse2neon for Arm/Aarch64 support.

  • Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
  • dab-cmdline provides entries for the functionality to handle Digital audio broadcasting (DAB)/DAB+ through some simple calls.
  • FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers.
  • parallel-n64 is an optimized/rewritten Nintendo 64 emulator made specifically for Libretro.
  • libscapi stands for the "Secure Computation API", providing reliable, efficient, and highly flexible cryptographic infrastructure.
  • MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets.
  • OBS Studio is software designed for capturing, compositing, encoding, recording, and streaming video content, efficiently.
  • OpenXRay is an improved version of the X-Ray engine, used in world famous S.T.A.L.K.E.R. game series by GSC Game World.
  • Pygame is cross-platform and designed to make it easy to write multimedia software, such as games, in Python.
  • srsLTE is an open source SDR LTE software suite.
  • Surge is an open source digital synthesizer.
  • XMRig is an open source CPU miner for Monero cryptocurrency.

Related Projects

Reference

Licensing

sse2neon is freely redistributable under the MIT License.
