DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

Missing _mm_sad_pu8

jserv opened this issue

_mm_sad_pu8 computes the absolute differences of the packed unsigned 8-bit integers in a and b, horizontally sums the eight differences into a single unsigned 16-bit integer, and stores that sum in the low 16 bits of dst. The upper 48 bits of dst are zeroed.
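For checking a NEON port against the expected result, the semantics can be sketched in portable scalar C. This is only an illustrative model: the name `sad_pu8_scalar` is hypothetical and not part of sse2neon, and it assumes each `__m64` operand is passed as a plain 64-bit integer with lane 0 in the lowest byte.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of _mm_sad_pu8 (not part of sse2neon).
 * a and b each hold eight unsigned 8-bit lanes, lane 0 in the low byte. */
static uint64_t sad_pu8_scalar(uint64_t a, uint64_t b)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++) {
        int x = (int) ((a >> (8 * i)) & 0xff);
        int y = (int) ((b >> (8 * i)) & 0xff);
        sum += (unsigned) abs(x - y); /* absolute difference of lane i */
    }
    /* The total is at most 8 * 255 = 2040, so it fits in bits 15:0;
     * bits 63:16 of the result are zero. */
    return (uint64_t) sum;
}
```

Comparing the low 16 bits of a NEON result with this model over random inputs, and checking that the upper 48 bits are zero, is one way to validate an implementation.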

Reference NEON implementation:

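__m64 _mm_sad_pu8(__m64 a, __m64 b)
{
    /* |a[i] - b[i]| per byte, then pairwise-widen to four 16-bit partial sums.
     * The vreinterpret_*_m64 helpers are sse2neon's casts to and from __m64. */
    uint16x4_t t =
        vpaddl_u8(vabd_u8(vreinterpret_u8_m64(a), vreinterpret_u8_m64(b)));
    /* Fold the four partial sums into one 16-bit total (at most 8 * 255). */
    uint16_t r0 = t[0] + t[1] + t[2] + t[3];
    /* Place the total in lane 0; the other three lanes stay zero. */
    return vreinterpret_m64_u16(vset_lane_u16(r0, vdup_n_u16(0), 0));
}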