Missing _mm_sad_pu8
jserv opened this issue
Jim Huang commented
_mm_sad_pu8 computes the absolute differences of the packed unsigned 8-bit integers in a and b, horizontally sums the eight differences into a single unsigned 16-bit integer, and stores that sum in the low 16 bits of dst. The upper 48 bits of dst are zeroed.
Reference NEON implementation:
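__m64 _mm_sad_pu8(__m64 a, __m64 b)
{
    /* |a[i] - b[i]| over the 8 byte lanes, then pairwise-widen to 4 x u16 */
    uint16x4_t t = vpaddl_u8(vabd_u8((uint8x8_t) a, (uint8x8_t) b));
    /* Add the four partial sums; the maximum is 8 * 255 = 2040 */
    uint16_t r0 = t[0] + t[1] + t[2] + t[3];
    /* Put the sum in lane 0 and leave the other three lanes zero */
    return (__m64) vset_lane_u16(r0, vdup_n_u16(0), 0);
}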
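
A scalar model may help when checking a NEON version against the expected semantics. The sketch below is only an illustration: the name sad_pu8_scalar is hypothetical, and a uint64_t stands in for __m64 so that it compiles without NEON or SSE headers. Byte lane i is assumed to occupy bits 8*i through 8*i+7.

```c
#include <stdint.h>

/* Hypothetical scalar model of _mm_sad_pu8.
 * Assumption: a uint64_t stands in for __m64, with byte lane i
 * stored in bits [8*i+7 : 8*i]. */
static uint64_t sad_pu8_scalar(uint64_t a, uint64_t b)
{
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t) (a >> (8 * i));
        uint8_t y = (uint8_t) (b >> (8 * i));
        /* Absolute difference of two unsigned bytes */
        sum = (uint16_t) (sum + (x > y ? x - y : y - x));
    }
    /* The sum is at most 8 * 255 = 2040, so it fits in the low
     * 16 bits; the upper 48 bits of the result stay zero. */
    return (uint64_t) sum;
}
```

For example, with a = 0x0102030405060708 and b = 0x0807060504030201 the per-byte differences are 7, 5, 3, 1, 1, 3, 5, 7, so the model returns 32.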