DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

simplify blendv functions?

JishinMaster opened this issue

Dear All,

Are we sure we need the extra "vshrq" in the blendv functions?

The current version:

FORCE_INLINE __m128 _mm_blendv_ps(__m128 _a, __m128 _b, __m128 _mask)
{
    // Use a signed shift right to create a mask with the sign bit
    uint32x4_t mask =
        vreinterpretq_u32_s32(vshrq_n_s32(vreinterpretq_s32_m128(_mask), 31));
    float32x4_t a = vreinterpretq_f32_m128(_a);
    float32x4_t b = vreinterpretq_f32_m128(_b);
    return vreinterpretq_m128_f32(vbslq_f32(mask, b, a));
}

The version I have used with no problems so far (I may be wrong!):

FORCE_INLINE __m128 _mm_blendv_ps(__m128 _a, __m128 _b, __m128 _mask)
{
    float32x4_t a = vreinterpretq_f32_m128(_a);
    float32x4_t b = vreinterpretq_f32_m128(_b);
    return vreinterpretq_m128_f32(vbslq_f32(vreinterpretq_u32_m128(_mask), b, a));
}

x86: _mm_blendv_ps

FOR j := 0 to 3
	i := j*32
	IF mask[i+31]
		dst[i+31:i] := b[i+31:i]
	ELSE
		dst[i+31:i] := a[i+31:i]
	FI
ENDFOR

The selection depends only on the most significant bit (the sign bit) of each 32-bit mask element.


arm: vbslq_f32

Bitwise Select. This instruction sets each bit in the destination SIMD&FP register
to the corresponding bit from the first source SIMD&FP register
when the original destination bit was 1, otherwise from the second source SIMD&FP register.

The selection is made independently for every bit of the mask.


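Therefore, the vshrq_n_s32 is necessary: the arithmetic shift right by 31 copies each lane's sign bit into all 32 bits of that lane, which turns the per-element condition of _mm_blendv_ps into the per-bit mask that vbslq_f32 expects.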
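
To make the difference concrete, here is a minimal scalar sketch of a single 32-bit lane. It does not use NEON or SSE at all; the helper names (bitwise_select_lane, broadcast_sign_lane, blendv_lane, blend_demo) are hypothetical and only model the documented behaviour of vbslq, vshrq_n_s32(x, 31) and _mm_blendv_ps, so treat it as an illustration rather than the sse2neon implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of vbslq: each result bit comes from b
 * where the mask bit is 1, and from a where the mask bit is 0. */
static uint32_t bitwise_select_lane(uint32_t mask, uint32_t b, uint32_t a)
{
    return (mask & b) | (~mask & a);
}

/* Scalar model of vshrq_n_s32(x, 31): replicate the sign bit into all
 * 32 bits. Written without a signed right shift to stay portable. */
static uint32_t broadcast_sign_lane(uint32_t mask)
{
    return (mask >> 31) ? 0xFFFFFFFFu : 0x00000000u;
}

/* Scalar model of one lane of _mm_blendv_ps: only mask bit 31 matters. */
static uint32_t blendv_lane(uint32_t a, uint32_t b, uint32_t mask)
{
    return bitwise_select_lane(broadcast_sign_lane(mask), b, a);
}

/* Hypothetical self-check using a mask with only the sign bit set. */
static void blend_demo(void)
{
    const uint32_t a = 0x11111111u, b = 0xAAAAAAAAu, mask = 0x80000000u;

    /* With the sign bit broadcast, the whole lane comes from b. */
    assert(blendv_lane(a, b, mask) == b);

    /* Without the broadcast, only bit 31 comes from b and the other
     * 31 bits come from a, so the result is neither a nor b. */
    assert(bitwise_select_lane(mask, b, a) == 0x91111111u);
}
```

Calling blend_demo() from a test program passes both assertions, which shows that a mask with only the sign bit set needs the broadcast step before a bitwise select gives the blendv result.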

Okay.
That would explain why it works for me, since I mostly use it after comparison instructions, which set every bit of a lane to 1 (0xFFFFFFFF) when true.
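
For reference, a small scalar sketch of that observation, assuming IEEE 754 floats. The helpers (select_bits, float_bits, compare_lt_lane, mask_demo) are hypothetical and model one 32-bit lane only; they are not part of sse2neon. A comparison-style mask is already all ones or all zeros, so the unshifted bitwise select picks whole lanes, whereas an ordinary negative float used as the mask does not.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 32-bit lane of a bitwise select: b where the mask bit is 1, else a. */
static uint32_t select_bits(uint32_t mask, uint32_t b, uint32_t a)
{
    return (mask & b) | (~mask & a);
}

/* Raw bit pattern of a float, as a mask register would hold it. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Model of one lane of a comparison result: all ones when true,
 * all zeros when false. */
static uint32_t compare_lt_lane(float x, float y)
{
    return (x < y) ? 0xFFFFFFFFu : 0x00000000u;
}

/* Hypothetical self-check contrasting the two kinds of mask. */
static void mask_demo(void)
{
    const uint32_t a = 0x11111111u, b = 0xAAAAAAAAu;

    /* Comparison masks: the unshifted select already picks whole lanes. */
    assert(select_bits(compare_lt_lane(1.0f, 2.0f), b, a) == b);
    assert(select_bits(compare_lt_lane(2.0f, 1.0f), b, a) == a);

    /* A plain negative float as the mask: the sign bit is set, but the
     * lane is not all ones, so the unshifted select mixes bits of a and b. */
    const uint32_t m = float_bits(-1.0f);
    assert((m >> 31) == 1u);
    assert(select_bits(m, b, a) != b);
    assert(select_bits(m, b, a) != a);
}
```

So the shorter version is only equivalent when every mask lane is known to be all ones or all zeros; _mm_blendv_ps itself makes no such assumption.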

Thanks