recp / cglm

📽 Highly Optimized 2D / 3D Graphics Math (glm) for C


Improve glm_quat_conjugate

gottfriedleibniz opened this issue · comments

The current implementation of quat_conjugate is quite slow when compiled with SSE. For reference, here is clang's output.

Included in this link are alternate implementations, one of which can be easily extended to WASM and Neon, e.g.,

  float32x4_t mask = glmm_float32x4_init(-1.0f, -1.0f, -1.0f, 1.0f);
  glmm_store(dest, vmulq_f32(glmm_load(q), mask));

@gottfriedleibniz nice suggestion, thanks.

To avoid the mul overhead (if there is no special optimization for -1), it would be nice to do this without mul, as in your implementations on Godbolt:

extern
void glm_quat_conjugate_simd(versor q, versor dest) {
#if 0
  __m128i mask = _mm_set_epi32(0, GLMM_NEGZEROf, GLMM_NEGZEROf, GLMM_NEGZEROf);
  glmm_store(dest, _mm_xor_ps(glmm_load(q), _mm_castsi128_ps(mask)));
#else
  __m128 mask = _mm_set_ps(1.0f, -1.0f, -1.0f, -1.0f);
  glmm_store(dest, _mm_mul_ps(glmm_load(q), mask));
#endif
}

By defining GLMM__SIGNMASKf or glmm_float32x4_SIGNMASK_NNNP for SSE, NEON and WASM ... we could write:

CGLM_INLINE
void
glm_quat_conjugate(versor q, versor dest) {
#if defined(CGLM_SIMD)
  glmm_store(dest, glmm_xor(glmm_load(q), glmm_float32x4_SIGNMASK_NNNP));
#else
  dest[0] = -q[0];
  dest[1] = -q[1];
  dest[2] = -q[2];
  dest[3] =  q[3];
#endif
}

Currently there is no glmm_xor for WASM; improving the glmm_ API would make things like this easier.

Seems good.

Although, I wouldn't be surprised if the scalar equivalent is faster or as-fast on ARM64 (and maybe ARMv7). Some timing here would be nice, but can be done later.

Thanks,

> Although, I wouldn't be surprised if the scalar equivalent is faster or as-fast on ARM64 (and maybe ARMv7). Some timing here would be nice, but can be done later.

Sure, SIMD can be ignored if there are no benefits on ARM (or maybe on other platforms too); as you said, a benchmark could be done ASAP.