recp / cglm

📽 Highly Optimized 2D / 3D Graphics Math (glm) for C


Improve glm_quat_conjugate

gottfriedleibniz opened this issue · comments

The current implementation of quat_conjugate is quite slow when compiled with SSE. For reference, here is clang's output.

Included in this link are alternate implementations, one of which can be easily extended to WASM and Neon, e.g.,

  float32x4_t mask = glmm_float32x4_init(-1.0f, -1.0f, -1.0f, 1.0f);
  glmm_store(dest, vmulq_f32(glmm_load(q), mask));

@gottfriedleibniz nice suggestion, thanks.

To avoid the mul overhead (if there is no special optimization for -1), it would be nice to do this without mul, as in your implementations on Godbolt:

extern
void glm_quat_conjugate_simd(versor q, versor dest) {
#if 0
  __m128i mask = _mm_set_epi32(0, GLMM_NEGZEROf, GLMM_NEGZEROf, GLMM_NEGZEROf);
  glmm_store(dest, _mm_xor_ps(glmm_load(q), _mm_castsi128_ps(mask)));
#else
  __m128 mask = _mm_set_ps(1.0f, -1.0f, -1.0f, -1.0f);
  glmm_store(dest, _mm_mul_ps(glmm_load(q), mask));
#endif
}

By defining GLMM__SIGNMASKf or glmm_float32x4_SIGNMASK_NNNP for SSE, NEON and WASM ... we could write:

CGLM_INLINE
void
glm_quat_conjugate(versor q, versor dest) {
#if defined(CGLM_SIMD)
  glmm_store(dest, glmm_xor(glmm_load(q), glmm_float32x4_SIGNMASK_NNNP));
#else
  dest[0] = -q[0];
  dest[1] = -q[1];
  dest[2] = -q[2];
  dest[3] =  q[3];
#endif
}

Currently there is no glmm_xor for WASM; improving the glmm_ API would make things like this easier.

Seems good.

Although, I wouldn't be surprised if the scalar equivalent is faster or as-fast on ARM64 (and maybe ARMv7). Some timing here would be nice, but can be done later.

Thanks,

> Although, I wouldn't be surprised if the scalar equivalent is faster or as-fast on ARM64 (and maybe ARMv7). Some timing here would be nice, but can be done later.

Sure, SIMD can be ignored if there are no benefits on ARM (or maybe on other platforms too); as you said, a benchmark could be done ASAP.