precision issue with rsqrt/sqrt/rcp/div
shord opened this issue · comments
Hi,
I have been developing with SSE for many years, and have now started porting some code to native ARM.
sse2neon works great, and I'm very happy with it, but I noticed some precision incompatibilities.
To my knowledge there are only two non-exact instructions in SSE: rsqrt and rcp (p and s versions). Both have about 11 bits of accuracy.
NEON has vrsqrteq_f32 and vrecpeq_f32, which give about 8 bits of accuracy (under 1% error), and vrsqrtsq_f32 and vrecpsq_f32, which are used to improve accuracy. One Newton-Raphson step gets about 16 correct bits, and two steps seem to reach the full 24 bits.
_mm_rsqrt_ps currently does no Newton-Raphson iterations with SSE2NEON_PRECISE_SQRT=0 and two with SSE2NEON_PRECISE_SQRT=1; it should do one. Right now the options are 8 bits or 24 bits, not the 16 bits that would match the expected 11 bits on Intel.
_mm_sqrt_ps is expected to be exact: it should use vsqrtq_f32 on __aarch64__ when available, and do two Newton-Raphson iterations on the non-__aarch64__ path. Currently it is fine on __aarch64__ with the default SSE2NEON_PRECISE_SQRT=0, but it returns only 8 bits of accuracy on non-__aarch64__ architectures. With SSE2NEON_PRECISE_SQRT=1 it uses the long vrsqrteq_f32 + 2*vrsqrtsq_f32 sequence even when __aarch64__ is available.
_mm_rcp_ps is correct with the default (SSE2NEON_PRECISE_DIV off), and does a pointless additional step with SSE2NEON_PRECISE_DIV on.
_mm_div_ps is in the same situation as _mm_sqrt_ps: it is expected to be exact, and should use vdivq_f32 on __aarch64__ when available, or otherwise do two Newton-Raphson iterations.
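For division, the same idea could look like the sketch below. `div4` is a hypothetical name and this is not sse2neon's code; the final branch exists only so the sketch builds on non-ARM hosts. The ARMv7 branch gets close to full precision but is not guaranteed to match IEEE division bit for bit, and it does not treat zero or infinite divisors specially.

```c
#if defined(__aarch64__) || defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#endif

/* Hypothetical helper: a / b for four floats. */
static inline void div4(const float a[4], const float b[4], float out[4])
{
#if defined(__aarch64__)
    /* AArch64 has a real vector divide. */
    vst1q_f32(out, vdivq_f32(vld1q_f32(a), vld1q_f32(b)));
#elif defined(SKETCH_HAS_NEON)
    float32x4_t d = vld1q_f32(b);
    float32x4_t r = vrecpeq_f32(d);          /* coarse 1/b            */
    r = vmulq_f32(r, vrecpsq_f32(d, r));     /* NR step 1: r*(2-d*r)  */
    r = vmulq_f32(r, vrecpsq_f32(d, r));     /* NR step 2             */
    vst1q_f32(out, vmulq_f32(vld1q_f32(a), r));
#else
    /* Stand-in for non-ARM hosts. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] / b[i];
#endif
}
```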
The unit tests need to be updated too: a 0.1% tolerance for rsqrt/rcp, and probably result*FLT_EPS (maybe times 2-3) for div/sqrt.
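The two tolerances could be expressed with a pair of small helpers like these. The function names are hypothetical, FLT_EPS is assumed to mean the standard FLT_EPSILON from `<float.h>`, and the factor 3 is just the upper end of the "2-3" suggested above.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* True if result is within rel_tol (relative) of expected. */
static inline bool within_rel(float result, float expected, float rel_tol)
{
    return fabsf(result - expected) <= rel_tol * fabsf(expected);
}

/* Estimate instructions (rsqrt, rcp): 0.1% tolerance. */
static inline bool check_estimate(float result, float expected)
{
    return within_rel(result, expected, 1e-3f);
}

/* Instructions expected to be (nearly) exact (div, sqrt): a few epsilons. */
static inline bool check_exact(float result, float expected)
{
    return within_rel(result, expected, 3.0f * FLT_EPSILON);
}
```

A test would then call `check_estimate(got, reference)` for rsqrt/rcp results and `check_exact(got, reference)` for div/sqrt results, with the reference computed in scalar code.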
An option to use lower precision could be added, but it shouldn't be on by default, and it probably needs to be more fine-grained, for example rsqrt_8_BITS, rcp_8_BITS, div_16_bits, and sqrt16_bits. But I don't see them being that useful.
Here is some code from Skia that seems to do it correctly:
https://chromium.googlesource.com/skia/+/chrome/m69/src/opts/SkNx_neon.h
and, for reference, the SSE version:
https://chromium.googlesource.com/skia/+/chrome/m69/src/opts/SkNx_sse.h
See WebRTC's vector_math.h for a possible implementation.