DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

Comment regarding _mm_dp_pd

jnettlet opened this issue

The following comment about the sse2neon implementation was pointed out to me: "There's a small issue with the _mm_dp_pd impl, though: you do the mul first and then mask the result according to imm8[4:5]. That is the opposite order to what the sse41 definition says; the sse41 original will avoid unwanted ops, eg. on NaNs, your impl won't"

Having reviewed the Intel Intrinsics Guide entry at https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_dp_pd&expand=2714, that does seem to be the case.

The _mm_dp_pd implementation does differ slightly from the description of _mm_dp_pd in the Intel Intrinsics Guide.
So far, I have not found a NEON intrinsic that could replace the current implementation and avoid the unwanted operations.
Hence, to follow the definition of _mm_dp_pd, at least an if-else statement would have to be added, which introduces an additional branch.
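
To make the ordering concrete, here is a minimal scalar sketch of the select-then-multiply order as I read it from the Intrinsics Guide pseudocode. It is only a model, not the sse2neon code: `vec2d` and `dp_pd_model` are hypothetical names made up for illustration, and plain doubles stand in for `__m128d`. It shows which lanes get multiplied, and it makes no claim about which floating-point flags a particular compiler or target would actually raise.

```c
#include <assert.h>
#include <math.h> /* NAN */

/* Hypothetical two-lane stand-in for __m128d; lane[0] is the low element. */
typedef struct {
    double lane[2];
} vec2d;

/* Scalar model of _mm_dp_pd: check the mask bit first and multiply a lane
 * only when it is selected (this is where the if-else comes from). */
vec2d dp_pd_model(vec2d a, vec2d b, int imm8)
{
    assert((imm8 & ~0xFF) == 0); /* imm8 is an 8-bit immediate */

    double temp[2];
    for (int j = 0; j < 2; j++) {
        /* Bits 4 and 5 gate the multiplication of lanes 0 and 1. */
        if (imm8 & (0x10 << j))
            temp[j] = a.lane[j] * b.lane[j];
        else
            temp[j] = 0.0; /* unselected inputs are never multiplied */
    }

    double sum = temp[1] + temp[0];

    vec2d dst;
    for (int j = 0; j < 2; j++) {
        /* Bits 0 and 1 choose which output lanes receive the sum. */
        dst.lane[j] = (imm8 & (0x01 << j)) ? sum : 0.0;
    }
    return dst;
}
```

For example, with a = {NaN, 2.0}, b = {1.0, 3.0} and imm8 = 0x21, only lane 1 is multiplied, so the NaN in lane 0 is never touched and the low output lane is 6.0.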

@jnettlet I guess your concern is that multiplying a signaling NaN (SNaN) raises an exception or sets the invalid-operation flag?

That is the concern, although I currently have no direct implementations that trigger this in any of the projects I am testing with sse2neon.