DLTcollab / sse2neon

A translator from Intel SSE intrinsics to Arm/Aarch64 NEON implementation

Correct _mm_min_ps, _mm_max_ps implementation

syoyo opened this issue

_mm_min_ps and _mm_max_ps cannot be accurately emulated with a single vminq_f32/vmaxq_f32 instruction (nor with vminnmq_f32/vmaxnmq_f32).

We need special handling when both inputs are zeros or when either input is NaN. In both cases the SSE instructions return the second operand.

https://tavianator.com/fast-branchless-raybounding-box-intersections-part-2-nans/
https://www.felixcloutier.com/x86/minps
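
To make the special cases concrete, here is a minimal scalar sketch of one lane of MINPS/MAXPS as I read the instruction reference linked above. The function names `sse_min_lane` and `sse_max_lane` are hypothetical and are not part of sse2neon. The first operand is returned only when the ordered comparison is true; otherwise (NaN in either input, or two zeros of any sign) the second operand is returned.

```c
#include <math.h> /* NAN, isnan, signbit are used when exercising the model */

/* Hypothetical scalar model of one MINPS lane (assumption: mirrors the
 * documented "return the second operand" rule for NaN and for two zeros).
 * a < b is false if either input is NaN and false for (+0, -0) or (-0, +0),
 * so all special cases fall through to b. */
static float sse_min_lane(float a, float b)
{
    return a < b ? a : b;
}

/* Same idea for one MAXPS lane. */
static float sse_max_lane(float a, float b)
{
    return a > b ? a : b;
}
```

For example, `sse_min_lane(NAN, 1.0f)` gives `1.0f` while `sse_min_lane(1.0f, NAN)` gives NaN, so the result depends on operand order. A symmetric min such as vminq_f32 (NaN if either input is NaN) or vminnmq_f32 (the number if exactly one input is NaN) cannot reproduce that. Note that this model must not be compiled with `-ffast-math`, which lets the compiler ignore NaN and signed zero.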

Here is an implementation of vmin/vmax that emulates _mm_min_ps/_mm_max_ps exactly (as far as I've tested):

lighttransport/embree-aarch64@e4a2f68

Thanks, @syoyo, for figuring out the accurate implementation. Can you send a pull request as well?

@jserv It requires some more tests to verify the implementation (you can also write tests). After the verification, I'm planning to send a PR.