qa_volk_32f_x3_sum_of_poly_32f sometimes fails on ARM

Question

argilo opened this issue 8 months ago

Example of failing CI run: https://github.com/gnuradio/volk/actions/runs/6498731447/job/17650685199

The errors exceed the allowed tolerance. I'm not sure whether the tolerance is too tight or the kernels are not accurate enough.

I was able to reproduce the failure on a Raspberry Pi by running the test in a loop until it fails:

while ctest -R qa_volk_32f_x3_sum_of_poly_32f --output-on-failure; do :; done

The failures occur often for a_neon and somewhat less frequently for neonvert.

Clayton Smith · Answer 1 · Sat Oct 14 2023 06:33:47 GMT+0800 (China Standard Time)

I think it's the kernels that are the problem. The generic implementation improves the accuracy of the floating point sums by splitting the input into 8 pieces and summing those separately before combining the sums. But the neon implementation does things the naive way, with a single running sum.
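As a rough illustration of the difference described above, here is a minimal C sketch. It is not the VOLK source: the function names sum_naive and sum_split8 are hypothetical, and the interleaved assignment of inputs to the 8 partial sums is an assumption about what "splitting into 8 pieces" means.

```c
#include <stddef.h>

/* Hypothetical helper: naive summation with a single running accumulator. */
float sum_naive(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Hypothetical helper: 8 interleaved partial sums, combined pairwise at the end.
 * Each partial sum stays smaller than one big running sum would, so less of
 * each small addend is rounded away. */
float sum_split8(const float *x, size_t n)
{
    float acc[8] = { 0.0f };
    size_t i = 0;

    /* Full groups of 8: element i + k goes to accumulator k. */
    for (; i + 8 <= n; i += 8) {
        for (size_t k = 0; k < 8; k++) {
            acc[k] += x[i + k];
        }
    }
    /* Leftover elements (fewer than 8) go to the first accumulator. */
    for (; i < n; i++) {
        acc[0] += x[i];
    }
    /* Combine the partial sums pairwise. */
    return ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
           ((acc[4] + acc[5]) + (acc[6] + acc[7]));
}
```

Assuming IEEE 754 single precision with round-to-nearest-even and no excess intermediate precision, feeding both functions the 16-element input 16777216 followed by fifteen 1s gives 16777216 from sum_naive and 16777230 from sum_split8, while the exact sum is 16777231.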

The neon implementation also multiplies every input value by the polynomial coefficients, instead of doing a single multiplication at the end. I suppose that might explain why it's more than three times slower than the generic implementation on my Raspberry Pi!

Clayton Smith · Answer 2 · Sat Oct 14 2023 06:44:11 GMT+0800 (China Standard Time)

> The neon implementation also multiplies every input value by the polynomial coefficients, instead of doing a single multiplication at the end.

Actually the generic implementation does that too, so there must be another explanation for the slowdown. I see the neon implementation does a bunch of operations on vectors containing copies of the same input, which can't be efficient...

Clayton Smith · Answer 3 · Sat Oct 14 2023 07:40:05 GMT+0800 (China Standard Time)

Actually I think what may be happening here is that the generic implementation is not particularly accurate, but the x86 implementations copy its floating-point order of operations exactly, and so they accumulate the same errors and arrive at the same value. But the neon implementations do things in a different order, and are penalized for being far from the generic implementation's result, even though they're just as close to the true answer.
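To make the point about order of operations concrete, here is a small C sketch. It has nothing to do with the actual VOLK kernels: sum_forward, sum_backward and sum_reference are hypothetical names, and the double-precision sum stands in for the "true answer".

```c
#include <stddef.h>

/* Hypothetical helper: float sum, first element to last. */
float sum_forward(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Hypothetical helper: float sum over the same data, last element to first. */
float sum_backward(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = n; i-- > 0; ) {
        acc += x[i];
    }
    return acc;
}

/* Hypothetical helper: double-precision sum used as the reference value. */
double sum_reference(const float *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += (double)x[i];
    }
    return acc;
}
```

Assuming IEEE 754 single precision with round-to-nearest-even, the input {16777216, 1, 2} gives 16777218 from sum_forward and 16777220 from sum_backward, while the reference is 16777219. Each float result is off by 1 from the reference, but the two differ from each other by 2, so a test that compares one implementation against the other sees twice the error that either one actually has.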

The result (which is basically a summation) can end up being close to zero, in which case the relative error can be very large even though the absolute error is small. So the test for this kernel should probably use an absolute error tolerance.
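The difference between the two kinds of check can be sketched in C as follows. The function names within_relative and within_absolute are hypothetical and this is not how the VOLK QA code is written; it only shows why a relative check becomes very strict when the expected value is near zero.

```c
#include <stdbool.h>

/* Hypothetical helper: absolute value of a float without needing libm. */
static float abs_diff(float a, float b)
{
    float d = a - b;
    return d < 0.0f ? -d : d;
}

/* Relative check: the allowed error scales with the size of the expected value. */
bool within_relative(float expected, float actual, float tol)
{
    float scale = expected < 0.0f ? -expected : expected;
    return abs_diff(actual, expected) <= tol * scale;
}

/* Absolute check: the allowed error is fixed, whatever the expected value is. */
bool within_absolute(float expected, float actual, float tol)
{
    return abs_diff(actual, expected) <= tol;
}
```

With a tolerance of 1e-3, an expected value of 1e-6 and an actual value of 3e-6 fail the relative check but pass the absolute one. The trade-off runs the other way for large results: 1000.5 against an expected 1000 passes the relative check and fails the absolute one, so an absolute tolerance has to be chosen with the typical magnitude of the output in mind.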