gnuradio / volk

The Vector Optimized Library of Kernels

Home Page: http://libvolk.org

ORC implementation of volk_32fc_magnitude_32f returns incorrect results on ARM

argilo opened this issue

Some users have noticed that Gqrx's AM demodulator does not work properly on the Raspberry Pi.

I dug into this, and found that on 32-bit ARM, volk_32fc_magnitude_32f's u_orc implementation produces an output of NaN when the input is zero. A sample of NaN then corrupts the state of an IIR filter, causing all of its future output to be NaN.
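To illustrate why a single NaN is so damaging, here is a minimal C sketch of a one-pole IIR filter. The `one_pole_iir` type and its helper functions are hypothetical and are not Gqrx's actual filter; they only show that once NaN enters the recursive state, every later output is NaN as well.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical one-pole IIR low-pass (not Gqrx's real filter):
 * y[n] = alpha * x[n] + (1 - alpha) * y[n-1] */
typedef struct {
    float alpha; /* smoothing factor in (0, 1] */
    float state; /* previous output y[n-1] */
} one_pole_iir;

/* Processes one sample and stores the result as the new state. */
static float one_pole_iir_step(one_pole_iir* f, float x)
{
    f->state = f->alpha * x + (1.0f - f->alpha) * f->state;
    return f->state;
}

/* Feeds n samples through the filter and returns how many outputs were NaN. */
static size_t count_nan_outputs(one_pole_iir* f, const float* in, size_t n)
{
    assert(f != NULL);
    size_t nan_count = 0;
    for (size_t i = 0; i < n; i++) {
        if (isnan(one_pole_iir_step(f, in[i]))) {
            nan_count++;
        }
    }
    return nan_count;
}
```

With `alpha = 0.1`, a zero initial state, and the input 1, NaN, 1, 1, 1, only the first output is a number. The NaN sample poisons `state`, and because NaN multiplied by any coefficient stays NaN, the filter never recovers unless its state is reset.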

It appears the root cause is that ORC's sqrtf instruction produces NaN when the input value is zero, on 32-bit ARM. ORC's test suite even makes an explicit exception for this case:

https://gitlab.freedesktop.org/gstreamer/orc/-/blob/72b4699314e0a6eeeb29cea33d0410612c80f533/orc-test/orctest.c#L587-591

Thus it would seem that it is not safe for VOLK to pass zero into sqrtf.

Several other kernels also use sqrtf:

  • volk_16ic_magnitude_16i
  • volk_16ic_s32f_magnitude_32f
  • volk_32f_sqrt_32f

I would presume they are similarly broken on 32-bit ARM.

I did not observe the same issue on 64-bit Raspberry Pi OS, but I suspect that is because aarch64 support was only added to ORC in version 0.4.33, while Raspberry Pi OS has version 0.4.32.

I suspect the current test suite would not catch this bug, since kernels are tested with random data. Perhaps some special-case values (0, 1, -1, std::numeric_limits<float>::max, std::numeric_limits<float>::min, std::numeric_limits<float>::epsilon, etc) should be included as well.

That's a very difficult issue to solve. Basically, we'd need to work around the intended way ORC works. Or does this issue arise because of something else? The comment in orctest.c implies this happens because of the specific implementation that is used in ORC on ARM.

"The comment in orctest.c implies this happens because of the specific implementation that is used in ORC on arm."

That's correct. Only the ARM implementation is broken. I'd consider it a serious bug that sqrt(0) = NaN, but it seems ORC doesn't since they added an exception in their test suite: https://gitlab.freedesktop.org/gstreamer/orc/-/merge_requests/66

Perhaps a suitable workaround would be to disable the four affected VOLK kernels on ARM. That change could be reverted if ORC someday fixes their ARM sqrt implementation.
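One possible shape for that workaround, sketched at the preprocessor level. `SKETCH_TARGET_IS_ARM` and `SKETCH_HAVE_ORC_SQRT` are hypothetical names, not existing VOLK macros, and the sketch defines `LV_HAVE_ORC` itself only so that it compiles standalone. VOLK's real build system may gate ORC kernels differently.

```c
#include <assert.h>

/* Assumption: pretend the build found ORC, so the sketch compiles standalone. */
#ifndef LV_HAVE_ORC
#define LV_HAVE_ORC 1
#endif

/* __arm__ and __aarch64__ are predefined by GCC and Clang for ARM targets. */
#if defined(__arm__) || defined(__aarch64__)
#define SKETCH_TARGET_IS_ARM 1
#else
#define SKETCH_TARGET_IS_ARM 0
#endif

/* Hypothetical gate: only trust ORC's sqrtf-based kernels off ARM. */
#if defined(LV_HAVE_ORC) && !SKETCH_TARGET_IS_ARM
#define SKETCH_HAVE_ORC_SQRT 1
#else
#define SKETCH_HAVE_ORC_SQRT 0
#endif

/* Compile-time guarantee that the gate never enables these kernels on ARM. */
static_assert(!(SKETCH_HAVE_ORC_SQRT && SKETCH_TARGET_IS_ARM),
              "ORC sqrt kernels must stay disabled on ARM");

/* Reports whether the sqrtf-based ORC kernels would be compiled in. */
static int orc_sqrt_kernels_enabled(void)
{
    return SKETCH_HAVE_ORC_SQRT;
}
```

A kernel header could then wrap its `u_orc` implementation in `#if SKETCH_HAVE_ORC_SQRT` instead of `#ifdef LV_HAVE_ORC`, and the gate could be relaxed again if ORC's ARM sqrt is fixed.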

Since the Gqrx reports suggest that the ORC implementation is slower than the alternatives, disabling these ORC kernels should not hurt.

On a Raspberry Pi 3B+ running 64-bit Raspberry Pi OS:

RUN_VOLK_TESTS: volk_32fc_magnitude_32f(131071,1987)
generic completed in 6106.05 ms
a_generic completed in 6125.19 ms
neon completed in 1808.37 ms
neon_fancy_sweet completed in 2292.84 ms
u_orc completed in 11231.1 ms
Best aligned arch: neon
Best unaligned arch: neon
RUN_VOLK_TESTS: volk_32f_sqrt_32f(131071,1987)
neon completed in 932.665 ms
generic completed in 12600.2 ms
u_orc completed in 13221.3 ms
Best aligned arch: neon
Best unaligned arch: neon

ORC is worse than generic in both cases, and much worse than neon.

For volk_16ic_magnitude_16i and volk_16ic_s32f_magnitude_32f, the ORC kernels are already disabled on all platforms:

#ifdef LV_HAVE_ORC_DISABLED
extern void volk_16ic_magnitude_16i_a_orc_impl(int16_t* magnitudeVector,
                                               const lv_16sc_t* complexVector,
                                               float scalar,
                                               unsigned int num_points);
static inline void volk_16ic_magnitude_16i_u_orc(int16_t* magnitudeVector,
                                                 const lv_16sc_t* complexVector,
                                                 unsigned int num_points)
{
    volk_16ic_magnitude_16i_a_orc_impl(
        magnitudeVector, complexVector, SHRT_MAX, num_points);
}
#endif /* LV_HAVE_ORC */

#ifdef LV_HAVE_ORC_DISABLED
extern void volk_16ic_s32f_magnitude_32f_a_orc_impl(float* magnitudeVector,
                                                    const lv_16sc_t* complexVector,
                                                    const float scalar,
                                                    unsigned int num_points);
static inline void volk_16ic_s32f_magnitude_32f_u_orc(float* magnitudeVector,
                                                      const lv_16sc_t* complexVector,
                                                      const float scalar,
                                                      unsigned int num_points)
{
    volk_16ic_s32f_magnitude_32f_a_orc_impl(
        magnitudeVector, complexVector, scalar, num_points);
}
#endif /* LV_HAVE_ORC */

damn, fancysweet isn't faster 👎

Regarding the benchmarks, #203 points out that ORC benchmarks include some one-time overhead. We might need to test this.
On the other hand, the ORC kernel test runs 1987 iterations and takes more than 13 s, while the NEON kernel finishes in less than 1 s. I really hope this one-time ORC overhead is not included...

This problem is now worse, because Debian 12 (and the latest Raspberry Pi OS) includes ORC 0.4.33, which adds support for 64-bit ARM. As a result, these kernels are now broken on both 32-bit and 64-bit ARM.