Optimize fft_small for Intel CPUs
fredrik-johansson opened this issue · comments
According to Daniel, vroundpd is a bottleneck on Intel:
I noticed that things go noticeably faster on my Intel slab if the round function is replaced by basic arithmetic. I thought for sure rounding couldn't be slower than add/sub, but, sure enough, Intel have done it.
The cycle latencies on recent AMD and Intel chips are:

              AMD   Intel
    round:     3      8
    add/sub:   3      4
    mul:       3      4
    fmadd:     4      4
I looked into the generated code and it does not look that great (at least on Skylake). By unrolling and doing other work directly after the roundings, we can hide the latency penalties. Currently it wants to do a vroundpd directly followed by a vfnmadd132pd acting on the same register.
I couldn't guide the compiler to do what I wanted it to do, so I think we have to resort to inline assembly here.
Edit: GCC and Clang generate different sequences (the text above refers to GCC), but neither is optimal, and neither compiler seems amenable to being guided.
In relation to #1832, it would be nice to implement different subroutines in fft_small based on register width (128 bits for NEON, 256 bits for AVX2, etc.) and the number of such registers. This will probably be more efficient, as I haven't seen GCC optimize enough IMO. For AVX-512, there is the instruction VCVTTPD2QQ, which converts from double directly to a 64-bit integer, as compared to vroundpd.