Optimize fft_small for Intel CPUs
fredrik-johansson opened this issue · comments
According to Daniel, vroundpd is a bottleneck on Intel:
I noticed that things go noticeably faster on my Intel slab if the round function is replaced by basic arithmetic. I thought for sure rounding couldn't be slower than add/sub, but, sure enough, Intel have done it.
The cycle latencies on recent AMD and Intel chips are:

              AMD   Intel
    round:     3      8
    add/sub:   3      4
    mul:       3      4
    fmadd:     4      4
I looked into the generated code and it does not look that great (at least on Skylake). By unrolling and doing other work directly after the roundings, we can hide the latency penalties. Currently it wants to do a vroundpd directly followed by a vfnmadd132pd acting on the same register.
I couldn't guide the compiler to do what I wanted it to do, so I think we have to resort to inline assembly here.
Edit: GCC and Clang generate different sequences (the text above refers to GCC), but neither is optimal, and neither compiler seems amenable to being guided.
In relation to #1832, it would be nice to implement different subroutines in fft_small based on register width (128 bits for NEON, 256 bits for AVX2, etc.) and the number of such registers. This will probably be more efficient, as I haven't seen GCC optimize enough IMO. For AVX-512, there is the instruction VCVTTPD2QQ, which converts from double directly to a 64-bit integer, as compared to vroundpd.