(Rust binding) Repeated invocation of EltwiseFMAModAVX512 (with different data) in a loop has an unexpected performance regression
Janmajayamall opened this issue · comments
I am writing Rust bindings for HEXL here. I have added support for NTT operations and some element-wise operations. However, I am running into issues with the element-wise operations when the prime (i.e. q) is set to 50 bits. To see what's wrong, you can clone the repository and run `cargo bench modulus/elwise_fma_mod`. This will run the benches inside `benches/modulus.rs` prefixed with `elwise_fma_mod`, which use `EltwiseFMAModAVX512` internally, and will produce output like the following:
```
modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=1
time: [40.942 µs 40.978 µs 41.017 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=3
time: [122.65 µs 122.72 µs 122.80 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=5
time: [205.28 µs 205.52 µs 205.76 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=15
time: [616.00 µs 616.57 µs 617.19 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=1
time: [9.6013 µs 9.6061 µs 9.6115 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=3
time: [27.549 µs 27.647 µs 27.770 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=5
time: [81.501 µs 81.550 µs 81.607 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=15
time: [284.54 µs 287.81 µs 291.38 µs]
```
I have reduced the output to only the necessary items: bench name and time.
The bench `modulus/elwise_fma_mod_2d/*` benchmarks this function. The function simply takes two 2-dimensional (row-major) matrices `r0` and `r1` and a scalar, and calls `elwise_fma_mod` row-wise. `elwise_fma_mod` internally calls `EltwiseFMAModAVX512` here.
`n` is the row size, fixed at 32768, `logq` is the number of bits in the prime, and `mod_size` is the number of rows in the matrix. For example, `modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=1` calls `elwise_fma_mod` once (since there is only 1 row) with a 60-bit prime and vector size 32768, while `modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=3` calls `elwise_fma_mod` three times for 3 different rows (since mod_size is 3) with the rest of the parameters kept the same. Hence we should expect `modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=3` to take around 3x as long as `modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=1`, and indeed it does. The same holds for the other benches with n=32768, logq=60, and mod_size=5 / 15.
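For reference, the 2-D bench body is essentially the following (a runnable sketch only; `elwise_fma_mod` here is a scalar stand-in for my binding that calls `EltwiseFMAModAVX512`, and the names/signature are illustrative, not the exact binding API):

```rust
// Scalar stand-in for the binding that calls EltwiseFMAModAVX512.
// Computes r0[i] = (r0[i] * scalar + r1[i]) mod q element-wise.
fn elwise_fma_mod(r0: &mut [u64], r1: &[u64], scalar: u64, q: u64) {
    for (a, &b) in r0.iter_mut().zip(r1) {
        *a = ((*a as u128 * scalar as u128 + b as u128) % q as u128) as u64;
    }
}

// One elwise_fma_mod call per row: mod_size rows, each of length n.
fn elwise_fma_mod_2d(r0: &mut [u64], r1: &[u64], scalar: u64, q: u64, n: usize) {
    for (row0, row1) in r0.chunks_exact_mut(n).zip(r1.chunks_exact(n)) {
        elwise_fma_mod(row0, row1, scalar, q);
    }
}

fn main() {
    // Tiny example: mod_size = 2 rows, n = 4, q = 17, scalar = 3.
    let mut r0 = vec![1u64, 2, 3, 4, 5, 6, 7, 8];
    let r1 = vec![1u64, 1, 1, 1, 2, 2, 2, 2];
    elwise_fma_mod_2d(&mut r0, &r1, 3, 17, 4);
    println!("{r0:?}"); // prints [4, 7, 10, 13, 0, 3, 6, 9]
}
```

In the real benches, the per-row call goes through FFI into HEXL instead of this scalar loop.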
But things behave differently when logq is set to 50 bits (i.e. when `EltwiseFMAModAVX512` uses IFMA instead of DQ). `modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=3` is 3x `modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=1` as expected, but the same pattern does not hold when mod_size is either 5 or 15: for mod_size=5 it should be around 50 µs but is 81 µs, and for mod_size=15 it should be around 145 µs but is 287 µs. I have tried other values of `mod_size` and it gets worse as `mod_size` increases, that is, as the number of rows increases.
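Putting the logq=50 medians from the output above side by side makes the regression factor explicit (plain arithmetic, nothing HEXL-specific):

```rust
fn main() {
    // Median times (µs) from the logq=50 benches above, keyed by mod_size.
    let measured = [(1u32, 9.61f64), (3, 27.65), (5, 81.55), (15, 287.81)];
    let base = 9.61; // the mod_size = 1 median
    for (rows, t) in measured {
        // Linear scaling would predict rows * base.
        let expected = base * rows as f64;
        println!(
            "mod_size={rows}: expected ~{expected:.1} µs, measured {t:.1} µs, ratio {:.2}",
            t / expected
        );
    }
}
```

The ratio is ~1.0 up to mod_size=3, then jumps to ~1.7x at mod_size=5 and ~2.0x at mod_size=15.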
I am unable to pin down what causes this for 50-bit primes. Do you have any pointers? Or is this expected with IFMA?
Thanks!
Hello @Janmajayamall. Unfortunately, I no longer have access to machines that can run HEXL at full capability (using AVX512). I can tell you that modular reduction works differently depending on the `BitShift` variable.
Look at the functions in `fma_mod` that depend on `BitShift` here: https://github.com/intel/hexl/blob/development/hexl/eltwise/eltwise-fma-mod-avx512.cpp
The `BitShift` selection happens here: https://github.com/intel/hexl/blob/development/hexl/eltwise/eltwise-fma-mod.cpp
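Roughly, the selection amounts to the following (a Rust paraphrase of the C++ dispatch, for illustration only; the exact cutoff and names in HEXL may differ):

```rust
// Illustrative paraphrase of the BitShift selection in eltwise-fma-mod.cpp.
// The exact threshold HEXL uses may differ; this only shows the idea.
fn select_bit_shift(modulus: u64) -> u32 {
    if modulus < (1u64 << 50) {
        52 // products fit the 52-bit IFMA lanes -> AVX512-IFMA kernel
    } else {
        64 // larger moduli fall back to the AVX512-DQ kernel
    }
}

fn main() {
    println!("50-bit prime -> BitShift {}", select_bit_shift((1 << 49) + 1)); // 52
    println!("60-bit prime -> BitShift {}", select_bit_shift((1 << 59) + 1)); // 64
}
```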
Would you have the same behavior using logq = 48 or 46? Just curious.
Regards,
José Rojas
Hi @Janmajayamall,
You mentioned that you are trying to use the Intel Advanced Vector Extensions 512 Integer Fused Multiply Add (AVX512-IFMA52) instructions. These were introduced in the 3rd Gen Intel® Xeon® Scalable Processors (and onwards), so checking which CPU manufacturer and type you are using will be important.
The AVX512-IFMA52 should only be used for primes below 50–52 bits, assuming it suffices for your computation.
For more information on how HEXL uses the AVX512-IFMA52, please refer to:
https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-intel-hexl.html
and
https://arxiv.org/pdf/2103.16400.pdf
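To make the 52-bit limit concrete, here is a scalar model of what one AVX512-IFMA lane does (cf. `_mm512_madd52lo_epu64`); this is my own illustration, not HEXL code:

```rust
// Scalar model of one IFMA lane (cf. _mm512_madd52lo_epu64): form the
// 104-bit product of the low 52 bits of a and b, then add the product's
// low 52 bits to the 64-bit accumulator. This is why operands must stay
// below ~52 bits for the IFMA kernels to apply.
fn madd52lo(acc: u64, a: u64, b: u64) -> u64 {
    let mask = (1u64 << 52) - 1;
    let prod = (a & mask) as u128 * (b & mask) as u128;
    acc.wrapping_add((prod as u64) & mask)
}

fn main() {
    println!("{}", madd52lo(0, 3, 5)); // 15
    // 2^26 * 2^26 = 2^52: the low 52 bits of the product are all zero.
    println!("{}", madd52lo(10, 1 << 26, 1 << 26)); // 10
}
```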
Regards,
Flavio
@Janmajayamall
A description of the AVX512-IFMA52 intrinsics can be found here: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avx512techs=AVX512IFMA52&cats=Arithmetic
> Would you have the same behavior using logq = 48 or 46? Just curious.
```
modulus/elwise_fma_mod_2d/n=32768/logq=48/mod_size=1
time: [9.2788 µs 9.3188 µs 9.3523 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=48/mod_size=3
time: [28.762 µs 28.882 µs 28.987 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=48/mod_size=5
time: [76.355 µs 76.662 µs 76.946 µs]
modulus/elwise_fma_mod_2d/n=32768/logq=48/mod_size=15
time: [273.11 µs 276.14 µs 279.43 µs]
```
Yeah, it behaves the same for logq=48, and I can confirm the same for logq=46.
I don't suspect that this is due to calling the code from Rust (but I will still compare by implementing the same in C++).
If I understand correctly, the line here sets the `BitShift` value to 52 and uses IFMA, right?
> You mentioned that you are trying to use the Intel Advanced Vector Extensions 512 Integer Fused Multiply Add (AVX512-IFMA52) instructions. These were introduced in the 3rd Gen Intel® Xeon® Scalable Processors (and onwards), so checking which CPU manufacturer and type you are using will be important.
I am using a C3 machine on GCP (4th Gen Intel Xeon Scalable processor) that supports AVX512-IFMA. I don't think there are additional configs I need to enable for HEXL, or am I missing something?
I am curious whether you have any ideas about what could cause this?
Thanks!
The 4th Gen Intel Xeon Scalable processors do support AVX512-IFMA instructions. But just in case, assuming you are using Linux, can you check with the command `lscpu`?
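Since the bindings are in Rust, the standard library can also query the CPUID feature bits at runtime, which is a quick sanity check that the host actually exposes the instructions (this is only a detection sketch; HEXL performs its own dispatch internally):

```rust
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Query CPUID at runtime for the relevant AVX512 feature bits.
        println!("avx512dq:   {}", is_x86_feature_detected!("avx512dq"));
        println!("avx512ifma: {}", is_x86_feature_detected!("avx512ifma"));
    }
    #[cfg(not(target_arch = "x86_64"))]
    println!("not an x86_64 target");
}
```

If `avx512ifma` prints `false` inside the VM even though the host CPU supports it, the instruction set is not being exposed to the guest.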
As for how to make use of HEXL in an FHE library, I would suggest studying the integration of HEXL with MS SEAL and/or with OpenFHE.