intel / hexl

Intel® Homomorphic Encryption Acceleration Library accelerates modular arithmetic operations used in homomorphic encryption.

Home Page: https://intel.github.io/hexl


How does HEXL perform against NFLLib?

rogpld opened this issue

I was wondering if your benchmarks could be tweaked a little to show the performance against NFLLib. NFLLib seems to be more popular, and the results would help clarify which library to choose.

Thanks @rogpld for your interest. I see several benchmarks in NFLLib, e.g. ntt_perfs and demo_main_func, which look similar to those in HEXL. Is there a specific comparison you're interested in?

Indeed. There are several benchmarks, and I would have run them myself, but unfortunately I don't have access to a machine supporting the new instructions being used. Here is what I propose.

Is the function ComputeForward the HEXL equivalent of the NTT in NFLlib? If so, would it be too much work to benchmark both functions? The NTT is the most expensive operation, so in my view that comparison makes the most sense. I see that in the HEXL benchmarks the function being evaluated is the internal one here. Does that make much of a difference compared to benchmarking the API call ComputeForward?

Thanks for your work.

Yes, HEXL's ComputeForward calls ForwardTransformToBitReverseAVX512<NTT::NTTImpl::s_ifma_shift_bits> when AVX512IFMA is supported. I agree, this looks like a fair comparison to NFLLib's ntt_perfs.

When I modified NFLLib's ntt_perfs.cpp to read

run<1024, 64, uint64_t>();
run<4096, 64, uint64_t>();
run<8192, 64, uint64_t>();
run<16384, 64, uint64_t>();

and configured via cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. -DNFL_OPTIMIZED=ON -DCMAKE_CXX_COMPILER=clang++-10, running ./tests/ntt_perfs yields

Polynomials of degree 1024 with 64 bit coefficients and 64 bit limbs
======================================================================
Time per NTT (org): 7.16438 us
Time per NTT (lib): 6.9047 us

Polynomials of degree 4096 with 64 bit coefficients and 64 bit limbs
======================================================================
Time per NTT (org): 35.3532 us
Time per NTT (lib): 33.6478 us

Polynomials of degree 8192 with 64 bit coefficients and 64 bit limbs
======================================================================
Time per NTT (org): 76.9284 us
Time per NTT (lib): 74.0504 us

Polynomials of degree 16384 with 64 bit coefficients and 64 bit limbs
======================================================================
Time per NTT (org): 168.195 us
Time per NTT (lib): 161.325 us

This looks a little better than the native C++ implementation in HEXL (see Table 1 in our whitepaper).

Note: these aren't official performance numbers, just ad-hoc measurements.

Cool. Can you enable AVX2 on NFLLib as well with -DNFL_OPTIMIZED=ON -DNTT_AVX2? Also, I believe it depends on the size of the prime modulus. NFLlib is using a 64-bit modulus; what is the size of the one in HEXL?

What is the performance difference between calling ForwardTransformToBitReverseAVX512 and ComputeForward?

Looks like -DNFL_OPTIMIZED=ON implies -DNTT_AVX2 on machines that support it (https://github.com/quarkslab/NFLlib/blob/master/CMakeLists.txt#L8-L21). Indeed, I see

-- Tring to use optimized version of NFL
-- Using AVX vector engine

when compiling with -DNFL_OPTIMIZED=ON.

There is minimal performance difference between ForwardTransformToBitReverseAVX512 and ComputeForward. The public API exposes only ComputeForward, which dispatches to ForwardTransformToBitReverseAVX512 / ForwardTransformToBitReverse64 as available (see

void NTT::NTTImpl::ComputeForward(uint64_t* result, const uint64_t* operand,
                                  uint64_t input_mod_factor,
                                  uint64_t output_mod_factor) {
  HEXL_CHECK(m_fwd_bit_shift == s_ifma_shift_bits ||
                 m_fwd_bit_shift == s_default_shift_bits,
             "Bit shift " << m_fwd_bit_shift << " should be either "
                          << s_ifma_shift_bits << " or "
                          << s_default_shift_bits);
  HEXL_CHECK(result != nullptr, "result == nullptr");
  HEXL_CHECK(operand != nullptr, "operand == nullptr");
  HEXL_CHECK_BOUNDS(
      operand, m_degree, m_p * input_mod_factor,
      "value in operand exceeds bound " << m_p * input_mod_factor);

  if (result != operand) {
    std::memcpy(result, operand, m_degree * sizeof(uint64_t));
  }

#ifdef HEXL_HAS_AVX512IFMA
  if (has_avx512ifma && m_fwd_bit_shift == s_ifma_shift_bits &&
      (m_p < s_max_fwd_ifma_modulus && (m_degree >= 16))) {
    const uint64_t* root_of_unity_powers = GetRootOfUnityPowersPtr();
    const uint64_t* precon_root_of_unity_powers =
        GetPrecon52RootOfUnityPowersPtr();
    HEXL_VLOG(3, "Calling 52-bit AVX512-IFMA NTT");
    ForwardTransformToBitReverseAVX512<s_ifma_shift_bits>(
        result, m_degree, m_p, root_of_unity_powers,
        precon_root_of_unity_powers, input_mod_factor, output_mod_factor);
    return;
  }
#endif

#ifdef HEXL_HAS_AVX512DQ
  if (has_avx512dq && m_degree >= 16) {
    HEXL_VLOG(3, "Calling 64-bit AVX512 NTT");
    const uint64_t* root_of_unity_powers = GetRootOfUnityPowersPtr();
    const uint64_t* precon_root_of_unity_powers =
        GetPrecon64RootOfUnityPowersPtr();
    ForwardTransformToBitReverseAVX512<s_default_shift_bits>(
        result, m_degree, m_p, root_of_unity_powers,
        precon_root_of_unity_powers, input_mod_factor, output_mod_factor);
    return;
  }
#endif

  HEXL_VLOG(3, "Calling 64-bit default NTT");
  const uint64_t* root_of_unity_powers = GetRootOfUnityPowersPtr();
  const uint64_t* precon_root_of_unity_powers =
      GetPrecon64RootOfUnityPowersPtr();
  ForwardTransformToBitReverse64(result, m_degree, m_p, root_of_unity_powers,
                                 precon_root_of_unity_powers, input_mod_factor,
                                 output_mod_factor);
}
).
The ForwardTransformToBitReverseAVX512 benchmarks exist just to make sure the AVX512 implementation is called.

The AVX512DQ implementation is valid for primes below ~61 bits.
The AVX512IFMA implementation is valid for primes below ~50 bits.
Currently, HEXL supports only uint64_t as the data/modulus representation.
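
For reference, here is a minimal sketch of exercising the public API, assuming HEXL's NTT class as in the snippet above (the header path varies by HEXL version, and p = 12289 is just a well-known NTT-friendly prime below 2^50, so it would be IFMA-eligible on supporting hardware):

#include <cstdint>
#include <vector>

#include "hexl/hexl.hpp"  // header path may differ in older HEXL versions

int main() {
  const uint64_t N = 1024;   // polynomial degree, a power of two
  const uint64_t p = 12289;  // NTT-friendly prime: p == 1 (mod 2N), p < 2^50

  intel::hexl::NTT ntt(N, p);

  std::vector<uint64_t> data(N, 1);
  // In-place forward NTT; input_mod_factor = output_mod_factor = 1 means
  // inputs and outputs are fully reduced mod p.
  ntt.ComputeForward(data.data(), data.data(), 1, 1);
  return 0;
}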

See below for an idea of the overhead of ForwardTransformToBitReverseAVX512 vs. ComputeForward.

// Calls ForwardTransformToBitReverseAVX512
BM_FwdNTT_AVX512DQ/1024/1/min_time:1.000            2.86 us         2.86 us       489442
BM_FwdNTT_AVX512DQ/4096/1/min_time:1.000            13.5 us         13.5 us       103822
BM_FwdNTT_AVX512DQ/16384/1/min_time:1.000           62.7 us         62.7 us        22341
// Calls ntt.ComputeForward
BM_FwdNTTInPlace/1024/min_time:1.000                2.84 us         2.83 us       493098
BM_FwdNTTInPlace/4096/min_time:1.000                13.4 us         13.4 us       104811
BM_FwdNTTInPlace/16384/min_time:1.000               62.2 us         62.1 us        22540
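
For context, these numbers come from Google Benchmark. A hypothetical registration along the following lines would produce output in the format above; this is a sketch, not HEXL's actual benchmark code, and p = 786433 (3*2^18 + 1) is just a placeholder NTT-friendly prime valid for all the degrees used here:

#include <cstdint>
#include <vector>

#include <benchmark/benchmark.h>

#include "hexl/hexl.hpp"  // header path may differ in older HEXL versions

// Hypothetical benchmark mirroring BM_FwdNTTInPlace above.
static void BM_FwdNTTInPlace(benchmark::State& state) {
  const uint64_t N = static_cast<uint64_t>(state.range(0));
  intel::hexl::NTT ntt(N, 786433);  // placeholder prime, p == 1 (mod 2N)
  std::vector<uint64_t> data(N, 1);
  for (auto _ : state) {
    ntt.ComputeForward(data.data(), data.data(), 1, 1);
  }
}
BENCHMARK(BM_FwdNTTInPlace)->Arg(1024)->Arg(4096)->Arg(16384)->MinTime(1.0);

BENCHMARK_MAIN();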

Thank you so much for the results and effort. It seems like HEXL is the best choice when IFMA or DQ is available. I have just a small objection regarding this:

run<1024, 64, uint64_t>();
run<4096, 64, uint64_t>();
run<8192, 64, uint64_t>();
run<16384, 64, uint64_t>();

They should be the same as in HEXL. I don't believe it will make much difference. But what I meant before is that if you are comparing against a 50-bit modulus in HEXL, don't you have to set it as follows?

run<1024, 50, uint32_t>();
run<4096, 50, uint32_t>();
run<8192, 50, uint32_t>();
run<16384, 50, uint32_t>();

Thank you again.

I'm not so familiar with how NFLLib works, but I'm seeing compiler errors with the setup you mention. Regardless, NFLLib appears to use the same implementation for every modulus, so I don't expect any runtime changes. I'd be happy to benchmark a fork of NFLLib if you have one.

My bad, you also have to comment out the baseline implementation call. I have done it in this gist; it will only report the performance results for NFLlib. I don't know if there will be any change in runtime. On my machine there seems to be some change, but I don't have AVX2.

Thanks, with this change, I see

Polynomials of degree 1024 with 50 bit coefficients and 32 bit limbs
======================================================================
Time per NTT (lib): 1.95534 us

Polynomials of degree 4096 with 50 bit coefficients and 32 bit limbs
======================================================================
Time per NTT (lib): 8.58862 us

Polynomials of degree 8192 with 50 bit coefficients and 32 bit limbs
======================================================================
Time per NTT (lib): 18.047 us

Polynomials of degree 16384 with 50 bit coefficients and 32 bit limbs
======================================================================
Time per NTT (lib): 39.5937 us

So, there is a good difference. It seems that NFLlib would be a good comparison as well, besides the NTL comparison you have already done in the paper. Would you consider listing up-to-date performance results in this repository?

Taking a closer look, it seems to me the modulus is only 30 bits in this case. E.g., adding
std::cout << "p.get_modulus(0) = " << p.get_modulus(0) << "\n"; here
yields p.get_modulus(0) = 1073479681, a 30-bit prime.

This stems from here, where params<T>::kModulusBitsize = 30 and we are passing in AggregatedModulusBitSize=50: integer division gives 50/30 => 1 coefficient modulus. This looks like incorrect usage: "aggregatedModulusBitsize is the size of the composite modulus to be used; it must be a multiple of params<T>::kModulusBitsize" (from here).
So I would expect the most recent numbers above to reflect performance on 30-bit moduli.
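
As a sanity check on the arithmetic, here is a small hypothetical illustration (not NFLlib's actual code) of how the limb count collapses to one:

#include <cstddef>
#include <iostream>

int main() {
  const std::size_t kModulusBitsize = 30;     // per-limb prime size for uint32_t limbs
  const std::size_t aggregated_bitsize = 50;  // requested by run<1024, 50, uint32_t>()

  // Integer division truncates: 50 / 30 == 1, i.e. a single 30-bit modulus.
  const std::size_t num_limbs = aggregated_bitsize / kModulusBitsize;
  std::cout << "number of 30-bit moduli: " << num_limbs << "\n";
  return 0;
}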

Does that make sense?

That seems right: they use ℓ polynomials through the CRT, according to the original paper. So to be fair we should actually benchmark polynomial multiplication, which would use the full aggregated modulus size. I'm not too familiar with either library (NTL/NFLLib); I'm just starting a new project and trying to figure out the best option.

So, just to confirm: does the modulus size have an impact on the results for HEXL? We've seen that it does for NFLlib. I was not able to find the exact modulus size used for the results in the whitepaper. It might not make a difference for HEXL, but for other libraries it does.

In HEXL, primes < 2^50 will use AVX512IFMA if available; primes ≥ 2^50 will use AVX512DQ if available.
The AVX512IFMA implementation is ~2x faster than the AVX512DQ implementation (see Table 1 above, #3 (comment)). In the absence of AVX512, HEXL uses a default implementation, which does not discriminate between modulus sizes, i.e. it has the same runtime for any word-sized modulus.
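
To make the dispatch rule concrete, here is a small sketch of the selection logic (it mirrors the checks in ComputeForward quoted above rather than quoting HEXL verbatim; function and parameter names are illustrative):

#include <iostream>

// Kernel selection by modulus size, per the discussion above: AVX512-IFMA for
// primes below ~2^50, AVX512-DQ for primes below ~2^61, else the default NTT.
const char* ExpectedKernel(int prime_bits, bool has_ifma, bool has_dq) {
  if (has_ifma && prime_bits < 50) return "52-bit AVX512-IFMA NTT";
  if (has_dq && prime_bits < 61) return "64-bit AVX512-DQ NTT";
  return "64-bit default NTT";
}

int main() {
  std::cout << ExpectedKernel(49, true, true) << "\n";    // IFMA path
  std::cout << ExpectedKernel(59, true, true) << "\n";    // DQ path
  std::cout << ExpectedKernel(59, false, false) << "\n";  // default path
  return 0;
}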