Fast modular squaring

Question

unzvfu opened this issue 4 years ago · comments

For sufficiently large numbers, squaring a number can be up to ~40% faster than simply multiplying it by itself (for example, see here). We should investigate whether we can obtain some performance benefit for numbers of the size used in Plonky.
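
For illustration, here is a minimal Rust sketch of where the saving comes from: the cross products a[i]*a[j] and a[j]*a[i] are equal, so a squaring can compute each one once and double the sum. The names `mul_wide` and `square_wide` are hypothetical, the operands are assumed to be little-endian slices of 64-bit limbs (not necessarily Plonky's representation), and no modular reduction is performed.

```rust
/// Schoolbook product of two little-endian limb slices.
/// Uses a.len() * b.len() limb multiplications.
fn mul_wide(a: &[u64], b: &[u64]) -> Vec<u64> {
    let mut out = vec![0u64; a.len() + b.len()];
    for i in 0..a.len() {
        let mut carry = 0u128;
        for j in 0..b.len() {
            // a[i]*b[j] + out + carry always fits in 128 bits.
            let t = (a[i] as u128) * (b[j] as u128) + (out[i + j] as u128) + carry;
            out[i + j] = t as u64;
            carry = t >> 64;
        }
        out[i + b.len()] = carry as u64;
    }
    out
}

/// Schoolbook squaring that exploits a[i]*a[j] == a[j]*a[i].
/// Uses n*(n+1)/2 limb multiplications for n limbs.
fn square_wide(a: &[u64]) -> Vec<u64> {
    let n = a.len();
    let mut out = vec![0u64; 2 * n];
    // 1. Off-diagonal products a[i]*a[j] for i < j, each computed once.
    for i in 0..n {
        let mut carry = 0u128;
        for j in (i + 1)..n {
            let t = (a[i] as u128) * (a[j] as u128) + (out[i + j] as u128) + carry;
            out[i + j] = t as u64;
            carry = t >> 64;
        }
        out[i + n] = carry as u64;
    }
    // 2. Double the off-diagonal sum (shift left by one bit).
    let mut top = 0u64;
    for limb in out.iter_mut() {
        let next = *limb >> 63;
        *limb = (*limb << 1) | top;
        top = next;
    }
    // 3. Add the diagonal squares a[i]^2 at limb position 2*i.
    let mut carry = 0u128;
    for i in 0..n {
        let sq = (a[i] as u128) * (a[i] as u128);
        let lo = (out[2 * i] as u128) + (sq as u64 as u128) + carry;
        out[2 * i] = lo as u64;
        let hi = (out[2 * i + 1] as u128) + (sq >> 64) + (lo >> 64);
        out[2 * i + 1] = hi as u64;
        carry = hi >> 64;
    }
    out
}
```

Both functions return the same 2n-limb result for the same input; only the number of limb multiplications differs.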

Note that the existing Montgomery modmul implementation is CIOS (i.e. it interleaves multiplication and reduction). Original claim, retracted in the comments below: to implement fast squaring we first need Montgomery REDC, and we lose the (mild) performance benefits of the CIOS layout.
Update: we can combine the fast squaring method with modular reduction in the CIOS layout in a fairly straightforward manner. This is mentioned (but not explained) in Acar's thesis, Section 2.3. Another useful reference is this post by the Goff developers (that also includes a cute trick to skip some additions for some moduli).
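
For readers unfamiliar with the layout, the following is a minimal sketch of textbook CIOS Montgomery multiplication, in which each outer iteration performs one row of the multiplication followed immediately by one reduction step. It is not Plonky's implementation: the function names, the slice-based signature, and the helper `neg_inv_u64` are assumptions made for illustration, and the code assumes an odd modulus `m` and inputs `a, b < m`.

```rust
/// Returns -m0^{-1} mod 2^64 for odd m0 (Newton iteration; hypothetical helper).
fn neg_inv_u64(m0: u64) -> u64 {
    let mut inv = 1u64; // correct modulo 2
    for _ in 0..6 {
        // Each step doubles the number of correct low bits: 1 -> 64.
        inv = inv.wrapping_mul(2u64.wrapping_sub(m0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Returns the (low, high) limbs of a*b + c + d, which always fits in 128 bits.
fn mac(a: u64, b: u64, c: u64, d: u64) -> (u64, u64) {
    let t = (a as u128) * (b as u128) + (c as u128) + (d as u128);
    (t as u64, (t >> 64) as u64)
}

/// Computes a * b * R^{-1} mod m with R = 2^(64 * n), little-endian limbs.
fn mont_mul_cios(a: &[u64], b: &[u64], m: &[u64], m_neg_inv: u64) -> Vec<u64> {
    let n = m.len();
    assert!(n > 0 && a.len() == n && b.len() == n);
    let mut t = vec![0u64; n + 2];
    for i in 0..n {
        // Multiplication step: t += a * b[i].
        let mut carry = 0u64;
        for j in 0..n {
            let (lo, hi) = mac(a[j], b[i], t[j], carry);
            t[j] = lo;
            carry = hi;
        }
        let (s, c) = t[n].overflowing_add(carry);
        t[n] = s;
        t[n + 1] = c as u64;
        // Reduction step: add q*m so the low limb becomes zero,
        // then shift the accumulator down by one limb.
        let q = t[0].wrapping_mul(m_neg_inv);
        let (_, mut carry) = mac(q, m[0], t[0], 0);
        for j in 1..n {
            let (lo, hi) = mac(q, m[j], t[j], carry);
            t[j - 1] = lo;
            carry = hi;
        }
        let (s, c) = t[n].overflowing_add(carry);
        t[n - 1] = s;
        t[n] = t[n + 1] + c as u64;
    }
    // Final conditional subtraction: the accumulator is below 2m here.
    let mut r = vec![0u64; n];
    let mut borrow = false;
    for j in 0..n {
        let (d, b1) = t[j].overflowing_sub(m[j]);
        let (d, b2) = d.overflowing_sub(borrow as u64);
        r[j] = d;
        borrow = b1 || b2;
    }
    if t[n] != 0 || !borrow {
        r
    } else {
        t.truncate(n);
        t
    }
}
```

As a usage example under these assumptions, take the modulus m = 2^256 - 189, for which R mod m = 189. The Montgomery forms of 2 and 3 are then 378 and 567, and their Montgomery product is the Montgomery form of 6, namely 1134.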

Daniel Lubarov · Answer 1 · Mon Aug 10 2020 08:48:41 GMT+0800 (China Standard Time)

> To implement fast squaring we first need Montgomery REDC and we lose the (mild) performance benefits of CIOS layout.

That sounds good to me -- I would expect REDC-only to be faster overall than CIOS-only since a lot of our field ops are squarings and cubings.

Hamish Ivey-Law · Answer 2 · Tue Aug 11 2020 20:38:18 GMT+0800 (China Standard Time)

> I would expect REDC-only to be faster overall than CIOS-only since a lot of our field ops are squarings and cubings.

It really depends on how much faster the squaring code is. mult+REDC is slower than CIOS (though not by a lot), so the speedup from squaring has to cover that gap, and then some, in order to come out ahead. In theory a squaring needs a bit more than half as many (sub)multiplications as a full multiplication, but in practice you never get a 2-fold speedup. I said ~40% above, although honestly that's just a guess. You tend to get better results with bigger numbers, and our numbers are not especially big in the scheme of things. We'll have to implement and see.
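
A rough operation count makes the "never a 2-fold speedup" point concrete. The sketch below assumes schoolbook arithmetic on n 64-bit limbs and a word-by-word Montgomery reduction costing n^2 + n limb multiplications (n per step for q*m, plus one to compute q). The function names are hypothetical, and the count ignores additions, carries and memory traffic, so it only bounds what squaring can save.

```rust
/// Limb multiplications in a schoolbook n-limb product.
fn product_limb_mults(n: u64) -> u64 {
    n * n
}

/// Limb multiplications in a schoolbook n-limb squaring:
/// n diagonal terms plus n*(n-1)/2 distinct cross terms.
fn square_limb_mults(n: u64) -> u64 {
    n * (n + 1) / 2
}

/// Limb multiplications in word-by-word Montgomery reduction:
/// each of the n steps computes q (1 mult) and q*m (n mults).
fn reduction_limb_mults(n: u64) -> u64 {
    n * n + n
}

/// Total for a Montgomery multiplication.
fn mont_mul_limb_mults(n: u64) -> u64 {
    product_limb_mults(n) + reduction_limb_mults(n)
}

/// Total for a Montgomery squaring; the reduction part is unchanged.
fn mont_square_limb_mults(n: u64) -> u64 {
    square_limb_mults(n) + reduction_limb_mults(n)
}
```

For an assumed size of n = 4 limbs this gives 16 versus 10 limb multiplications for the product alone, but 36 versus 30 once the reduction is included, because the reduction does not benefit from the symmetry.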

Hamish Ivey-Law · Answer 3 · Mon Sep 21 2020 20:53:20 GMT+0800 (China Standard Time)

So I was actually completely wrong about fast squaring needing to be separated from reduction (i.e. computing a wide square followed by a REDC; called Separated Operand Scanning). It is actually fairly straightforward to combine fast squaring with reduction in the CIOS layout; I have a working prototype that I will be submitting soon.
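
For comparison, the separated approach can be sketched as a standalone REDC that takes an already computed 2n-limb value (for instance a wide square) and reduces it in a second pass. This is a minimal illustration, not the prototype mentioned above: the names `redc`, `mac` and `neg_inv_u64` are hypothetical, the modulus is assumed odd, and the input is assumed to be less than m * 2^(64n).

```rust
/// Returns -m0^{-1} mod 2^64 for odd m0 (Newton iteration; hypothetical helper).
fn neg_inv_u64(m0: u64) -> u64 {
    let mut inv = 1u64;
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(m0.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Returns the (low, high) limbs of a*b + c + d, which always fits in 128 bits.
fn mac(a: u64, b: u64, c: u64, d: u64) -> (u64, u64) {
    let t = (a as u128) * (b as u128) + (c as u128) + (d as u128);
    (t as u64, (t >> 64) as u64)
}

/// Montgomery reduction of a 2n-limb value: returns wide * R^{-1} mod m,
/// with R = 2^(64 * n) and little-endian limbs.
fn redc(wide: &[u64], m: &[u64], m_neg_inv: u64) -> Vec<u64> {
    let n = m.len();
    assert_eq!(wide.len(), 2 * n);
    let mut t = wide.to_vec();
    t.push(0); // one extra limb for the final carry
    for i in 0..n {
        // Choose q so that adding q * m * 2^(64*i) clears limb i.
        let q = t[i].wrapping_mul(m_neg_inv);
        let mut carry = 0u64;
        for j in 0..n {
            let (lo, hi) = mac(q, m[j], t[i + j], carry);
            t[i + j] = lo;
            carry = hi;
        }
        // Propagate the remaining carry into the upper limbs.
        let mut k = i + n;
        while carry != 0 {
            let (s, c) = t[k].overflowing_add(carry);
            t[k] = s;
            carry = c as u64;
            k += 1;
        }
    }
    // The low n limbs are now zero; the result is the upper part, minus m if needed.
    let hi = &t[n..2 * n];
    let mut r = vec![0u64; n];
    let mut borrow = false;
    for j in 0..n {
        let (d, b1) = hi[j].overflowing_sub(m[j]);
        let (d, b2) = d.overflowing_sub(borrow as u64);
        r[j] = d;
        borrow = b1 || b2;
    }
    if t[2 * n] != 0 || !borrow {
        r
    } else {
        hi.to_vec()
    }
}
```

With the assumed modulus m = 2^256 - 189 (so R mod m = 189), reducing the wide value 378^2 = 142884 yields 756, the Montgomery form of 4. The cost of this layout is that the full 2n-limb intermediate must be written out and read back, which is the overhead CIOS avoids.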

Not sure why I thought we needed to do the SOS method; probably I was thinking of the fast multiplication techniques (Karatsuba-Ofman, Toom-Cook, etc.) for which that is true. Anyway, glad I was wrong! I'll update the description to reflect this.