# Meta-issue: make FLINT fast again
fredrik-johansson opened this issue
There are lots of glaring inefficiencies in FLINT. Some kind of overview of possible optimizations might be useful.
## General

- Use pretransformed operands (FFT'ed, Toom'ed, preinverted, ...) all over the place.
- Use the stack for more temporary allocations. When we need malloc, use a better implementation than the GNU one (tcmalloc, mimalloc, or whatever the state of the art is these days).
- More parallel algorithms.
## Integers

- Use inline few-limb arithmetic instead of GMP function calls in many places.
- Use alternatives to the `mpn` format in appropriate situations:
  - ~60-bit limbs with delayed carries (vectorization opportunities)
  - ~50-bit limbs with floating-point FMA (even more vectorization opportunities, especially with AVX512-IFMA)
- Use a flat byte-packed or limb-packed format for integer vectors (and by extension polynomials and matrices) instead of `fmpz *`. This would be more memory-efficient when entries are smaller than one limb or at least two limbs, it would avoid the branching and memory-allocation overheads of the `fmpz` type, and it would facilitate vectorization.
- Redesign `fmpz` to point directly to limb data instead of having an `mpz` in the middle (I've tried this before and it was slower, but I'm not convinced that I did it right).
- Faster single-limb and few-limb modular arithmetic.
- FFT code designed for short operands (a few thousand bits, up to the point where `fft_small` really becomes advantageous). The traditional way to do this is with very tightly coded floating-point FFTs, but NTTs might be competitive on recent CPUs when implemented very carefully. Joris van der Hoeven says he has good results with a codelet approach.
- Implement optimized CRT/multi_mod code and use it everywhere. There are vectorization opportunities when batched. Also improve the asymptotically fast code.
- Faster integer GCD and other operations based on `fft_small`.
- Faster generation of prime numbers.
- Faster integer factorization.
- Improve `fft_small`.
## Linear algebra

- More efficient SIMD-vectorized basecase nmod linear algebra (see NTL).
- Faster LLL using the flatter algorithm plus low-level optimizations. Note that flatter is now implemented in Pari/GP, which may be helpful as a reference.
## Polynomials

- Improve operations based on `fft_small`.
- Use alternative FFT/convolution algorithms in some situations.
- Implement basecase, Karatsuba and FFT-based middle products and use them in all appropriate places (e.g. Newton iteration).
- Use optimal algorithms for polynomial division.
- Many optimizations for power series functions.
- Do multipoint evaluation and interpolation on arithmetic sequences via fast Newton basis conversions instead of the Lagrange formula (https://mathexp.eu/bostan/publications/BoSc05.pdf). This saves a constant factor, and is potentially also better over approximate rings (needs checking).
- Related to the above, implement generic product trees and consider balancing them (maybe have both balanced and non-balanced versions).
- Improve `fmpz_poly_factor` and related functions.
- Use high-degree Toom multiplication in some cases.
- #352 (largely solved by basing things on generics)
## Finite fields

- Packed (8-bit/16-bit/32-bit...) representations.
## Reals

- Fixed-precision (machine-precision and few-word) floats, balls and intervals.
- Make all arb/acb operations adaptive to the output accuracy, not computing unneeded bits (similar to dot products).
- Improve the block-based polynomial and matrix multiplication (output-adaptivity, less overhead, tuning).
- Polynomial root-finding can be improved in numerous ways (both complex and real roots).
- Use fixed-point mpn code in more places.
- Elementary functions can be sped up 2-4x using better series evaluation and optimized lookup tables.