flintlib / flint

FLINT (Fast Library for Number Theory)

Home Page: http://www.flintlib.org

Meta-issue: make FLINT fast again

fredrik-johansson opened this issue

There are lots of glaring inefficiencies in FLINT. Some kind of overview of possible optimizations might be useful.

General

  • Use pretransformed operands (FFT'ed, Toom'ed, preinverted...) all over the place.

  • Use the stack for more temporary allocations (see the sketch after this list). When we need malloc, use a better implementation than the GNU one (tcmalloc, mimalloc, or whatever is the state of the art these days).

  • More parallel algorithms.
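
As a concrete illustration of the temporary-allocation point above, here is a minimal sketch using FLINT's TMP_INIT/TMP_START/TMP_ALLOC/TMP_END macros (which serve small requests from the stack and fall back to the heap for large ones) together with GMP's mpn functions; the helper itself is hypothetical, not FLINT API:

```c
#include "flint.h"

/* Hypothetical helper: res = a^2 + b on n-limb operands.  mpn_sqr
   requires an output buffer disjoint from its input, so scratch space
   is needed; TMP_ALLOC takes it from the stack for small n and only
   calls the heap allocator for large requests. */
void addsqr(mp_ptr res, mp_srcptr a, mp_srcptr b, slong n)
{
    TMP_INIT;
    TMP_START;
    mp_ptr tmp = TMP_ALLOC(2 * n * sizeof(mp_limb_t));

    mpn_sqr(tmp, a, n);                /* tmp = a^2 (2n limbs) */
    mpn_add(res, tmp, 2 * n, b, n);    /* res = a^2 + b (carry out ignored) */

    TMP_END;   /* frees any heap fallback; stack blocks cost nothing */
}
```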

Integers

  • Use inline few-limb arithmetic instead of GMP function calls in many places.

  • Use alternatives to the mpn format in appropriate situations:

    • ~60-bit limbs with delayed carries (vectorization opportunities; see the first sketch after this list)
    • ~50-bit limbs with floating-point FMA (even more vectorization opportunities, especially with AVX512-IFMA)
  • Use a flat byte-packed or limb-packed format for integer vectors (and by extension polys, matrices) instead of fmpz *. This would be more memory efficient when entries are smaller than one limb or occupy two or more limbs, it would avoid the branching and memory allocation overheads of the fmpz type, and it would facilitate vectorization.

  • Redesign fmpz to point directly to limb data instead of having an mpz in the middle (I've tried this before and it was slower, but I'm not convinced that I did it right).

  • Faster single-limb and few-limb modular arithmetic (see the second sketch after this list).

  • FFT code designed for short operands (a few thousand bits, up to the point where fft_small really becomes advantageous). The traditional way to do this is with very tightly coded floating-point FFTs, but NTTs might be competitive on recent CPUs when implemented very carefully. Joris van der Hoeven says he has good results with a codelet approach.

  • Implement optimized CRT/multi_mod code and use it everywhere. There are vectorization opportunities when batched. Also improve the asymptotically fast code.

  • Faster integer GCD and other operations based on fft_small.

  • #924

  • Faster generation of prime numbers.

  • Faster integer factorization.

  • Improve fft_small
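
As a from-scratch sketch of the delayed-carry representation (the first sub-bullet above; assumed code, nothing that exists in FLINT): 62-bit limbs stored in 64-bit words leave two bits of headroom, so up to three vector additions can be accumulated with a plain loop that compilers auto-vectorize, and carries are propagated only once at the end.

```c
#include <stdint.h>

#define LIMB_BITS 62
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* res += a entrywise with no carry propagation.  Starting from
   normalised inputs (all limbs < 2^62), up to three of these calls can
   be accumulated before the 64-bit words overflow. */
static void redundant_add(uint64_t *res, const uint64_t *a, long n)
{
    for (long i = 0; i < n; i++)   /* plain loop: auto-vectorizes */
        res[i] += a[i];
}

/* Propagate the delayed carries so every limb is < 2^62 again;
   returns the carry out of the top limb. */
static uint64_t normalise(uint64_t *res, long n)
{
    uint64_t cy = 0;
    for (long i = 0; i < n; i++)
    {
        uint64_t t = res[i] + cy;
        res[i] = t & LIMB_MASK;
        cy = t >> LIMB_BITS;
    }
    return cy;
}
```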
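
The single-limb modular arithmetic bullet above (and the "pretransformed operands" point in the General section) can be illustrated with Shoup-style multiplication by a fixed residue, where a quotient is precomputed once so that each product costs two multiplications and no division. FLINT already has functions in this spirit (n_mulmod_shoup, n_mulmod2_preinv); the version below is a simplified sketch assuming unsigned __int128 support and a modulus n < 2^63.

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Precompute floor(b * 2^64 / n) for a fixed multiplier b < n. */
static uint64_t mulmod_precomp(uint64_t b, uint64_t n)
{
    return (uint64_t)(((u128) b << 64) / n);
}

/* a * b mod n using the precomputed quotient: the estimate q is off by
   at most one, so the remainder lands in [0, 2n) and a single
   conditional subtraction finishes the job (needs n < 2^63). */
static uint64_t mulmod_shoup(uint64_t a, uint64_t b, uint64_t bpre,
                             uint64_t n)
{
    uint64_t q = (uint64_t)(((u128) a * bpre) >> 64);
    uint64_t r = a * b - q * n;   /* exact modulo 2^64 since r < 2n */
    return (r >= n) ? r - n : r;
}
```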

Linear algebra

  • #1508

  • #14

  • #1378

  • More efficient SIMD-vectorized basecase nmod linear algebra (see NTL, and the delayed-reduction sketch after this list).

  • #705

  • Faster LLL using the flatter algorithm + low-level optimizations. Note that flatter is now implemented in Pari/GP, which may be helpful as a reference.

  • #900

  • #389

  • #710

  • #901
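
For the SIMD-vectorized basecase nmod item above, the standard trick (cf. NTL) is to delay modular reductions so that the inner loop is a plain multiply-accumulate that compiles to integer SIMD. A minimal sketch, assuming residues below 2^30 stored in 32-bit words (not FLINT API):

```c
#include <stdint.h>

/* Dot product mod n with delayed reduction (sketch).  With n < 2^30
   each product is below 2^60, so sixteen of them plus a reduced
   accumulator still fit in 64 bits; the hot loop then contains no
   division and auto-vectorizes. */
static uint64_t nmod_dot(const uint32_t *a, const uint32_t *b,
                         long len, uint64_t n)
{
    uint64_t acc = 0;
    long i = 0;

    while (i < len)
    {
        long stop = (i + 16 < len) ? i + 16 : len;
        for (; i < stop; i++)
            acc += (uint64_t) a[i] * b[i];
        acc %= n;   /* one reduction per 16 terms */
    }
    return acc;
}
```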

Polynomials

  • Improve operations based on fft_small

  • #1144

  • Use alternative FFT/convolution algorithms in some situations

  • Implement basecase, Karatsuba and FFT-based middle products and use them in all appropriate places (e.g. Newton iteration); a basecase sketch is the first after this list.

  • Use optimal algorithms for polynomial division.

  • Many optimizations for power series functions.

  • Do multipoint evaluation and interpolation on arithmetic sequences via fast Newton basis conversions instead of the Lagrange formula (https://mathexp.eu/bostan/publications/BoSc05.pdf). This saves a constant factor and is potentially also better over approximate rings (needs checking).

  • Related to the above, implement generic product trees and consider balancing them (maybe have both balanced and non-balanced versions); a balanced-tree sketch is the second after this list.

  • Improve fmpz_poly_factor and related functions.

  • Use high-degree Toom multiplication in some cases.

  • #352 (largely solved by basing things on generics)

  • #676

  • #1177

  • #1210
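
For the middle product item above: the middle product of a (length na) by b (length nb <= na) is the slice c[nb-1..na-1] of the full product c = a*b, which is exactly what Newton iteration for power series inversion and division consumes, and fast variants (transposed Karatsuba/FFT) can compute it for roughly the cost of a half-size multiplication. A basecase sketch over the integers, with modular reductions omitted for brevity (illustrative, not FLINT API):

```c
#include <stdint.h>

/* res[i] = coefficient nb-1+i of a*b, for 0 <= i <= na-nb.  Only the
   wanted coefficients are touched, so this does (na-nb+1)*nb word
   multiplications instead of the na*nb of a full product. */
static void middle_product_basecase(uint64_t *res,
                                    const uint64_t *a, long na,
                                    const uint64_t *b, long nb)
{
    for (long i = 0; i <= na - nb; i++)
    {
        uint64_t s = 0;
        for (long j = 0; j < nb; j++)
            s += a[i + j] * b[nb - 1 - j];
        res[i] = s;
    }
}
```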
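
And for the product tree item, a minimal balanced sketch over fmpz, using only the public fmpz_init/fmpz_set/fmpz_mul/fmpz_clear API (the recursion itself is illustrative). Balancing keeps the operands of each multiplication near the root similarly sized, which is what lets fast multiplication pay off:

```c
#include "fmpz.h"

/* res = vec[0] * vec[1] * ... * vec[len-1], len >= 1, via balanced
   binary splitting: each half is multiplied recursively, so products
   near the root combine operands of comparable size. */
void prod_tree(fmpz_t res, const fmpz * vec, slong len)
{
    if (len == 1)
    {
        fmpz_set(res, vec);
        return;
    }

    fmpz_t left, right;
    fmpz_init(left);
    fmpz_init(right);

    prod_tree(left, vec, len / 2);
    prod_tree(right, vec + len / 2, len - len / 2);
    fmpz_mul(res, left, right);

    fmpz_clear(left);
    fmpz_clear(right);
}
```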

Finite fields

  • Packed (8-bit/16-bit/32-bit...) representations (a byte-packed sketch follows).
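
A minimal sketch of the packed idea, assuming p < 128 so that residues fit in bytes with headroom for one addition (illustrative, not FLINT API). Byte packing gives eight residues per 64-bit word, and the loop compiles to byte-lane SIMD:

```c
#include <stdint.h>

/* Entrywise addition in GF(p) on byte-packed vectors, p < 128.
   a[i] + b[i] <= 2(p-1) <= 252, so the byte sum cannot wrap, and the
   conditional subtraction becomes a branchless select under SIMD. */
static void gfp8_vec_add(uint8_t *res, const uint8_t *a,
                         const uint8_t *b, long len, uint8_t p)
{
    for (long i = 0; i < len; i++)
    {
        uint8_t s = a[i] + b[i];
        res[i] = (s >= p) ? s - p : s;
    }
}
```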

Reals

  • Fixed-precision (machine precision and few-word) floats, balls and intervals.

  • Make all arb/acb operations adaptive to the output accuracy, so that unneeded bits are not computed (similar to dot products).

  • Improve the block-based polynomial and matrix multiplication (output-adaptivity, less overhead, tuning).

  • Polynomial root-finding can be improved in numerous ways (both complex and real roots).

  • Use fixed-point mpn code in more places.

  • Elementary functions can be sped up 2-4x using better series evaluation and optimized lookup tables (sketched below).
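
To make the last point concrete, here is the lookup-table-plus-short-series scheme at double precision (purely illustrative; a production version would use fixed-point mpn arithmetic, a minimax polynomial, and proper special-case handling). Write x = m*log(2) + j/64 + t with |t| <= 1/128; then exp(x) = 2^m * exp(j/64) * exp(t), the middle factor comes from a 65-entry table, and a degree-4 Taylor polynomial for exp(t) already gives roughly 2^-40 relative error:

```c
#include <math.h>

static const double LOG2 = 0.6931471805599453;
static double exp_table[65];   /* exp(j/64) for j = -32..32 */

/* Call once before using exp_sketch. */
static void exp_table_init(void)
{
    for (int j = -32; j <= 32; j++)
        exp_table[j + 32] = exp(j / 64.0);
}

static double exp_sketch(double x)
{
    int m = (int) lround(x / LOG2);    /* power-of-two part */
    double r = x - m * LOG2;           /* |r| <= log(2)/2 */
    int j = (int) lround(r * 64.0);    /* table index, |j| <= 23 */
    double t = r - j / 64.0;           /* |t| <= 1/128 */

    /* exp(t) by a short Taylor series; the next term is t^5/120,
       about 2^-42 at |t| <= 1/128 */
    double p = 1.0 + t * (1.0 + t * (0.5 + t * (1.0/6.0 + t * (1.0/24.0))));

    return ldexp(exp_table[j + 32] * p, m);
}
```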