flintlib / flint

FLINT (Fast Library for Number Theory)

Home Page: http://www.flintlib.org

Meta-issue: make FLINT fast again

fredrik-johansson opened this issue

There are lots of glaring inefficiencies in FLINT. Some kind of overview of possible optimizations might be useful.

General

  • Use pretransformed operands (FFT'ed, Toom'ed, preinverted...) all over the place.

  • Use the stack for more temporary allocations (see the sketch after this list). When we need malloc, use a better implementation than the GNU one (tcmalloc, mimalloc, or whatever is the state of the art these days).

  • More parallel algorithms.
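
As a concrete illustration of the temporary-allocation point above, here is a minimal sketch using FLINT's TMP_INIT/TMP_START/TMP_ALLOC/TMP_END macros (which serve small requests from the stack and fall back to the heap for large ones) together with GMP's mpn functions; the helper itself is hypothetical, not FLINT API:

```c
#include "flint.h"

/* Hypothetical helper: res = a^2 + b on n-limb operands.  mpn_sqr
   requires an output buffer disjoint from its input, so scratch space
   is needed; TMP_ALLOC takes it from the stack for small n and only
   calls the heap allocator for large requests. */
void addsqr(mp_ptr res, mp_srcptr a, mp_srcptr b, slong n)
{
    TMP_INIT;
    TMP_START;
    mp_ptr tmp = TMP_ALLOC(2 * n * sizeof(mp_limb_t));

    mpn_sqr(tmp, a, n);                /* tmp = a^2 (2n limbs) */
    mpn_add(res, tmp, 2 * n, b, n);    /* res = a^2 + b (carry out ignored) */

    TMP_END;   /* frees any heap fallback; stack blocks cost nothing */
}
```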

Integers

  • Use inline few-limb arithmetic instead of GMP function calls in many places.

  • Use alternatives to the mpn format in appropriate situations:

    • ~60-bit limbs with delayed carries (vectorization opportunities; see the first sketch after this list)
    • ~50-bit limbs with floating-point FMA (even more vectorization opportunities, especially with AVX512-IFMA)
  • Use a flat byte-packed or limb-packed format for integer vectors (and by extension polys, matrices) instead of fmpz *. This would be more memory efficient when entries are smaller than one limb or occupy two or more limbs, it would avoid the branching and memory allocation overheads of the fmpz type, and it would facilitate vectorization.

  • Redesign fmpz to point directly to limb data instead of having an mpz in the middle (I've tried this before and it was slower, but I'm not convinced that I did it right).

  • Faster single-limb and few-limb modular arithmetic (see the second sketch after this list).

  • FFT code designed for short operands (a few thousand bits, up to the point where fft_small really becomes advantageous). The traditional way to do this is with very tightly coded floating-point FFTs, but NTTs might be competitive on recent CPUs when implemented very carefully. Joris van der Hoeven says he has good results with a codelet approach.

  • Implement optimized CRT/multi_mod code and use it everywhere. There are vectorization opportunities when batched. Also improve the asymptotically fast code.

  • Faster integer GCD and other operations based on fft_small.

  • #924

  • Faster generation of prime numbers.

  • Faster integer factorization.

  • Improve fft_small
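
As a from-scratch sketch of the delayed-carry representation (the first sub-bullet above; assumed code, nothing that exists in FLINT): 62-bit limbs stored in 64-bit words leave two bits of headroom, so up to three vector additions can be accumulated with a plain loop that compilers auto-vectorize, and carries are propagated only once at the end.

```c
#include <stdint.h>

#define LIMB_BITS 62
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* res += a entrywise with no carry propagation.  Starting from
   normalised inputs (all limbs < 2^62), up to three of these calls can
   be accumulated before the 64-bit words overflow. */
static void redundant_add(uint64_t *res, const uint64_t *a, long n)
{
    for (long i = 0; i < n; i++)   /* plain loop: auto-vectorizes */
        res[i] += a[i];
}

/* Propagate the delayed carries so every limb is < 2^62 again;
   returns the carry out of the top limb. */
static uint64_t normalise(uint64_t *res, long n)
{
    uint64_t cy = 0;
    for (long i = 0; i < n; i++)
    {
        uint64_t t = res[i] + cy;
        res[i] = t & LIMB_MASK;
        cy = t >> LIMB_BITS;
    }
    return cy;
}
```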
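
The single-limb modular arithmetic bullet above (and the "pretransformed operands" point in the General section) can be illustrated with Shoup-style multiplication by a fixed residue, where a quotient is precomputed once so that each product costs two multiplications and no division. FLINT already has functions in this spirit (n_mulmod_shoup, n_mulmod2_preinv); the version below is a simplified sketch assuming unsigned __int128 support and a modulus n < 2^63.

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Precompute floor(b * 2^64 / n) for a fixed multiplier b < n. */
static uint64_t mulmod_precomp(uint64_t b, uint64_t n)
{
    return (uint64_t)(((u128) b << 64) / n);
}

/* a * b mod n using the precomputed quotient: the estimate q is off by
   at most one, so the remainder lands in [0, 2n) and a single
   conditional subtraction finishes the job (needs n < 2^63). */
static uint64_t mulmod_shoup(uint64_t a, uint64_t b, uint64_t bpre,
                             uint64_t n)
{
    uint64_t q = (uint64_t)(((u128) a * bpre) >> 64);
    uint64_t r = a * b - q * n;   /* exact modulo 2^64 since r < 2n */
    return (r >= n) ? r - n : r;
}
```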

Linear algebra

  • #1508

  • #14

  • #1378

  • More efficient SIMD-vectorized basecase nmod linear algebra (see NTL, and the delayed-reduction sketch after this list).

  • #705

  • Faster LLL using the flatter algorithm + low-level optimizations. Note that flatter is now implemented in Pari/GP, which may be helpful as a reference.

  • #900

  • #389

  • #710

  • #901
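
For the SIMD-vectorized basecase nmod item above, the standard trick (cf. NTL) is to delay modular reductions so that the inner loop is a plain multiply-accumulate that compiles to integer SIMD. A minimal sketch, assuming residues below 2^30 stored in 32-bit words (not FLINT API):

```c
#include <stdint.h>

/* Dot product mod n with delayed reduction (sketch).  With n < 2^30
   each product is below 2^60, so sixteen of them plus a reduced
   accumulator still fit in 64 bits; the hot loop then contains no
   division and auto-vectorizes. */
static uint64_t nmod_dot(const uint32_t *a, const uint32_t *b,
                         long len, uint64_t n)
{
    uint64_t acc = 0;
    long i = 0;

    while (i < len)
    {
        long stop = (i + 16 < len) ? i + 16 : len;
        for (; i < stop; i++)
            acc += (uint64_t) a[i] * b[i];
        acc %= n;   /* one reduction per 16 terms */
    }
    return acc;
}
```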

Polynomials

  • Improve operations based on fft_small

  • #1144

  • Use alternative FFT/convolution algorithms in some situations

  • Implement basecase, Karatsuba and FFT-based middle products and use them in all appropriate places (e.g. Newton iteration); a basecase sketch is the first after this list.

  • Use optimal algorithms for polynomial division.

  • Many optimizations for power series functions.

  • Do multipoint evaluation and interpolation on arithmetic sequences via fast Newton basis conversions instead of the Lagrange formula (https://mathexp.eu/bostan/publications/BoSc05.pdf). This saves a constant factor and is potentially also better over approximate rings (needs checking).

  • Related to the above, implement generic product trees and consider balancing them (maybe have both balanced and non-balanced versions); a balanced-tree sketch is the second after this list.

  • Improve fmpz_poly_factor and related functions.

  • Use high-degree Toom multiplication in some cases.

  • #352 (largely solved by basing things on generics)

  • #676

  • #1177

  • #1210
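
For the middle product item above: the middle product of a (length na) by b (length nb <= na) is the slice c[nb-1..na-1] of the full product c = a*b, which is exactly what Newton iteration for power series inversion and division consumes, and fast variants (transposed Karatsuba/FFT) can compute it for roughly the cost of a half-size multiplication. A basecase sketch over the integers, with modular reductions omitted for brevity (illustrative, not FLINT API):

```c
#include <stdint.h>

/* res[i] = coefficient nb-1+i of a*b, for 0 <= i <= na-nb.  Only the
   wanted coefficients are touched, so this does (na-nb+1)*nb word
   multiplications instead of the na*nb of a full product. */
static void middle_product_basecase(uint64_t *res,
                                    const uint64_t *a, long na,
                                    const uint64_t *b, long nb)
{
    for (long i = 0; i <= na - nb; i++)
    {
        uint64_t s = 0;
        for (long j = 0; j < nb; j++)
            s += a[i + j] * b[nb - 1 - j];
        res[i] = s;
    }
}
```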
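
And for the product tree item, a minimal balanced sketch over fmpz, using only the public fmpz_init/fmpz_set/fmpz_mul/fmpz_clear API (the recursion itself is illustrative). Balancing keeps the operands of each multiplication near the root similarly sized, which is what lets fast multiplication pay off:

```c
#include "fmpz.h"

/* res = vec[0] * vec[1] * ... * vec[len-1], len >= 1, via balanced
   binary splitting: each half is multiplied recursively, so products
   near the root combine operands of comparable size. */
void prod_tree(fmpz_t res, const fmpz * vec, slong len)
{
    if (len == 1)
    {
        fmpz_set(res, vec);
        return;
    }

    fmpz_t left, right;
    fmpz_init(left);
    fmpz_init(right);

    prod_tree(left, vec, len / 2);
    prod_tree(right, vec + len / 2, len - len / 2);
    fmpz_mul(res, left, right);

    fmpz_clear(left);
    fmpz_clear(right);
}
```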

Finite fields

  • Packed (8-bit/16-bit/32-bit...) representations (a byte-packed sketch follows).
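
A minimal sketch of the packed idea, assuming p < 128 so that residues fit in bytes with headroom for one addition (illustrative, not FLINT API). Byte packing gives eight residues per 64-bit word, and the loop compiles to byte-lane SIMD:

```c
#include <stdint.h>

/* Entrywise addition in GF(p) on byte-packed vectors, p < 128.
   a[i] + b[i] <= 2(p-1) <= 252, so the byte sum cannot wrap, and the
   conditional subtraction becomes a branchless select under SIMD. */
static void gfp8_vec_add(uint8_t *res, const uint8_t *a,
                         const uint8_t *b, long len, uint8_t p)
{
    for (long i = 0; i < len; i++)
    {
        uint8_t s = a[i] + b[i];
        res[i] = (s >= p) ? s - p : s;
    }
}
```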

Reals

  • Fixed-precision (machine precision and few-word) floats, balls and intervals.

  • Make all arb/acb operations adaptive to the output accuracy, so that unneeded bits are not computed (similar to dot products).

  • Improve the block-based polynomial and matrix multiplication (output-adaptivity, less overhead, tuning).

  • Polynomial root-finding can be improved in numerous ways (both complex and real roots).

  • Use fixed-point mpn code in more places.

  • Elementary functions can be sped up 2-4x using better series evaluation and optimized lookup tables (sketched below).
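
To make the last point concrete, here is the lookup-table-plus-short-series scheme at double precision (purely illustrative; a production version would use fixed-point mpn arithmetic, a minimax polynomial, and proper special-case handling). Write x = m*log(2) + j/64 + t with |t| <= 1/128; then exp(x) = 2^m * exp(j/64) * exp(t), the middle factor comes from a 65-entry table, and a degree-4 Taylor polynomial for exp(t) already gives roughly 2^-40 relative error:

```c
#include <math.h>

static const double LOG2 = 0.6931471805599453;
static double exp_table[65];   /* exp(j/64) for j = -32..32 */

/* Call once before using exp_sketch. */
static void exp_table_init(void)
{
    for (int j = -32; j <= 32; j++)
        exp_table[j + 32] = exp(j / 64.0);
}

static double exp_sketch(double x)
{
    int m = (int) lround(x / LOG2);    /* power-of-two part */
    double r = x - m * LOG2;           /* |r| <= log(2)/2 */
    int j = (int) lround(r * 64.0);    /* table index, |j| <= 23 */
    double t = r - j / 64.0;           /* |t| <= 1/128 */

    /* exp(t) by a short Taylor series; the next term is t^5/120,
       about 2^-42 at |t| <= 1/128 */
    double p = 1.0 + t * (1.0 + t * (0.5 + t * (1.0/6.0 + t * (1.0/24.0))));

    return ldexp(exp_table[j + 32] * p, m);
}
```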