mfrewer/mpFFT

mpFFT

Multiprecision Fast Fourier Transform

Synopsis

mpFFT is an open-source project to implement a high-performance multiprecision Fast Fourier Transform that can compete with non-free software such as Mathematica and MATLAB, in both serial and parallel computations.

The need for such a project is clear: The FFT routine is one of the main workhorses in scientific computing and usually runs as a subroutine in a larger program. Therefore, if a multiprecision computation or result is needed, it is natural to have an efficient and fast implementation that can be easily embedded into an existing project without requiring a redesign or migration onto a new (mostly non-free) platform.

Many fast open-source FFT packages are available for single- and double-precision, such as the projects from Ooura and GSL for scalar implementations, or, for the significantly faster vectorised implementations, the projects FFTW, FFTE and FFTS, to name just a few popular ones. For multiprecision computing, however, the need for fast FFT packages has not yet been met and great potential still exists, especially since the demand for multiprecision calculations has grown in recent years [Higham (2017)]. One of the reasons, for example, is that many scientific computing problems turn out on closer inspection to be hybrid, in that certain parts of the problem can be solved with lower precision, while other parts need higher precision.

Note that this project is still in development and far from complete. However, the first implementations available already show promising results: In serial computation with varying precision (up to 1 million digits), a speed-up of more than 2x over Mathematica and of 10x over MATLAB is achieved. For MATLAB an external toolbox had to be used, namely the only multiprecision toolbox available for FFT, the non-free ADVANPIX. For parallel computations with 4 CPU cores, a 10x speed-up over MATLAB is also achieved at high precision. For Mathematica, unfortunately, no comparison is possible, as it does not provide any multicore support for its FFT routine.

Design

mpFFT.c is the main program that selects the currently most efficient multiprecision implementations developed in lab for serial and parallel computation. It is the result of an ongoing development in lab to construct a high-performance multiprecision FFT algorithm through small independent modules, so-called codes *_mp_fft*.c, each of which can be compiled and tested separately. This allows for a better and finer control over which implementations and multiprecision arithmetic schemes are performance-hindering and which are performance-enhancing.

There will be a choice between split-radix algorithms of different orders, unscaled and scaled, different hard-coded base-case FFTs to terminate recursion, two different schemes of complex multiplication (4m2a, 3m3a), different indexing schemes of the twiddle-factor lookup tables, and finally a choice of 1, 2 or 3 butterfly operations. All these options will first be studied in double-precision, with the most efficient ones then serving as templates for the multiprecision implementations.
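As an illustration of the two complex-multiplication schemes, here is a minimal double-precision sketch. It assumes that 4m2a stands for 4 real multiplications and 2 additions, and 3m3a for 3 multiplications and 3 additions, with the twiddle-factor table storing c, d-c and c+d in the latter case. The function names cmul_4m2a and cmul_3m3a are hypothetical and not taken from the project's code.

```c
/* (a + i*b) * (c + i*d) with 4 multiplications and 2 additions. */
void cmul_4m2a(double a, double b, double c, double d,
               double *re, double *im)
{
    *re = a * c - b * d;
    *im = a * d + b * c;
}

/* Same product with 3 multiplications and 3 additions.
 * Assumption: the twiddle table already holds c, (d - c) and (c + d),
 * so these two sums cost nothing at run time. */
void cmul_3m3a(double a, double b, double c,
               double d_minus_c, double c_plus_d,
               double *re, double *im)
{
    double k1 = c * (a + b);      /* addition 1 */
    double k2 = a * d_minus_c;
    double k3 = b * c_plus_d;
    *re = k1 - k3;                /* addition 2: a*c - b*d */
    *im = k1 + k2;                /* addition 3: a*d + b*c */
}
```

In multiprecision, the 3m3a form trades one multiplication for one extra addition, which may pay off as the precision grows; whether it does in practice is the kind of question the separate test codes are meant to answer.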

For parallel computations there will be an additional choice between OpenMP and a self-designed threadpool based on POSIX threads (pthread.h).

Implementation

All codes in this project will be:

only 1D: The reason for this is to first find the most efficient FFT algorithm and implementation that suits multiprecision computation best, and this is naturally to be done in 1D. Once it is found, it can then form the basis for all higher-dimensional FFTs, since they are all just iterated 1D FFTs. To achieve efficient implementations, however, the problem then inevitably shifts to the parallelization issue where the main performance bottleneck in higher dimensions is the communication among the CPUs. Nevertheless, a fast 3D FFT relies on a fast 1D FFT.
only scalar: Vectorised (SIMD) instruction sets, which are relevant for single- and double-precision, lose their importance when calculating with multiprecision numbers going far beyond 64- or 128-bit, since the requested jobs are too large to fit into the CPU caches. The arithmetic efficiency of this project fully relies on the implementation of the underlying GNU library MPFR, which will be used throughout to perform all multiprecision computations.
only 2ⁿ: Input signal lengths of power 2 will be considered only.
only DIT: The FFT approach throughout this project will be based on the decimation-in-time (DIT) decomposition, where the indices of the input sequence are separated into even and odd classes. The dual approach, the decimation-in-frequency (DIF) where the output sequence is divided, will not be considered.
only split-radix: The reason for choosing the split-radix algorithm is its low complexity, since it aims to compute the FFT with the least number of multiplications. For single- and double-precision computations this complexity issue is not much of a concern, but it becomes highly relevant for multiprecision computations, as multiplications become increasingly more expensive than additions the higher the precision gets. Different classes and variations of the split-radix algorithm will be implemented and tested to find the most efficient one for multiprecision.
only recursive: The reason for choosing a recursive rather than an iterative scheme is the advantage of memory locality [Frigo & Johnson (1998)], which is a critical component for fast multiprecision computations. In addition, a recursive depth-first scheme eliminates the need for the computationally expensive bit-reversal permutation of the indices, since the recursion already performs the permutation implicitly.

Overall, the recursive DIT-split-radix implementation here follows the notation and derivation as concisely and directly given in Wikipedia: Split-radix FFT algorithm.
only complex-split format: The real and imaginary part of complex data will be stored in separate arrays. The interleaved format, where the real and imaginary part are stored adjacently through a single array in memory, is not used.
only out-of-place: Two different arrays in and out are used for input and output. The input is not overwritten by the output as the program executes. Due to the complex-split format used, the input and output are thus controlled by four separate arrays: inr, ini for the real and imaginary input, and outr, outi for the real and imaginary output.
only complex input: Only complex-to-complex FFTs will be considered, where both input arrays inr and ini are non-zero. Generating efficient real-to-complex FFTs, where ini=0, is straightforward in the setting of this project: Due to the DIT-split-radix and complex-split format used, the input array ini only needs to be removed from the code, while the operations for the complex output array X_k := outr[k] + I*outi[k] can easily be reduced by half by recognizing the redundancy in the computation of the elements X_{k+2N/4} = (X_{(N/4-k)+N/4})* and X_{k+3N/4} = (X_{N/4-k})*. That is, X_{k+2N/4} and X_{k+3N/4} can be determined from X_{k+1N/4} and X_{k+0N/4}, respectively, due to the Hermitian symmetry of the discrete Fourier transform (DFT) for real input: X_k = (X_{N-k})*, where 0≤k<N/4, N is the input length and * denotes the complex conjugate. This holds both for the classical and the conjugate-pair split-radix algorithm.

Note that while the decimation-in-time (DIT) approach is the natural choice for the real-to-complex FFT, the dual decimation-in-frequency (DIF) approach is the natural choice for the inverse complex-to-real FFT [Sorensen et al. (1987)].
only forward direction: All FFTs are performed here only in the forward direction, with the weight factor (root-of-unity) defined as ω_N = e^(-2πi/N). The inverse FFT can be obtained straightforwardly from the forward FFT without any additional computational cost, when efficiently implemented as presented in Duhamel et al. (1988). The inverse FFT will be used herein for error analysis.
arbitrary precision: No restrictions will be imposed on the order of precision; it can be set arbitrarily. The upper (binary) precision limit is dictated through long int MPFR_PREC_MAX, and on a 64-bit machine given by: (2^(64-1) - 1) - 2^8, which is about 9·10¹⁸. However, since MPFR needs to increase the precision internally, in order to provide accurate results and correct rounding, it is not recommended to set the precision to any value near MPFR_PREC_MAX.

Compiling

All builds used gcc (v9.3.0), with full optimization flag -O3, and static linking -lmpfr -lgmp -lm to the GNU libraries MPFR (v4.1.0), GMP (v6.2.1) and the standard C math-library. For parallel computations the extra flags -pthread -fopenmp need to be included.

Make sure that MPFR is installed 'thread-safe' in order to run correctly and reliably in parallel. This is done by setting the option --enable-thread-safe during installation.

In the current stage, the project's main code can be compiled and run with
gcc -O3 mpFFT.c -fopenmp -lmpfr -lgmp -lm -o mpFFT && ./mpFFT

The codes in lab are divided into several categories, each containing a *_main.c file that can be compiled independently of all the others, together with the project's header file mpFFT.h. Within each category, different implementations and inputs can be selected in *_main.c through the macros #define CODE and #define IN. For more details, please see the accompanying OPTIONS- and README-files.

Documentation

For further information and more details on particular implementations and their benchmarks, please see the other README-files in each of the different categories of this project. A detailed LaTeX-written documentation on all theoretical aspects and mathematical underpinnings of this project will follow in due course.

License

The content of this project itself is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License, and all underlying source codes to compile and to run this project are licensed under the Apache-2.0 License.

Other multiprecision FFT implementations

mpfft: MPFR FFT radix-2 functions in C: https://github.com/urrfinjuss/mpfft
