JuliaMath / Hadamard.jl

Fast Walsh-Hadamard transforms for the Julia language

use FFHT code

stevengj opened this issue

The https://github.com/FALCONN-LIB/FFHT repository implements a specialized SIMD FHT, which it says is faster than FFTW for this problem. (Totally possible, since 2×2×2×⋯ multidimensional transforms are not that optimized in FFTW — they are using size-2 DFT kernels, whereas to be fast they should really use larger multidimensional kernels.) Might be interesting to support using FFHT (which is MIT-licensed) in cases where it is applicable.
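A quick way to see this structure: the length-2^k WHT is exactly the k-dimensional DFT of the input reshaped into a 2×2×⋯×2 array, because each size-2 DFT factor is the [1 1; 1 -1] butterfly, so the full transform matrix is the k-fold Kronecker product H₂ ⊗ ⋯ ⊗ H₂. A minimal Julia check (variable names are illustrative):

```julia
using FFTW, LinearAlgebra

k = 4
n = 2^k
x = randn(n)

# WHT via FFTW: the multidimensional size-2 DFT of the reshaped vector
y_fft = vec(fft(reshape(x, ntuple(_ -> 2, k))))

# WHT via the explicit Hadamard matrix H₂ ⊗ H₂ ⊗ ⋯ ⊗ H₂
H = reduce(kron, fill([1 1; 1 -1], k))
y_mat = H * x

@assert y_fft ≈ y_mat  # equal up to roundoff (y_fft is complex-typed)
```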

I'll compare that with my SSE2 SIMD code, thanks for the link. There are a bunch of things you can do with the WHT that are not well known, mainly stemming from fast random projection algorithms based on the WHT:
https://github.com/S6Regen?tab=repositories
It also seems the very simple out-of-place WHT is hardly known: you go through the input vector pairwise and sequentially write the sum of each pair into the lower half of a new array and the difference into the upper half, then repeat for a total of log₂(n) passes (see the sketch below). It is perhaps the least-known simple algorithm in computer science.
http://randomprojectionai.blogspot.com/
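For concreteness, here is a minimal Julia sketch of that out-of-place algorithm (the function name is made up for illustration). Each pass writes the pairwise sums into the lower half of a scratch buffer and the pairwise differences into the upper half; after log₂(n) passes the buffer holds the unnormalized, natural-ordered WHT:

```julia
# Out-of-place WHT: per pass, sums of adjacent pairs go to the lower
# half of the scratch buffer and differences to the upper half; the
# buffers are then swapped. log2(n) passes give the natural-order WHT.
function wht_outofplace(x::AbstractVector{<:Real})
    n = length(x)
    ispow2(n) || throw(ArgumentError("length must be a power of 2"))
    a = float(copy(x))
    b = similar(a)
    h = n ÷ 2
    for _ in 1:trailing_zeros(n)         # log2(n) passes
        for i in 1:h
            b[i]     = a[2i-1] + a[2i]   # sums -> lower half
            b[h + i] = a[2i-1] - a[2i]   # differences -> upper half
        end
        a, b = b, a                      # swap buffers
    end
    return a
end
```

For example, `wht_outofplace([1.0, 2.0, 3.0, 4.0])` returns `[10.0, -2.0, -4.0, 0.0]`, matching the rows of the 4×4 Hadamard matrix in natural order.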

When I ran the FFHT benchmarking code on my cheap laptop I got 5200 65536-point WHTs per second per core. With my code I get 5000 per core.

My legacy SSE code: https://drive.google.com/open?id=1bhQi4BybXPkvCENIYQcdw9PHAwS9Ocam

I see all the links I have given are broken if you just click on them; it seems you have to copy and paste them into the browser's address bar for them to work!

Anyway, to get GPU-comparable results you would definitely need, say, one of the new 32-core AMD CPUs. From the FFHT result tables that would get you to around 1,000,000 65536-point WHTs per second. A big advantage of CPUs is that they are so much easier to program than GPUs. I don't know what the situation is with the new GPU tensor cores; if you could use them for the WHT you might reach 10 million+ 65536-point WHTs per second, presuming the memory bandwidth could handle that. That would be 1000 times better than I can do on my 2-core laptop (no hard drive, broken keyboard, CPU heat-sink problems, runs off a cheap USB stick, etc.). I could then evolve a neural network in 86.4 seconds that currently takes me a day.

commented

I did a quick benchmark for the 1D case; it's probably not perfect, but it suggests that FFHT is one order of magnitude faster, and even a bit more than that in low dimensions. Here are the results (Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz).

[results: benchmark plot]

Each time I create a matrix A = randn(d, 1000) and compare @belapsed fwht_natural!($(A),dims=1) against running fht_double on each column of A (with a simple loop, sketched below; one could probably parallelize it). They also provide an out-of-place version, which I guess is better than copying first.
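For reference, here is a hedged sketch of the FFHT side of that comparison, assuming FFHT has been compiled into a shared library named `libffht` exposing `int fht_double(double *buf, int log_n)` (the in-place double-precision transform listed in the FFHT README); the loop and names are illustrative, not the benchmark actually used:

```julia
using BenchmarkTools

const libffht = "libffht"  # assumed name/path of the compiled FFHT library

# Apply the in-place FFHT to each column of A; log_n = log2(size(A, 1)).
# Julia matrices are column-major, so each column is contiguous in memory
# and can be handed to the C routine by pointer.
function fht_columns!(A::Matrix{Float64}, log_n::Int)
    n = size(A, 1)
    GC.@preserve A for j in 1:size(A, 2)
        p = pointer(A, (j - 1) * n + 1)
        ccall((:fht_double, libffht), Cint, (Ptr{Cdouble}, Cint), p, log_n)
    end
    return A
end

d = 1024
A = randn(d, 1000)
t_ffht = @belapsed fht_columns!(B, $(Int(log2(d)))) setup=(B = copy($A)) evals=1
```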
For reference, there is also the Spiral version (http://www.spiral.net/software/wht.html) and this other SIMD C++ implementation: https://github.com/curto2/mckernel/, but I didn't have time to compare them.
So I don't know what the best option is: using FFHT only, using FFHT + FFTW for unsupported cases (maybe multidimensional ones?), or even creating a separate package for FFHT?

I would continue to use FFTW (because it probably handles more cases), and add FFHT for the cases it supports. One would first need to make a BinaryBuilder package for FFHT.
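For what it's worth, a minimal `build_tarballs.jl` sketch of such a BinaryBuilder recipe could look like the following. The commit hash is a placeholder, and the library name `libffht` and the single-file build command are assumptions about how FFHT would be compiled (a real recipe would also need per-platform SIMD flags):

```julia
# Hypothetical BinaryBuilder recipe for FFHT; names, commit, and build
# commands are illustrative assumptions, not a tested recipe.
using BinaryBuilder

name = "FFHT"
version = v"1.0.0"

sources = [
    GitSource("https://github.com/FALCONN-LIB/FFHT.git",
              "0000000000000000000000000000000000000000"),  # placeholder commit
]

script = raw"""
cd $WORKSPACE/srcdir/FFHT
# Assumed single-file build of the core transform into a shared library.
${CC} -O3 -fPIC -shared -o libffht.${dlext} fht.c
install -Dvm 755 libffht.${dlext} ${libdir}/libffht.${dlext}
"""

platforms = supported_platforms()

products = [
    LibraryProduct("libffht", :libffht),
]

dependencies = Dependency[]

build_tarballs(ARGS, name, version, sources, script, platforms, products, dependencies)
```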