Question about discrepancy between implementations available in the repo and related papers
sylee0124 opened this issue · comments
Hi, I'm a bit confused about the current implementations in the repo versus the implementations used/discussed in the related papers. I'll just state what I think is true. Please correct me if I'm wrong.
- FlashConv from H3: the fused kernel is implemented in `fftconv_cuda.cu`, but it does not use block FFT.
- FlashButterfly from "Simple Hardware-Efficient Long Convolutions for Sequence Modeling": `long_conv.py` uses `BlockFFT` (which is the same as the butterfly decomposition) with support for learnable parameters for `dft_matrix`, but it does not use a fused kernel, and the three-pass algorithm is also not implemented.
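For reference, by "block FFT" I mean the usual Cooley-Tukey / four-step decomposition, where a length-N FFT (N = N1·N2) is computed as two batches of small FFTs plus a twiddle-factor multiply, so each small FFT can stay in fast on-chip memory. A minimal NumPy sketch of the idea (illustrative only, not the repo's kernel):

```python
import numpy as np

def block_fft(x, N1, N2):
    """Length-N FFT (N = N1*N2) via the Cooley-Tukey four-step
    decomposition: inner FFTs of size N2, a twiddle-factor multiply,
    then outer FFTs of size N1. This is the block-FFT / butterfly
    idea; illustrative sketch only, not the repo's CUDA kernel."""
    N = N1 * N2
    assert x.shape[-1] == N
    # A[n1, n2] = x[n1 + N1*n2]
    A = x.reshape(N2, N1).T
    # Inner FFTs of size N2 along each row
    B = np.fft.fft(A, axis=1)
    # Twiddle factors W_N^{n1*k2}
    n1 = np.arange(N1)[:, None]
    k2 = np.arange(N2)[None, :]
    C = B * np.exp(-2j * np.pi * n1 * k2 / N)
    # Outer FFTs of size N1 along each column
    D = np.fft.fft(C, axis=0)
    # X[k2 + N2*k1] = D[k1, k2]
    return D.ravel()

x = np.random.default_rng(0).standard_normal(16)
print(np.allclose(block_fft(x, 4, 4), np.fft.fft(x)))  # True
```

My understanding is that FlashButterfly's point is to fuse these small FFT stages (and the pointwise multiply of the FFT convolution) into one kernel, which is the part I don't see in `long_conv.py`.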
Thanks for verifying :)
When can I expect this performance update? Will it happen anytime soon?