mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Home Page: https://torchsparse.mit.edu

About Sparse Kernel Generator in TorchSparse++ paper

99DHL opened this issue · comments

Thank you for your great work!
I came across TorchSparse++ (MICRO '23) and really enjoyed reading the paper. I have several questions about the sparse kernel generator it introduces.
According to the paper, the sparse kernel generator is a code generator that integrates on-chip MMA subroutines from TVM directly at the source code level. I am curious how this is possible. Could you provide more details about this code generator? Which parts of the kernel are auto-generated, and which parts are hand-written? Do I have to modify TVM to obtain on-chip MMA subroutines that can be used at the source code level (i.e., at the CUDA level)? If so, could you share the implementation of your code generator?

@ys-2020, could you please take a look at this problem when you have time? Thanks!

Hi @99DHL, thank you very much for your interest! We use the TVM GEMM template to get the on-chip MMA subroutines, which correspond to L159-L245 in the kernel. Starting from the MMA subroutines, we rewrote the DRAM access pointers rather than the MMA instructions to support sparse convolution.
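For illustration, here is a minimal sketch of that idea. This is not the actual generated kernel: it uses the WMMA API instead of raw PTX for brevity, the names (`gather_mma_scatter`, `in_map`, `out_map`, etc.) are hypothetical, and it assumes M, N, K are multiples of 16 with one warp (32 threads) per block. The point is only that the tensor-core MMA stays a dense subroutine, while the DRAM gather/scatter goes through the kernel map:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// WMMA tile shape (m16n16k16, fp16 inputs, fp32 accumulation).
constexpr int TM = 16, TN = 16, TK = 16;

// feats:   [num_in,  K]  input features (fp16, row-major)
// weight:  [K, N]        weights for one kernel offset (fp16, row-major)
// in_map / out_map: [M]  gather/scatter indices from the sparse kernel map
// out:     [num_out, N]  output features (fp32)
__global__ void gather_mma_scatter(const half* __restrict__ feats,
                                   const half* __restrict__ weight,
                                   const int* __restrict__ in_map,
                                   const int* __restrict__ out_map,
                                   float* __restrict__ out,
                                   int M, int N, int K) {
    __shared__ half  tileA[TM][TK];   // staged (gathered) input rows
    __shared__ float tileC[TM][TN];   // staged output tile for scatter

    const int row0 = blockIdx.x * TM;   // first logical output row
    const int col0 = blockIdx.y * TN;   // first output channel

    wmma::fragment<wmma::accumulator, TM, TN, TK, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k0 = 0; k0 < K; k0 += TK) {
        // The "rewritten DRAM access": input rows are fetched through the
        // kernel map (in_map) instead of a dense, contiguous row offset.
        for (int i = threadIdx.x; i < TM * TK; i += 32) {
            int r = i / TK, c = i % TK;
            tileA[r][c] = feats[in_map[row0 + r] * K + k0 + c];
        }
        __syncwarp();

        // The on-chip MMA subroutine itself is untouched dense math.
        wmma::fragment<wmma::matrix_a, TM, TN, TK, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, TM, TN, TK, half, wmma::row_major> b;
        wmma::load_matrix_sync(a, &tileA[0][0], TK);
        wmma::load_matrix_sync(b, weight + k0 * N + col0, N);
        wmma::mma_sync(acc, a, b, acc);
        __syncwarp();
    }

    // Scatter results back to DRAM through out_map.
    wmma::store_matrix_sync(&tileC[0][0], acc, TN, wmma::mem_row_major);
    __syncwarp();
    for (int i = threadIdx.x; i < TM * TN; i += 32) {
        int r = i / TN, c = i % TN;
        atomicAdd(&out[out_map[row0 + r] * N + col0 + c], tileC[r][c]);
    }
}
// Launch sketch: gather_mma_scatter<<<dim3(M / 16, N / 16), 32>>>(...);
```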

Thank you for your kind response!
If possible, could you please provide more information on the TVM GEMM template you used, such as relevant links or pages?

Hi! I think we just followed the TVM documentation to write the GEMM template and generate the PTX for the MMA subroutines. The major logic of our convolution kernel was redesigned.
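For context, at the CUDA source level such an on-chip MMA subroutine is essentially an inline-PTX `mma.sync` wrapper embedded in the generated code. Below is a generic sm_80 example of one m16n8k16 tensor-core MMA, not the exact code TVM emits; the surrounding loads and stores around calls like this are what the kernel generator rewrites:

```cuda
// One m16n8k16 tensor-core MMA (fp16 inputs, fp32 accumulation) as inline
// PTX. Register fragments (a: 4 x .b32 holding 8 halves, b: 2 x .b32,
// c/d: 4 x .f32) follow the warp-level layout defined in the PTX ISA.
__device__ __forceinline__ void mma_m16n8k16_f16f32(float d[4],
                                                    const unsigned a[4],
                                                    const unsigned b[2],
                                                    const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```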

Closing this issue as completed. Feel free to reopen it if you have any further questions.

@ys-2020 Sorry for jumping in on a closed issue, but I wanted to ask about something related. Does the fetch-on-demand dataflow also work with the on-chip MMA subroutines generated from TVM? I noticed the paper mentions that "Similar analysis and code transformation can also be applied to the fetch-on-demand dataflow," but it seems that only the implicit GEMM implementation uses the generated kernels.