mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.

Home Page: https://torchsparse.mit.edu

About Sparse Kernel Generator in TorchSparse++ paper

99DHL opened this issue · comments

Thank you for your great work!
I came across TorchSparse++ (MICRO '23) and really enjoyed reading the paper. I have several questions about the sparse kernel generator it introduces.
According to the paper, the sparse kernel generator is a code generator that integrates on-chip MMA subroutines from TVM directly at the source code level. I am curious how this is possible. Could you provide more details about this code generator? Which parts of the kernel are auto-generated, and which parts are hand-written? Do I have to modify TVM to obtain on-chip MMA subroutines that can be used at the source code level (i.e., at the CUDA level)? If so, could you share the implementation of your code generator?

@ys-2020, could you please take a look at this problem when you have time? Thanks!

Hi @99DHL, thank you very much for your interest! We use the TVM GEMM template to get the on-chip MMA subroutines, which correspond to L159-L245 in the kernel. Starting from the MMA subroutines, we rewrote the DRAM access pointers rather than the MMA instructions to support sparse convolution.
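For illustration, here is a minimal sketch of that idea. This is not the actual generated kernel: it uses the WMMA API instead of raw PTX for brevity, the names (`gather_mma_scatter`, `in_map`, `out_map`, etc.) are hypothetical, and it assumes M, N, K are multiples of 16 with one warp (32 threads) per block. The point is only that the tensor-core MMA stays a dense subroutine, while the DRAM gather/scatter goes through the kernel map:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// WMMA tile shape (m16n16k16, fp16 inputs, fp32 accumulation).
constexpr int TM = 16, TN = 16, TK = 16;

// feats:   [num_in,  K]  input features (fp16, row-major)
// weight:  [K, N]        weights for one kernel offset (fp16, row-major)
// in_map / out_map: [M]  gather/scatter indices from the sparse kernel map
// out:     [num_out, N]  output features (fp32)
__global__ void gather_mma_scatter(const half* __restrict__ feats,
                                   const half* __restrict__ weight,
                                   const int* __restrict__ in_map,
                                   const int* __restrict__ out_map,
                                   float* __restrict__ out,
                                   int M, int N, int K) {
    __shared__ half  tileA[TM][TK];   // staged (gathered) input rows
    __shared__ float tileC[TM][TN];   // staged output tile for scatter

    const int row0 = blockIdx.x * TM;   // first logical output row
    const int col0 = blockIdx.y * TN;   // first output channel

    wmma::fragment<wmma::accumulator, TM, TN, TK, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k0 = 0; k0 < K; k0 += TK) {
        // The "rewritten DRAM access": input rows are fetched through the
        // kernel map (in_map) instead of a dense, contiguous row offset.
        for (int i = threadIdx.x; i < TM * TK; i += 32) {
            int r = i / TK, c = i % TK;
            tileA[r][c] = feats[in_map[row0 + r] * K + k0 + c];
        }
        __syncwarp();

        // The on-chip MMA subroutine itself is untouched dense math.
        wmma::fragment<wmma::matrix_a, TM, TN, TK, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, TM, TN, TK, half, wmma::row_major> b;
        wmma::load_matrix_sync(a, &tileA[0][0], TK);
        wmma::load_matrix_sync(b, weight + k0 * N + col0, N);
        wmma::mma_sync(acc, a, b, acc);
        __syncwarp();
    }

    // Scatter results back to DRAM through out_map.
    wmma::store_matrix_sync(&tileC[0][0], acc, TN, wmma::mem_row_major);
    __syncwarp();
    for (int i = threadIdx.x; i < TM * TN; i += 32) {
        int r = i / TN, c = i % TN;
        atomicAdd(&out[out_map[row0 + r] * N + col0 + c], tileC[r][c]);
    }
}
// Launch sketch: gather_mma_scatter<<<dim3(M / 16, N / 16), 32>>>(...);
```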

Thank you for your kind response!
If possible, could you please provide more information on the TVM GEMM template you used, such as relevant links or pages?

Hi! I think we just followed the TVM documentation to write the GEMM template and generate the PTX for the MMA subroutines. The major logic of our convolution kernel was redesigned.
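For context, at the CUDA source level such an on-chip MMA subroutine is essentially an inline-PTX `mma.sync` wrapper embedded in the generated code. Below is a generic sm_80 example of one m16n8k16 tensor-core MMA, not the exact code TVM emits; the surrounding loads and stores around calls like this are what the kernel generator rewrites:

```cuda
// One m16n8k16 tensor-core MMA (fp16 inputs, fp32 accumulation) as inline
// PTX. Register fragments (a: 4 x .b32 holding 8 halves, b: 2 x .b32,
// c/d: 4 x .f32) follow the warp-level layout defined in the PTX ISA.
__device__ __forceinline__ void mma_m16n8k16_f16f32(float d[4],
                                                    const unsigned a[4],
                                                    const unsigned b[2],
                                                    const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```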

Closing this issue as completed. Feel free to reopen it if you have any further questions.

@ys-2020 Sorry for jumping in on a closed issue, but I wanted to ask about something related. Does the fetch-on-demand dataflow also work with the on-chip MMA subroutines generated from TVM? I noticed the paper mentions that "Similar analysis and code transformation can also be applied to the fetch-on-demand dataflow," but it seems that only the implicit GEMM implementation uses the generated kernels.