owensgroup / merge-spmm

Code for paper "Design Principles for Sparse Matrix Multiplication on the GPU" accepted to Euro-Par 2018

question about merge-path SpMM implementation

shadow150519 opened this issue · comments

Hello, I am reading your paper and code, and I have some questions about the merge-path SpMM.
(1) In my understanding, limit[i] means partition i should start from row limit[i] of A and end at row limit[i+1], but how do I find which element it should start from? I can't understand what end and row_ind mean in your algorithm.
(2) What is the granularity of merge-path SpMM? In row-split, a warp is responsible for a row of A. In your paper you say a thread is responsible for T work. Does T work mean T nnz of A?
(3) How is the partial sum finally reduced, since each thread is responsible for T work which might come from different rows of A?

Hi shadow150519,

(1) In my understanding, limit[i] means partition i should start from row limit[i] of A and end at row limit[i+1], but how do I find which element it should start from? I can't understand what end and row_ind mean in your algorithm.
Suppose you have a CSR array A.row_ptr = { 0, 957, 1008, 2000, 2312 }, and suppose you are trying to get each block to do 1000 calculations, i.e. block boundaries { 0, 1000, 2000 }. Then you should expect to get limits = { 0, 1, 3 } and offsets = { 0, 43, 0 } by running PartitionSpmm, which finds those values by binary searching (i.e. MergePath when it's run in parallel) with the second array over A.row_ptr [1]. This means Block 0 is responsible for Row 0, Block 1 for Rows 1 and 2, and Block 2 for Row 3. Now A.row_ptr[limits[1]] = 957 is correctly identified as the starting index Block 1 should use. Each thread of Block 1 will now know which values it has to look up.
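To make those numbers concrete, here is a minimal host-side sketch of the same binary search over A.row_ptr (it is not the repo's PartitionSpmm kernel, just an illustration of how limits and offsets fall out of the example above):

```cpp
#include <vector>
#include <algorithm>
#include <cstdio>

int main() {
  std::vector<int> row_ptr  = {0, 957, 1008, 2000, 2312};  // CSR row pointers of A
  std::vector<int> boundary = {0, 1000, 2000};             // nnz each block starts at

  for (size_t b = 0; b < boundary.size(); ++b) {
    // Largest row i with row_ptr[i] <= boundary[b]
    int limit = int(std::upper_bound(row_ptr.begin(), row_ptr.end(),
                                     boundary[b]) - row_ptr.begin()) - 1;
    int offset = boundary[b] - row_ptr[limit];  // how far into that row the block starts
    std::printf("Block %zu: limit = %d, offset = %d\n", b, limit, offset);
  }
  return 0;
}
// Prints: limits = {0, 1, 3}, offsets = {0, 43, 0}, matching the example.
```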

This is not mentioned in the paper, but generally the look-up is done as follows:
Block 1, Thread 0 loads A.col_ind[1000] (since A.row_ptr[limits[1]] + offsets[1] + threadIdx.x = 957 + 43 + 0 = 1000) and saves it to shared memory
Block 1, Thread 1 loads A.col_ind[1001] (since A.row_ptr[limits[1]] + offsets[1] + threadIdx.x = 957 + 43 + 1 = 1001) and saves it to shared memory
...
Block 1, Thread 31 loads A.col_ind[1031] and saves it to shared memory
Block 1, Thread 0 loads A.col_ind[1032] and saves it to shared memory
Block 1, Thread 1 loads A.col_ind[1033] and saves it to shared memory
...
Block 1, Thread 7 loads A.col_ind[1999] and saves it to shared memory
Block 1, Thread 8 saves 0 to shared memory
Block 1, Thread 9 saves 0 to shared memory
...
Block 1, Thread 31 saves 0 to shared memory

This is how that part works. For row_ind, you can think of it as threadIdx.x. For end, it is only important for the last Block, because that is the only Block that is limited by how many nnz there are. In this example, the last Block needs to check that it does not read past index 2311 (since nnz = 2312).
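Here is a hypothetical device-side sketch of that strided load (the kernel and parameter names are illustrative, not the repo's): each block's 32 threads copy the block's slice of A.col_ind into shared memory in passes, padding with 0 once the block's range is exhausted.

```cpp
__global__ void LoadColIndSketch(const int* col_ind,
                                 const int* block_start,   // e.g. {0, 1000, 2000}
                                 int        items_per_block, // e.g. 1000
                                 int        nnz)             // e.g. 2312
{
  extern __shared__ int s_col_ind[];                // items_per_block entries
  int start = block_start[blockIdx.x];
  int end   = min(start + items_per_block, nnz);    // 'end' only matters for the last block

  for (int j = threadIdx.x; j < items_per_block; j += blockDim.x) {
    int i = start + j;
    s_col_ind[j] = (i < end) ? col_ind[i] : 0;      // zero-pad past the block's end
  }
  __syncthreads();
  // ... gather the corresponding rows of B and do the block-level segmented reduce ...
}
```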

(2) What is the granularity of merge-path SpMM? In row-split, a warp is responsible for a row of A. In your paper you say a thread is responsible for T work. Does T work mean T nnz of A?

On page 12, I mention that due to register limitations, in practice I had to keep T very low, i.e. T = 1, in order to keep the register count reasonable and not overflow the register file.

(3) How is the partial sum finally reduced, since each thread is responsible for T work which might come from different rows of A?

Each thread being responsible for T work is okay, because that gets resolved within each Block by doing a segmented reduce at the Block level using shared memory. However, your point does become an issue across Blocks. That's why on line 22, ReduceToGlobalSpmm has to write both the partial results (C) and the carry-outs (i.e. incomplete rows) to global memory. Then, for long rows that were processed by more than one Block, FixCarryout needs to be run in order to reconcile them using the partial results (C) and the carry-outs.
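Conceptually, the fix-up step boils down to something like the following host-side sketch (hypothetical names; the repo runs this as the FixCarryout kernel on the GPU): each block leaves the partial sum of its last, unfinished row in a carry-out buffer, and a second pass adds that vector into the right row of C.

```cpp
#include <vector>

void FixCarryoutSketch(int num_blocks, int n_cols,
                       const std::vector<int>&   carry_row,  // row each block's carry-out belongs to
                       const std::vector<float>& carry_val,  // num_blocks x n_cols partial sums
                       std::vector<float>&       C,          // dense output, row-major
                       int ldc)
{
  for (int b = 0; b < num_blocks; ++b)
    for (int j = 0; j < n_cols; ++j)
      C[carry_row[b] * ldc + j] += carry_val[b * n_cols + j];  // merge partial row into C
}
```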

[1] MergePath paper: https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf

Hope that answers your questions!

Thanks for your answer! I still have some questions.

Suppose you are trying to get each block to do 1000 calculations, i.e. block boundaries { 0, 1000, 2000 }. Then you should expect to get limits = { 0, 1, 3 } and offsets = { 0, 43, 0 } by running PartitionSpmm, which finds those values by binary searching (i.e. MergePath when it's run in parallel) with the second array over A.row_ptr [1].

I can't find the offset array in the function PartitionCsrSegReducePrealloc, so how do I get the offset array?

What's more, what does tb mean?

("tb", po::value<int>()->default_value(32), "B slab width")