owensgroup / merge-spmm

Hi all, I found a bug in reading symmetric mtx data.
For example, when I run

./gspmm --debug=true --max_ncols=4 ./4_4coo_dense.mtx

to read a 4x4 dense matrix from 4_4coo_dense.mtx, it loads a broken matrix if the file is stored in symmetric format.

Wrong Results

%%MatrixMarket matrix coordinate real symmetric
%
4 4 10
1 1 1
2 1 1
2 2 1
3 1 1
3 2 1
3 3 1
4 1 1
4 2 1
4 3 1
4 4 1

ta: 32
tb: 32
nt: 128
row: 1
debug: 1
%%MatrixMarket matrix coordinate real symmetric
4 4 13
csrColInd:
[0]:0 [1]:1 [2]:2 [3]:3 [4]:0 [5]:2 [6]:3 [7]:0 [8]:1 [9]:3 [10]:0 [11]:1 [12]:2 [13]:0 [14]:4113 [15]:0 [16]:0 [17]:0 [18]:0 [19]:0 [20]:0 [21]:0 [22]:0 [23]:0 [24]:0 [25]:0 [26]:0 [27]:0 [28]:0 [29]:0 [30]:0 [31]:0 [32]:0 [33]:0 [34]:0 [35]:0 [36]:0 [37]:0 [38]:0 [39]:0
csrRowPtr:
[0]:0 [1]:4 [2]:7 [3]:10 [4]:13 [5]:14 [6]:81 [7]:0 [8]:0 [9]:0 [10]:0 [11]:0 [12]:1 [13]:1 [14]:1 [15]:2 [16]:2 [17]:2 [18]:3 [19]:3 [20]:3 [21]:35143 [22]:3 [23]:3 [24]:35143 [25]:-1456 [26]:81 [27]:0 [28]:0 [29]:1 [30]:2 [31]:3 [32]:0 [33]:2 [34]:3 [35]:0 [36]:1 [37]:3 [38]:0 [39]:1
csrVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:1 [6]:1 [7]:1 [8]:1 [9]:1 [10]:1 [11]:1 [12]:1 [13]:1 [14]:9.10844e-44 [15]:0 [16]:1.51901e-38 [17]:0 [18]:1.49695e-38 [19]:0 [20]:2.69808e-38 [21]:0 [22]:0 [23]:1.44118e+17 [24]:1.05553e+14 [25]:4.58715e-41 [26]:2.93874e-39 [27]:0 [28]:0 [29]:0 [30]:2.03188e-43 [31]:0 [32]:1.23145e+14 [33]:4.58715e-41 [34]:2.93874e-38 [35]:0 [36]:0 [37]:0 [38]:0 [39]:0
pretty print:
x x x x
x 0 x x
x x 0 x
x x x 0
mxm: 0.036416 ms
denseVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:0 [6]:1 [7]:1 [8]:1 [9]:1 [10]:0 [11]:1 [12]:1 [13]:1 [14]:1 [15]:0
x x x x
x 0 x x
x x 0 x
x x x 0
There were 0 errors out of 13.

Correct Results

%%MatrixMarket matrix coordinate real general
%
4 4 16
1 1 1
1 2 1
1 3 1
1 4 1
2 1 1
2 2 1
2 3 1
2 4 1
3 1 1
3 2 1
3 3 1
3 4 1
4 1 1
4 2 1
4 3 1
4 4 1

ta: 32
tb: 32
nt: 128
row: 1
debug: 1
%%MatrixMarket matrix coordinate real general
4 4 16
csrColInd:
[0]:0 [1]:1 [2]:2 [3]:3 [4]:0 [5]:1 [6]:2 [7]:3 [8]:0 [9]:1 [10]:2 [11]:3 [12]:0 [13]:1 [14]:2 [15]:3 [16]:35143 [17]:96 [18]:81 [19]:0 [20]:1065353216 [21]:1065353216 [22]:1065353216 [23]:1065353216 [24]:1065353216 [25]:1065353216 [26]:1065353216 [27]:1065353216 [28]:1065353216 [29]:1065353216 [30]:1065353216 [31]:1065353216 [32]:1065353216 [33]:1065353216 [34]:1065353216 [35]:1065353216 [36]:4 [37]:1065353216 [38]:81 [39]:0
csrRowPtr:
[0]:0 [1]:4 [2]:8 [3]:12 [4]:16 [5]:0 [6]:33 [7]:0 [8]:0 [9]:0 [10]:4482352 [11]:0 [12]:-1672595478 [13]:7953 [14]:33 [15]:0 [16]:42126000 [17]:0 [18]:42126096 [19]:0 [20]:32 [21]:0 [22]:49 [23]:0 [24]:0 [25]:0 [26]:34833056 [27]:0 [28]:42189248 [29]:0 [30]:1638970554 [31]:1868983913 [32]:203358240 [33]:7967 [34]:49 [35]:0 [36]:0 [37]:0 [38]:41772304 [39]:0
csrVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:1 [6]:1 [7]:1 [8]:1 [9]:1 [10]:1 [11]:1 [12]:1 [13]:1 [14]:1 [15]:1 [16]:5.60519e-45 [17]:1 [18]:1.13505e-43 [19]:0 [20]:0 [21]:0 [22]:1 [23]:1 [24]:1 [25]:1 [26]:1 [27]:1 [28]:1 [29]:1 [30]:1 [31]:1 [32]:1 [33]:1 [34]:1 [35]:1 [36]:2.95797e+17 [37]:4.58631e-41 [38]:2.70451e-43 [39]:0
pretty print:
x x x x
x x x x
x x x x
x x x x
mxm: 0.033792 ms
denseVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:1 [6]:1 [7]:1 [8]:1 [9]:1 [10]:1 [11]:1 [12]:1 [13]:1 [14]:1 [15]:1
x x x x
x x x x
x x x x
x x x x
There were 0 errors out of 16.
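
A note for reading the dumps above: the debug output seems to print 40 slots of every array regardless of its real size, so for a 4x4 matrix only the first 5 entries of csrRowPtr (nrows + 1) and the first nnz entries of csrColInd and csrVal appear to be meaningful. Everything after that looks like uninitialized memory. Below is a minimal sketch of a COO-to-CSR conversion that makes those sizes explicit. The `Csr` struct and `coo_to_csr` function are hypothetical names for illustration, not the ones used in merge-spmm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for illustration; not the type used in merge-spmm.
struct Csr {
  int nrows = 0;
  std::vector<int> row_ptr;  // nrows + 1 entries
  std::vector<int> col_ind;  // nnz entries
  std::vector<float> val;    // nnz entries
};

// Build CSR from 0-based COO triples with a counting sort on the row index.
Csr coo_to_csr(int nrows, int ncols, const std::vector<int>& rows,
               const std::vector<int>& cols, const std::vector<float>& vals) {
  assert(rows.size() == cols.size() && rows.size() == vals.size());
  const std::size_t nnz = rows.size();
  Csr m;
  m.nrows = nrows;
  m.row_ptr.assign(static_cast<std::size_t>(nrows) + 1, 0);
  m.col_ind.resize(nnz);
  m.val.resize(nnz);

  // Count entries per row, shifted by one so the prefix sum yields offsets.
  for (std::size_t k = 0; k < nnz; ++k) {
    assert(rows[k] >= 0 && rows[k] < nrows);
    assert(cols[k] >= 0 && cols[k] < ncols);
    ++m.row_ptr[rows[k] + 1];
  }
  for (int r = 0; r < nrows; ++r) m.row_ptr[r + 1] += m.row_ptr[r];

  // Scatter each entry into its row, keeping input order within a row.
  std::vector<int> next(m.row_ptr.begin(), m.row_ptr.end() - 1);
  for (std::size_t k = 0; k < nnz; ++k) {
    const int dst = next[rows[k]]++;
    m.col_ind[dst] = cols[k];
    m.val[dst] = vals[k];
  }
  return m;
}
```

For the 16-entry general file, this yields a row_ptr of 0, 4, 8, 12, 16, which matches the first five csrRowPtr slots in the "Correct Results" dump.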

Hi @wyatuestc, I tried on the latest commit from master (e37dea0) and I was unable to reproduce the error you got. Here's what I got:

ctcyang@mario:~/merge-spmm/build$ bin/gspmm --debug=true --max_ncols=4 ../4_4coo_dense.mtx
ta:    32
tb:    32
nt:    128
row:   1
debug: 1
%%MatrixMarket matrix coordinate real general
4 4 16
csrColInd:
[0]:0 [1]:1 [2]:2 [3]:3 [4]:0 [5]:1 [6]:2 [7]:3 [8]:0 [9]:1 [10]:2 [11]:3 [12]:0 [13]:1 [14]:2 [15]:3 [16]:861104459 [17]:1414221919 [18]:81 [19]:0 [20]:1065353216 [21]:1065353216 [22]:1065353216 [23]:1065353216 [24]:1065353216 [25]:1065353216 [26]:1065353216 [27]:1065353216 [28]:1065353216 [29]:1065353216 [30]:1065353216 [31]:1065353216 [32]:1065353216 [33]:1065353216 [34]:1065353216 [35]:1065353216 [36]:4 [37]:1065353216 [38]:273 [39]:0
csrRowPtr:
[0]:0 [1]:4 [2]:8 [3]:12 [4]:16 [5]:0 [6]:33 [7]:0 [8]:40336064 [9]:0 [10]:40336496 [11]:0 [12]:0 [13]:0 [14]:49 [15]:0 [16]:40312448 [17]:0 [18]:1685221231 [19]:1952542313 [20]:1701978213 [21]:1730178145 [22]:1919250021 [23]:27745 [24]:1162162274 [25]:1768519237 [26]:81 [27]:0 [28]:0 [29]:0 [30]:0 [31]:0 [32]:1 [33]:1 [34]:1 [35]:1 [36]:2 [37]:2 [38]:2 [39]:2
csrVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:1 [6]:1 [7]:1 [8]:1 [9]:1 [10]:1 [11]:1 [12]:1 [13]:1 [14]:1 [15]:1 [16]:5.60519e-45 [17]:1 [18]:3.82554e-43 [19]:0 [20]:1.70078e-37 [21]:0 [22]:1.70092e-37 [23]:0 [24]:8.96831e-44 [25]:0 [26]:1.70057e-37 [27]:0 [28]:1.56318e-37 [29]:0 [30]:-7.03773e+13 [31]:4.56599e-41 [32]:-7.03773e+13 [33]:4.56599e-41 [34]:1.43493e-42 [35]:0 [36]:7.60905e-43 [37]:0 [38]:1.56322e-37 [39]:0
pretty print:
x x x x
x x x x
x x x x
x x x x
mxm: 0.044384 ms
denseVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:1 [6]:1 [7]:1 [8]:1 [9]:1 [10]:1 [11]:1 [12]:1 [13]:1 [14]:1 [15]:1
x x x x
x x x x
x x x x
x x x x
There were 0 errors out of 16.

Could you double-check that you were on the latest master branch commit?

Hi Carel,
Thanks for the reply.
I meant that it cannot read "symmetric" mtx data, as opposed to "general" data.
You can try reading this file:

%%MatrixMarket matrix coordinate real symmetric
%
4 4 10
1 1 1
2 1 1
2 2 1
3 1 1
3 2 1
3 3 1
4 1 1
4 2 1
4 3 1
4 4 1

Thanks!
Yang

Hi @wyatuestc, thanks. I confirmed that this is indeed a bug, and it should be fixed in the new commit c147935. As pointed out in this issue, the code is meant to filter out self-loops (i.e. elements on the diagonal). However, the bug caused it to skip the filtering when the diagonal nonzero happened to be the 1 1 element (i.e. the first element). Now the code behaves as intended:

ctcyang@mario:~/merge-spmm/build$ bin/gspmm --debug=true --max_ncols=4 ../4sym_coo_dense.mtx
ta:    32
tb:    32
nt:    128
row:   1
debug: 1
%%MatrixMarket matrix coordinate real symmetric
4 4 12
csrColInd:
[0]:1 [1]:2 [2]:3 [3]:0 [4]:2 [5]:3 [6]:0 [7]:1 [8]:3 [9]:0 [10]:1 [11]:2 [12]:3 [13]:1065353216 [14]:65 [15]:0 [16]:1065353216 [17]:1065353216 [18]:1065353216 [19]:1065353216 [20]:1065353216 [21]:1065353216 [22]:1065353216 [23]:1065353216 [24]:1065353216 [25]:1065353216 [26]:1065353216 [27]:1065353216 [28]:7 [29]:1065353216 [30]:65 [31]:0 [32]:17620512 [33]:0 [34]:17483824 [35]:0 [36]:26246064 [37]:0 [38]:0 [39]:1543504138
csrRowPtr:
[0]:0 [1]:3 [2]:6 [3]:9 [4]:12 [5]:0 [6]:33 [7]:0 [8]:26245536 [9]:0 [10]:26245968 [11]:0 [12]:0 [13]:0 [14]:49 [15]:0 [16]:26222064 [17]:0 [18]:1685221231 [19]:1952542313 [20]:1701978213 [21]:1931504737 [22]:1701670265 [23]:1667854964 [24]:1162162176 [25]:1768519237 [26]:81 [27]:0 [28]:0 [29]:0 [30]:0 [31]:1 [32]:1 [33]:1 [34]:2 [35]:2 [36]:2 [37]:3 [38]:3 [39]:3
csrVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:1 [6]:1 [7]:1 [8]:1 [9]:1 [10]:1 [11]:1 [12]:9.80909e-45 [13]:1 [14]:9.10844e-44 [15]:0 [16]:2.58733e-38 [17]:0 [18]:2.54902e-38 [19]:0 [20]:5.30747e-38 [21]:0 [22]:0 [23]:1.4412e+17 [24]:7.03687e+13 [25]:4.56683e-41 [26]:2.93874e-39 [27]:0 [28]:0 [29]:0 [30]:2.03188e-43 [31]:0 [32]:4.37148e-38 [33]:0 [34]:0 [35]:0 [36]:7.03687e+13 [37]:4.56683e-41 [38]:2.93874e-39 [39]:0
pretty print:
0 x x x
x 0 x x
x x 0 x
x x x 0
mxm: 0.054528 ms
denseVal:
[0]:0 [1]:1 [2]:1 [3]:1 [4]:1 [5]:0 [6]:1 [7]:1 [8]:1 [9]:1 [10]:0 [11]:1 [12]:1 [13]:1 [14]:1 [15]:0
0 x x x
x 0 x x
x x 0 x
x x x 0
There were 0 errors out of 12.

As pointed out in the issue, if you don't want it to filter out diagonal elements, you will have to change the default value of remove_self_loops to false here:

merge-spmm/graphblas/util.hpp

Line 220 in 5b84296

bool remove_self_loops=true ) {

Then the result will be the same as the general case:

ctcyang@mario:~/merge-spmm/build$ bin/gspmm --debug=true --max_ncols=4 ../4sym_coo_dense.mtx
ta:    32
tb:    32
nt:    128
row:   1
debug: 1
%%MatrixMarket matrix coordinate real symmetric
4 4 16
csrColInd:
[0]:0 [1]:1 [2]:2 [3]:3 [4]:0 [5]:1 [6]:2 [7]:3 [8]:0 [9]:1 [10]:2 [11]:3 [12]:0 [13]:1 [14]:2 [15]:3 [16]:861104459 [17]:1414221919 [18]:81 [19]:0 [20]:1065353216 [21]:1065353216 [22]:1065353216 [23]:1065353216 [24]:1065353216 [25]:1065353216 [26]:1065353216 [27]:1065353216 [28]:1065353216 [29]:1065353216 [30]:1065353216 [31]:1065353216 [32]:1065353216 [33]:1065353216 [34]:1065353216 [35]:1065353216 [36]:4 [37]:1065353216 [38]:273 [39]:0
csrRowPtr:
[0]:0 [1]:4 [2]:8 [3]:12 [4]:16 [5]:0 [6]:33 [7]:0 [8]:27720240 [9]:0 [10]:27720672 [11]:0 [12]:0 [13]:0 [14]:49 [15]:0 [16]:27696624 [17]:0 [18]:1685221231 [19]:1952542313 [20]:1701978213 [21]:1931504737 [22]:1701670265 [23]:1667854964 [24]:1162162176 [25]:1768519237 [26]:81 [27]:0 [28]:0 [29]:0 [30]:0 [31]:0 [32]:1 [33]:1 [34]:1 [35]:1 [36]:2 [37]:2 [38]:2 [39]:2
csrVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:1 [6]:1 [7]:1 [8]:1 [9]:1 [10]:1 [11]:1 [12]:1 [13]:1 [14]:1 [15]:1 [16]:5.60519e-45 [17]:1 [18]:3.82554e-43 [19]:0 [20]:6.13444e-38 [21]:0 [22]:6.13518e-38 [23]:0 [24]:8.96831e-44 [25]:0 [26]:6.13342e-38 [27]:0 [28]:5.44648e-38 [29]:0 [30]:4.00049 [31]:4.56151e-41 [32]:4.00049 [33]:4.56151e-41 [34]:1.43493e-42 [35]:0 [36]:7.60905e-43 [37]:0 [38]:5.44664e-38 [39]:0
pretty print:
x x x x
x x x x
x x x x
x x x x
mxm: 0.046368 ms
denseVal:
[0]:1 [1]:1 [2]:1 [3]:1 [4]:1 [5]:1 [6]:1 [7]:1 [8]:1 [9]:1 [10]:1 [11]:1 [12]:1 [13]:1 [14]:1 [15]:1
x x x x
x x x x
x x x x
x x x x
There were 0 errors out of 16.
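
To make the entry counts in the two runs above easier to follow, here is a minimal sketch of how a symmetric coordinate file can be expanded into a full entry list. It is not the reader from graphblas/util.hpp: the `Coo` struct, `push_entry` and `expand_symmetric` are hypothetical names, and only the `remove_self_loops` flag is taken from the thread. The 10 stored entries of the 4x4 file contain 4 diagonal and 6 off-diagonal elements, so mirroring gives 12 entries with self-loop removal and 16 without, matching the "4 4 12" and "4 4 16" headers. The 13 in the original report is consistent with the 1 1 element being the only diagonal entry that survived.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 0-based COO container for illustration.
struct Coo {
  std::vector<int> rows, cols;
  std::vector<float> vals;
};

// Append one (row, col, value) triple.
inline void push_entry(Coo& m, int i, int j, float v) {
  m.rows.push_back(i);
  m.cols.push_back(j);
  m.vals.push_back(v);
}

// Expand the stored triangle of a symmetric matrix into a full entry list.
Coo expand_symmetric(const Coo& stored, bool remove_self_loops) {
  assert(stored.rows.size() == stored.cols.size());
  assert(stored.rows.size() == stored.vals.size());
  Coo full;
  for (std::size_t k = 0; k < stored.rows.size(); ++k) {
    const int i = stored.rows[k];
    const int j = stored.cols[k];
    const float v = stored.vals[k];
    if (i == j) {
      // The diagonal test runs for every k, including k == 0 (the "1 1" line).
      if (!remove_self_loops) push_entry(full, i, j, v);
      continue;
    }
    push_entry(full, i, j, v);  // entry as stored in the file
    push_entry(full, j, i, v);  // mirrored entry implied by symmetry
  }
  return full;
}
```

The point of the sketch is that the diagonal check should not depend on the position of the entry in the file, which is the property the fix restores.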

Thanks!
BTW, is it possible to run this code on newer GPU architectures (Turing/Volta/Pascal)?

I've tested on Volta and Pascal and it works, but have not tested Turing. It should work for Turing in theory, or at least with minimal modification.

Thanks! I'm running gspmm on an RTX 2080 (Turing).
I found that the correctness of gspmm depends on the shape of the matrix.
For example, it worked well on a 4x4 matrix but crashed on a 4x32 matrix.
I also ran gspmm on some square matrices and found that it does not work on some of them (512x512, 1024x1024).
I'm not sure whether this is related to the GPU architecture or to some corner cases in the source code.
Thanks!
Yang

4x4 Correct
4x32 Wrong
16x16 Correct
32x32 Correct
64x64 Correct
128x128 Correct
256x256 Correct
512x512 Wrong
1024x1024 Wrong

command:

./bin/gspmm --debug=true --mode="mergepath" ./[matrix].mtx

data:
data.zip

Sorry for the slow response. Thank you so much for bringing this to my attention, Yang! I tried your datasets, and 4 x 32 and 1024 x 1024 are indeed wrong. I tested on a Tesla V GPU with 12GB memory.

I tracked the 4 x 32 error down to an incorrect assumption in the test file gspmm.cu, where I assumed square matrices. As a result, only the first A.nrows of the dense B matrix was initialized correctly. This should be fixed in commit 100ddca. Please see the diff here: 100ddca

I need some more time to investigate the 1024 x 1024 error.
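
For anyone hitting the same non-square problem, here is a minimal CPU reference sketch of C = A * B with A in CSR form. It is not the code from gspmm.cu: the function name `spmm_reference`, its parameter names, and the row-major layout of B and C are assumptions made for illustration. The relevant detail is that B needs one row per column of A, so for a 4 x 32 input B must have 32 rows, not 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C (nrows x bcols) = A (nrows x ncols, CSR) * B (ncols x bcols).
// B and C are stored row-major here; that layout is an assumption.
std::vector<float> spmm_reference(int nrows, int ncols, int bcols,
                                  const std::vector<int>& row_ptr,
                                  const std::vector<int>& col_ind,
                                  const std::vector<float>& val,
                                  const std::vector<float>& B) {
  assert(row_ptr.size() == static_cast<std::size_t>(nrows) + 1);
  assert(col_ind.size() == val.size());
  // B is sized by the number of columns of A, not the number of rows.
  assert(B.size() == static_cast<std::size_t>(ncols) * bcols);

  std::vector<float> C(static_cast<std::size_t>(nrows) * bcols, 0.0f);
  for (int i = 0; i < nrows; ++i) {
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
      const int j = col_ind[k];
      assert(j >= 0 && j < ncols);
      // Accumulate val[k] times row j of B into row i of C.
      for (int c = 0; c < bcols; ++c) {
        C[static_cast<std::size_t>(i) * bcols + c] +=
            val[k] * B[static_cast<std::size_t>(j) * bcols + c];
      }
    }
  }
  return C;
}
```

With a dense 4 x 32 matrix of ones and a 32 x 4 matrix B of ones, every entry of C should be 32. If only the first 4 rows of B were initialized, the remaining 28 rows would contribute whatever happened to be in memory.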

Hi @YangWang92, thanks for pointing out the error. If you use the command bin/gspmm --debug 1 --mode="mergepath" --nt=512 --iter=10 dataset/data/1024_1024coo_dense.mtx, it gives the correct result.

I'm still investigating why this value of nt is the magic number. I suspect it has to do with how the number of blocks is calculated.

Bugs about reading symmetric mtx data and processing data