tromp / cuckoo

a memory-bound graph-theoretic proof-of-work system

atomicAdd problem

zkh2018 opened this issue · comments

Hi tromp, I ran into a problem with atomicAdd in cuckaroo mean.
File: mean.cu (cuckaroo)
Environment: Ubuntu 18.04, CUDA 10, GTX 1080 Ti
Please try the following steps:

  1. Add this code after line 297:
       printf("%u %u %u\n", bucket, bktIdx, dstIdx[bucket]);
  2. Add this code after line 483:
       cudaDeviceSynchronize();
       return 0;
  3. make cuda29
  4. ./cuda29 > tmp.txt
  5. Wait for the run to finish, or press Ctrl+C
  6. sort tmp.txt > tmp2.txt
  7. tmp2.txt contains repeated (bucket, bktIdx, dstIdx) lines:
0 3 4
0 3 4
1000 14 15
1000 14 15
....
....

I would not expect to see the same (bucket, bktIdx, dstIdx[bucket]) triple twice if atomicAdd works correctly. Yet mean still finds a cycle, so I don't know how to interpret this result.
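The expectation stated here can be checked on the host with a small sketch. It is an analogy, not code from mean.cu: `std::atomic::fetch_add` stands in for CUDA's `atomicAdd`, and `claim_slots` and `all_unique_and_dense` are hypothetical helper names. Every thread claims slots from one shared counter, and the values returned by the atomic increment should then be exactly 0, 1, ..., n-1 with no repeats.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Host-side stand-in for `atomicAdd(dstIdx + bucket, 1)`: each thread claims
// slots from one shared counter and records the value fetch_add returned.
std::vector<uint32_t> claim_slots(unsigned nthreads, unsigned per_thread) {
  std::atomic<uint32_t> counter{0};
  std::vector<std::vector<uint32_t>> claimed(nthreads);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nthreads; t++) {
    workers.emplace_back([&, t] {
      claimed[t].reserve(per_thread);
      for (unsigned i = 0; i < per_thread; i++)
        claimed[t].push_back(counter.fetch_add(1));  // returns the OLD value
    });
  }
  for (auto &w : workers) w.join();

  // Merge the per-thread lists and sort them so repeats would be adjacent.
  std::vector<uint32_t> all;
  for (const auto &v : claimed) all.insert(all.end(), v.begin(), v.end());
  std::sort(all.begin(), all.end());
  return all;
}

// True when the sorted claims are exactly 0, 1, ..., n-1 (no gaps, no repeats).
bool all_unique_and_dense(const std::vector<uint32_t> &sorted) {
  for (size_t i = 0; i < sorted.size(); i++)
    if (sorted[i] != i) return false;
  return true;
}
```

Calling `all_unique_and_dense(claim_slots(8, 1000))` from a `main` (built with something like `g++ -std=c++17 -pthread`) is expected to return true. Note that this only covers the value returned by the atomic increment. A separate read of the counter afterwards, like the third printed column `dstIdx[bucket]`, can legitimately show the same value in several threads.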

buffer[((u64)grp * maxOut + cnt + i) / TMPPERLL4] = *(ulonglong4 *)(&tmp[row][i]);

Another question, about the buffer write above:

grp = row * NX + col, so grp may have the same value in different blocks, and cnt may equal maxOut - nflush in both at the same time. The probability is probably very low, but if grp and cnt are the same in threads of different blocks, I don't know what happens when CUDA threads that are not in the same warp race to write the same address.

Yes, but I only run the round once, like this:

for (u32 i = NB; i--; ) {
Round<1, EDGES_A, EDGES_B/NB><<<tp.trim.blocks/NB, tp.trim.tpb>>>(0,
(uint2*)(bufferA+i*qA), (uint2*)(bufferB+i*qB), indexesE[0]+i*qE,
indexesE[1+i]); // to .632

cudaDeviceSynchronize();
return 0;

if (abort) return false;
}

grp = row * NX + col, so grp may have the same value in different blocks, and cnt may equal maxOut - nflush in both at the same time. The probability is probably very low, but if grp and cnt are the same in threads of different blocks, I don't know what happens when CUDA threads that are not in the same warp race to write the same address.

Yes, with some tiny probability, multiple threads will try to write to the same location at the end of a full bucket. In that case, we presume that the final value is one of the written values.
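A minimal host-side sketch of that overflow case, assuming the clamping pattern `min(atomicAdd(...), maxOut - 1)` quoted later in this thread. It is not the mean.cu kernel: `fill_bucket` and `BucketResult` are hypothetical names, and the slots are `std::atomic` so that the colliding writes are well defined in C++. Once the counter passes the bucket capacity, every further claim is clamped to the last slot, so that slot is overwritten repeatedly and ends up holding one of the written values.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

struct BucketResult {
  std::vector<uint32_t> slots;  // final content of each slot
  uint32_t counter;             // final counter value, may exceed max_out
};

// Each thread claims an index with fetch_add, clamps it to the last slot,
// and stores a value unique to that (thread, iteration) pair.
BucketResult fill_bucket(unsigned nthreads, unsigned per_thread, uint32_t max_out) {
  std::atomic<uint32_t> counter{0};
  std::vector<std::atomic<uint32_t>> slots(max_out);
  for (auto &s : slots) s.store(0);  // 0 means "never written"
  const uint32_t last = max_out - 1;

  std::vector<std::thread> workers;
  for (unsigned t = 0; t < nthreads; t++) {
    workers.emplace_back([&, t] {
      for (unsigned i = 0; i < per_thread; i++) {
        const uint32_t idx = std::min(counter.fetch_add(1), last);
        // Values are 1..nthreads*per_thread, all distinct and non-zero.
        slots[idx].store(t * per_thread + i + 1, std::memory_order_relaxed);
      }
    });
  }
  for (auto &w : workers) w.join();

  BucketResult r;
  for (auto &s : slots) r.slots.push_back(s.load());
  r.counter = counter.load();
  return r;
}
```

With, say, `fill_bucket(4, 100, 64)`, the first 63 slots are each written exactly once, while the last slot receives all remaining writes and keeps whichever one landed last. The counter itself keeps growing past the capacity, which is why only the clamped index is used for addressing.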

have different dstIdx

I meant different dst. Can you print all of

printf("NP=%u round=%u dst=%llx %u %u %u\n", NP, round, (u64)dst, bucket, bktIdx,
dstIdx[bucket]);

and show me the duplicates you get?
Sorry, I have limited access to a GPU card myself.

Hi tromp, I just tried it and still get repeated results:

296           const int bucket = node1 >> ZBITS;
297           const int bktIdx = min(atomicAdd(dstIdx + bucket, 1), maxOut - 1);
298           printf("NP=%u round=%u dst=%llx %u %u %u\n", NP, round, (u64)dst, bucket, bktIdx,dstIdx[bucket]);
299           dst[bucket * maxOut + bktIdx] = (round&1) ? make_uint2(node1, node0) : make_uint2(node0, node1);
482     for (u32 i = NB; i--; ) {
483       Round<1, EDGES_A, EDGES_B/NB><<<tp.trim.blocks/NB, tp.trim.tpb>>>(0, (uint2*)(bufferA+i*qA), (uint2*)(bufferB+i*qB), indexesE[0]+i*qE, indexesE[1+i]); // to .632
484       cudaDeviceSynchronize();
485       return 0;
486       if (abort) return false;
487     }
GeForce GTX 1080 Ti with 11177MB @ 352 bits x 5505MHz
Looking for 42-cycle on cuckaroo29("",0) with 50% edges, 64*64 buckets, 176 trims, and 64 thread blocks.
nonce 0 k0 k1 k2 k3 a34c6a2bdaa03a14 d736650ae53eee9e 9a22f05e3bffed5e b8d55478fa3a606d
NP=1 round=0 dst=7fe50e000000 0 10 11
NP=1 round=0 dst=7fe50e000000 0 11 12
NP=1 round=0 dst=7fe50e000000 0 1 2
NP=1 round=0 dst=7fe50e000000 0 12 13
NP=1 round=0 dst=7fe50e000000 0 13 14
NP=1 round=0 dst=7fe50e000000 0 2 3
NP=1 round=0 dst=7fe50e000000 0 3 4
NP=1 round=0 dst=7fe50e000000 0 5 6
NP=1 round=0 dst=7fe50e000000 0 5 6
NP=1 round=0 dst=7fe50e000000 0 6 7
NP=1 round=0 dst=7fe50e000000 0 8 9
NP=1 round=0 dst=7fe50e000000 0 9 10
NP=1 round=0 dst=7fe50e000000 0 9 10
NP=1 round=0 dst=7fe50e000000 1000 0 1
NP=1 round=0 dst=7fe50e000000 100 0 1
NP=1 round=0 dst=7fe50e000000 1000 10 11
NP=1 round=0 dst=7fe50e000000 1000 10 11
NP=1 round=0 dst=7fe50e000000 1000 11 12
NP=1 round=0 dst=7fe50e000000 1000 1 2
NP=1 round=0 dst=7fe50e000000 1000 13 14
NP=1 round=0 dst=7fe50e000000 1000 14 15
NP=1 round=0 dst=7fe50e000000 1000 15 16
NP=1 round=0 dst=7fe50e000000 1000 16 17
NP=1 round=0 dst=7fe50e000000 1000 16 17
NP=1 round=0 dst=7fe50e000000 1000 2 3
NP=1 round=0 dst=7fe50e000000 1000 3 4
NP=1 round=0 dst=7fe50e000000 1000 3 4
NP=1 round=0 dst=7fe50e000000 1000 5 6
NP=1 round=0 dst=7fe50e000000 1000 7 8
NP=1 round=0 dst=7fe50e000000 1000 8 9
NP=1 round=0 dst=7fe50e000000 1000 9 10
NP=1 round=0 dst=7fe50e000000 1000 9 10
NP=1 round=0 dst=7fe50e000000 100 10 11
NP=1 round=0 dst=7fe50e000000 1001 0 2
NP=1 round=0 dst=7fe50e000000 1001 10 11
NP=1 round=0 dst=7fe50e000000 100 11 12
...

I cannot reproduce that.
To limit the amount of output (I find that huge amounts do not get transferred correctly) I did

  if (bucket==0 && bktIdx<16 && dstIdx[bucket]<16)
        printf("NP=%u round=%u group=%u lid=%u dst=%llx %u %u %u\n", NP, round, group, lid, (u64)dst, bucket, bktIdx,dstIdx[bucket]);

and all my round 0 output was

NP=1 round=0 group=75 lid=303 dst=7f09b8000000 0 3 5
NP=1 round=0 group=66 lid=256 dst=7f09b8000000 0 6 8
NP=1 round=0 group=78 lid=196 dst=7f09b8000000 0 11 12
NP=1 round=0 group=43 lid=273 dst=7f09b8000000 0 0 2
NP=1 round=0 group=66 lid=422 dst=7f09b8000000 0 5 8
NP=1 round=0 group=131 lid=493 dst=7f09b8000000 0 1 3
NP=1 round=0 group=66 lid=82 dst=7f09b8000000 0 2 3
NP=1 round=0 group=103 lid=137 dst=7f09b8000000 0 9 12
NP=1 round=0 group=131 lid=305 dst=7f09b8000000 0 4 5
NP=1 round=0 group=95 lid=136 dst=7f09b8000000 0 10 12
NP=1 round=0 group=79 lid=273 dst=7f09b8000000 0 7 9
NP=1 round=0 group=43 lid=127 dst=7f09b8000000 0 12 14
NP=1 round=0 group=70 lid=277 dst=7f09b8000000 0 8 11
NP=1 round=0 group=71 lid=283 dst=7f09b8000000 0 13 14
NP=1 round=0 group=75 lid=69 dst=7f0960000000 0 10 12
NP=1 round=0 group=114 lid=8 dst=7f0960000000 0 9 12
NP=1 round=0 group=67 lid=41 dst=7f0960000000 0 0 2
NP=1 round=0 group=130 lid=258 dst=7f0960000000 0 3 5
NP=1 round=0 group=26 lid=322 dst=7f0960000000 0 4 5
NP=1 round=0 group=122 lid=166 dst=7f0960000000 0 6 8
NP=1 round=0 group=86 lid=355 dst=7f0960000000 0 11 13
NP=1 round=0 group=27 lid=337 dst=7f0960000000 0 2 3
NP=1 round=0 group=123 lid=478 dst=7f0960000000 0 8 10
NP=1 round=0 group=18 lid=333 dst=7f0960000000 0 5 8
NP=1 round=0 group=27 lid=61 dst=7f0960000000 0 1 2
NP=1 round=0 group=63 lid=152 dst=7f0960000000 0 7 8

where the duplicate bktIdx values occur with different dst.

Can you show me your round=0 output with the above conditional printf?

OK, I get the correct result with your conditional.

Ok; then I'll assume your duplicate output lines were due to overwhelming the printf buffers.
Perhaps you can slowly relax the conditional and see where the duplication starts to arise.
Also, when you print the group and lid, you can check if you really got two different threads
claiming to use the same destination...
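The last check can be automated with a small log post-processor. The following is a sketch under assumptions: it expects lines in the format of the conditional printf above (`NP=... round=... group=... lid=... dst=... bucket bktIdx dstIdx`), and `group_by_slot` and `count_conflicts` are hypothetical helper names. It groups lines by (round, dst, bucket, bktIdx) and counts the slots claimed by more than one distinct (group, lid) pair. A line that is merely printed twice by the same thread is not counted as a conflict.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// A destination slot: round, dst pointer, bucket, bktIdx.
using Slot = std::tuple<unsigned, unsigned long long, unsigned, unsigned>;
// A thread identity as printed in the log: group, lid.
using ThreadId = std::pair<unsigned, unsigned>;

// Parse each log line and collect the set of threads that claimed each slot.
std::map<Slot, std::set<ThreadId>> group_by_slot(const std::vector<std::string> &lines) {
  std::map<Slot, std::set<ThreadId>> claims;
  for (const auto &line : lines) {
    unsigned np, round, group, lid, bucket, bktIdx, counter;
    unsigned long long dst;
    const int n = std::sscanf(line.c_str(),
                              "NP=%u round=%u group=%u lid=%u dst=%llx %u %u %u",
                              &np, &round, &group, &lid, &dst, &bucket, &bktIdx, &counter);
    if (n != 8) continue;  // skip banner lines and anything malformed
    claims[std::make_tuple(round, dst, bucket, bktIdx)].insert(std::make_pair(group, lid));
  }
  return claims;
}

// Number of slots that two or more DIFFERENT threads claimed.
std::size_t count_conflicts(const std::vector<std::string> &lines) {
  std::size_t conflicts = 0;
  for (const auto &kv : group_by_slot(lines))
    if (kv.second.size() > 1) conflicts++;
  return conflicts;
}
```

A result of zero on the full log would suggest that repeated lines come from the output path rather than from two threads sharing a destination. A non-zero result would point at the exact slots worth inspecting.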