ekimb / pasha

parallel algorithms for small hitting set approximations


Reported hitting set size different from created set size?

dpellow opened this issue · comments

Hi,
I tried to run Pasha with k=10 per the instructions, and I am seeing that the size of the hitting set printed to stdout is consistently larger than the number of lines in the output file.
Do you also observe this? Do you know what might be causing it?
Possibilities I can think of are:

  1. A race condition between threads when setting or evaluating pick[i] in parallel
  2. The if condition on line 292 evaluating to something different than expected, so that the else on line 304 is not run

It could definitely be something else as well, since I didn't dive too deeply into the code.

Hello David,

Have you tried concatenating the decyc[k].txt and hit[k][L].txt files? The additional set needs to be added to the decycling set to obtain the UHS.

If so, I'll look into this soon to see what might be wrong - it could be bad parallelization.

Yes, I checked, and that is not it. For example, for k=10 the decycling set size (both reported and in the file) is 104968. The reported hitting set size for L=20 is 131874, while the set in the file has 111835 lines; for L=30 the reported size is 57726, while the file has 57331 lines. This is consistent for all multiples of 10 from 20 to 200: the reported number is somewhat higher than the size of the set in the file. I am running with 60 threads, so perhaps the high level of parallelism makes the issue more pronounced than when fewer threads are used.

@ekimb were you able to figure this out? I'm trying to use Pasha for benchmarking and want to know if there are k-mers missing in the sets I generated. Thanks!

I've bumped up to k=12, and now the reported sizes are 2-3x bigger than the sizes of the sets in the files. @ekimb are you able to solve this?

@ekimb from what I can tell, the set is written to the file only on line 298. If that is the whole set and the write is not done in parallel (I'm not sure about this), can't you just count the set size there?
Why does the other set size, based on adding up hittingCountStage, come out to something else?
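To make the suggestion concrete: the reported size can be derived from the writes themselves, so it cannot disagree with the number of lines in the file. A sketch with hypothetical names (`writeHittingSet`, `kmers` are illustrative, not Pasha's actual API):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical writer: counts at the single point where lines are emitted,
// so the returned size equals the number of lines in the output file.
uint64_t writeHittingSet(const std::vector<std::string>& kmers,
                         const std::vector<char>& pick,
                         const std::string& path) {
    std::ofstream out(path);
    uint64_t written = 0;
    for (size_t i = 0; i < kmers.size(); ++i) {
        if (pick[i]) {
            out << kmers[i] << "\n";
            ++written;              // counted exactly when a line is written
        }
    }
    return written;                 // report this, not a per-stage parallel tally
}
```

If the writing loop really is serial, returning this count sidesteps whatever the parallel stage counters are accumulating.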

@ekimb I wrote a test for the hitting sets in the output files. Assuming the toposort and maxlength implementations in Pasha are correct, the output sets are UHSes. This means it is just Pasha's reporting that is wrong.
It is worth fixing this if your or others' results relied on the reported sizes, since in some cases the generated UHSes are much smaller than what is reported, and set size is usually a critical metric.
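For anyone else wanting a quick sanity check on an output file (weaker than a full graph-based test): a UHS for (k, L) must hit every L-long sequence, so every random L-mer should contain at least one set k-mer. Passing this is necessary but not sufficient. A sketch (the function name is mine, not part of Pasha):

```cpp
#include <random>
#include <string>
#include <unordered_set>

// Returns true if each of `trials` random L-mers contains at least one
// k-mer from `uhs`. A true UHS always passes; passing does not prove a UHS.
bool randomizedUhsCheck(const std::unordered_set<std::string>& uhs,
                        int k, int L, int trials) {
    std::mt19937 rng(42);           // fixed seed for reproducibility
    const char alpha[] = "ACGT";
    std::uniform_int_distribution<int> d(0, 3);
    for (int t = 0; t < trials; ++t) {
        std::string s(L, 'A');
        for (char& c : s) c = alpha[d(rng)];
        bool hit = false;
        for (int i = 0; i + k <= L && !hit; ++i)
            hit = uhs.count(s.substr(i, k)) > 0;
        if (!hit) return false;     // found an L-mer the set misses
    }
    return true;
}
```

Loading the concatenated decyc/hit files into the set and running this catches gross omissions quickly, even though only a toposort/longest-path check can certify the set.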