ekimb / pasha

parallel algorithms for small hitting set approximations


Reported hitting set size different from created set size?

dpellow opened this issue · comments

Hi,
I tried to run Pasha with k=10 per the instructions, and I am seeing that the size of the hitting set printed to stdout is consistently larger than the number of lines in the output file.
Do you also observe this? Do you know what might be causing it?
Possibilities I can think of are:

  1. A race condition between threads when setting or evaluating pick[i] in parallel
  2. The if condition on line 292 evaluating to something different than expected, so that the else on line 304 is not run

It could definitely be something else as well, since I didn't dive too deeply into the code.

Hello David,

Have you tried concatenating the decyc[k].txt and hit[k][L].txt files? The additional set needs to be added to the decycling set to obtain the UHS.

If so, I'll look into this soon to see what might be wrong - it could be bad parallelization.

Yes, I checked, and that is not it. For example, for k=10 the decycling set size (both reported and in the file) is 104968. The reported hitting set size for L=20 is 131874, while the set in the file has 111835 lines; for L=30 the reported size is 57726, while the file has 57331 lines. This is consistent for all multiples of 10 from 20 to 200: the reported number is somewhat higher than the size of the set in the file. I am running with 60 threads, so perhaps the high level of parallelism makes the issue more pronounced than when fewer threads are used.

@ekimb were you able to figure this out? I'm trying to use Pasha for benchmarking and want to know if there are k-mers missing in the sets I generated. Thanks!

I've bumped up to k=12, and now the reported sizes are 2-3x bigger than the sizes of the sets in the files. @ekimb are you able to solve this?

@ekimb from what I can tell, the set is written to the file only on line 298. If that is the whole set and the write is not done in parallel (I'm not sure about this), can't you just count the set size there?
Why does the other set size, based on adding up hittingCountStage, come out to something else?
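To make the suggestion concrete: the reported size can be derived from the writes themselves, so it cannot disagree with the number of lines in the file. A sketch with hypothetical names (`writeHittingSet`, `kmers` are illustrative, not Pasha's actual API):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical writer: counts at the single point where lines are emitted,
// so the returned size equals the number of lines in the output file.
uint64_t writeHittingSet(const std::vector<std::string>& kmers,
                         const std::vector<char>& pick,
                         const std::string& path) {
    std::ofstream out(path);
    uint64_t written = 0;
    for (size_t i = 0; i < kmers.size(); ++i) {
        if (pick[i]) {
            out << kmers[i] << "\n";
            ++written;              // counted exactly when a line is written
        }
    }
    return written;                 // report this, not a per-stage parallel tally
}
```

If the writing loop really is serial, returning this count sidesteps whatever the parallel stage counters are accumulating.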

@ekimb I wrote a test for the hitting sets in the output files. Assuming the toposort and maxlength implementations in Pasha are correct, the output sets are UHSes. This means it is just Pasha's reporting that is wrong.
It is worth fixing this if your or others' results relied on the reported sizes, since in some cases the generated UHSes are much smaller than what is reported, and set size is usually a critical metric.
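For anyone else wanting a quick sanity check on an output file (weaker than a full graph-based test): a UHS for (k, L) must hit every L-long sequence, so every random L-mer should contain at least one set k-mer. Passing this is necessary but not sufficient. A sketch (the function name is mine, not part of Pasha):

```cpp
#include <random>
#include <string>
#include <unordered_set>

// Returns true if each of `trials` random L-mers contains at least one
// k-mer from `uhs`. A true UHS always passes; passing does not prove a UHS.
bool randomizedUhsCheck(const std::unordered_set<std::string>& uhs,
                        int k, int L, int trials) {
    std::mt19937 rng(42);           // fixed seed for reproducibility
    const char alpha[] = "ACGT";
    std::uniform_int_distribution<int> d(0, 3);
    for (int t = 0; t < trials; ++t) {
        std::string s(L, 'A');
        for (char& c : s) c = alpha[d(rng)];
        bool hit = false;
        for (int i = 0; i + k <= L && !hit; ++i)
            hit = uhs.count(s.substr(i, k)) > 0;
        if (!hit) return false;     // found an L-mer the set misses
    }
    return true;
}
```

Loading the concatenated decyc/hit files into the set and running this catches gross omissions quickly, even though only a toposort/longest-path check can certify the set.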