Is the uniform histogram type wrong?

Question

jlouis opened this issue 10 years ago

Jesper Louis Andersen commented 10 years ago

Please take a look at the code at

https://github.com/boundary/folsom/blob/master/src/folsom_sample_uniform.erl#L50

which updates a sample uniformly in the histogram reservoir. The L46 clause is hit whenever we have fewer than 1028 samples and we insert a new sample in the table. Once we have 1028 samples, we look at N. Suppose N is 2056 because we have taken that many samples. We draw a random value, say 1768, and then maybe update the reservoir. In half the cases we won't touch the reservoir here, depending on the random outcome.
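For readers following along, the update step described above can be sketched roughly as follows. This is a minimal Python illustration of reservoir sampling in the style of Vitter's Algorithm R, not the folsom code itself; the class name `UniformReservoir`, its fields, and the zero-based indexing are assumptions made for the example.

```python
import random


class UniformReservoir:
    """Hypothetical stand-in for a uniform sample; not folsom's implementation."""

    def __init__(self, size=1028, seed=None):
        self.size = size          # number of slots in the reservoir
        self.n = 0                # total number of values seen so far
        self.slots = []           # the reservoir itself
        self.rng = random.Random(seed)

    def update(self, value):
        """Offer a value; return True if it was stored in the reservoir."""
        self.n += 1
        if len(self.slots) < self.size:
            # Filling phase: every value is kept until the reservoir is full.
            self.slots.append(value)
            return True
        # Replacement phase: draw an index in 0..n-1 and keep the value
        # only if the index falls inside the reservoir (chance size / n).
        r = self.rng.randrange(self.n)
        if r < self.size:
            self.slots[r] = value
            return True
        return False
```

With `size=1028` and 2056 values seen, the draw lands inside the reservoir about half the time, which matches the behaviour described in the question.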

I have reservoirs with N > 1000_1000_1000. They will almost never be updated. Is this the intended behavior of the uniform sample type? I am afraid some of the logic is wrong and we never replace entries in the reservoir for large N.

I could change to slide_uniform to fix this, but I want to make sure I understand how this is supposed to work.

Jesper Louis Andersen · Answer 1 · Mon Jul 07 2014 21:01:28 GMT+0800 (China Standard Time)

Oh, looks like this is how Vitter's Algorithm R works. So it is more a question of specification than of an implementation mistake.

Joe Williams · Answer 2 · Wed Jul 09 2014 04:29:45 GMT+0800 (China Standard Time)

Thanks for checking @jlouis. @russelldb might have some thoughts on how this would affect slide_uniform.

Russell Brown · Answer 3 · Wed Jul 09 2014 15:08:48 GMT+0800 (China Standard Time)

I need to look into it too. It seems @Vagabond made this change 2be6249#diff-b7a6cde361f08ae87401c3a98eb116c7 that changes the behaviour of the slide_uniform sample.

Russell Brown · Answer 4 · Wed Jul 09 2014 15:15:59 GMT+0800 (China Standard Time)

Ah, it looks like @Vagabond actually fixes slide_uniform as it was using Size not MCnt as the upper bound for the random number generation (i.e. a fixed, rather than growing value.)
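To see why the upper bound matters, here is a simplified single-window sketch in Python. It is not the slide_uniform code; the function `count_replacements` and its parameters are hypothetical. It only contrasts a fixed bound (the reservoir size) with a growing bound (the number of values seen so far).

```python
import random


def count_replacements(size, total, growing_bound, seed=0):
    """Count how many values after the first `size` ones enter the reservoir.

    growing_bound=False draws from 0..size-1 (fixed bound),
    growing_bound=True draws from 0..count-1 (bound grows with each value).
    """
    rng = random.Random(seed)
    replaced = 0
    for count in range(size + 1, total + 1):
        bound = count if growing_bound else size
        # The value is stored only when the draw falls inside the reservoir.
        if rng.randrange(bound) < size:
            replaced += 1
    return replaced
```

With a fixed bound the draw is always inside the reservoir, so every new value overwrites a slot and the sample is biased toward recent values. With a growing bound the replacement chance shrinks as `size / count`.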

So is the plan to change the sample algorithm?

Jesper Louis Andersen · Answer 5 · Wed Jul 09 2014 17:21:46 GMT+0800 (China Standard Time)

The code as it works now does the right thing. A uniform histogram samples its reservoir over the entire set of inputs. So if you have a million inputs already, the chance that a new value is placed into the sample reservoir is about 1/1000 (1028 slots out of 1,000,000 inputs), which is what you want for the complete overview.
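The arithmetic behind that figure can be checked with a short Python sketch. The helper names below are made up for illustration, and the expected-replacement count uses the standard logarithmic approximation of a harmonic sum, so treat it as an estimate rather than an exact value.

```python
import math


def replacement_probability(size, n):
    """Chance that the n-th value enters a full reservoir of `size` slots."""
    return min(1.0, size / n)


def expected_replacements(size, n_from, n_to):
    """Approximate number of replacements while N grows from n_from to n_to.

    Uses sum(size / n) ~= size * ln(n_to / n_from), valid for n_from >= size.
    """
    return size * math.log(n_to / n_from)


p_2056 = replacement_probability(1028, 2056)       # half of the draws replace
p_million = replacement_probability(1028, 10**6)   # roughly 1 in 1000
p_billion = replacement_probability(1028, 10**9)   # roughly 1 in a million
```

So for N around 10^9 a new value almost never lands in the reservoir, which is the expected behaviour for a sample that is uniform over the whole input history.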

Most of the time, however, you want some kind of histogram window, in which case the slide_uniform or exdec solutions are more appropriate. So it is really a question of "what do you want", not a question of what is correct or wrong.

The reason I opened this was because I was getting unexpected data. But it turns out the data are correct and my expectations of what uniform does were wrong :) You can close this one.

Jesper Louis Andersen · Answer 6 · Wed Jul 09 2014 17:23:06 GMT+0800 (China Standard Time)

@russelldb I don't think we should change anything. But we should definitely consider explaining what the histogram types do in the README :)

Joe Williams · Answer 7 · Wed Oct 28 2015 22:06:48 GMT+0800 (China Standard Time)

Folsom has moved; please resubmit your issue at https://github.com/folsom-project. Thanks!