boundary / folsom

Expose Erlang Events and Metrics

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Is the uniform histogram type wrong?

jlouis opened this issue · comments

Please take a look at the code at

https://github.com/boundary/folsom/blob/master/src/folsom_sample_uniform.erl#L50

which updates a sample uniformly in the histogram reservoir. The L46 clause is hit whenever we have fewer than 1028 samples and we insert a new sample in the table. Once we have 1028 samples, we look at N. Suppose N is 2056 since we have taken that many samples. We take a random value, which could be 1768 and then maybe update the reservoir. In half the cases, we won't be bumping the reservoir here, depending ont he random outcome.

I have reservoir's with N > 1000_1000_1000. They will almost never update the reservoir. Is this intended behavior of the uniform sample type? I am afraid some of the logic is wrong and we never ever replace entries in the reservoir for large N.

I could change to slide_uniform to fix this, but I want to make sure I understand how this is supposed to work.

Oh, looks like this is how Vitter's Algorithm R works. So it is more a question of specification than one of implementation mistake.

Thanks for checking @jlouis. @russelldb might have some thoughts on how this would effect slide_uniform.

I need to look into it too. It seems @Vagabond made this change 2be6249#diff-b7a6cde361f08ae87401c3a98eb116c7 that changes the behaviour of the slide_uniform sample.

Ah, it looks like @Vagabond actually fixes slide_uniform as it was using Size not MCnt as the upper bound for the random number generation (i.e. a fixed, rather than growing value.)

So is the plan to change the sample algorithm?

The code as it works now does the right thing. A uniform histogram samples its reservoir over the entire set of inputs. So if you have a million inputs already, the chance of replacing into the sample reservoir is 1/1000. Which is what you want for the complete overview.

Most of the time however, you want some kind of histogram window in which case the slide_uniform or exdec solutions are more appropriate. So it is really a questions of "what do you want". Not a question of what is correct or wrong.

The reason I opened this was because I was getting unexpected data. But it turns out the data are correct and my expectations of what uniform does was wrong :) You can close this one.

@russelldb I don't think we should change anything. But we should definitely consider explaining what the histogram types does in the README :)

Folsom has moved, please resubmit your issue at https://github.com/folsom-project Thanks!