Calculate the Uniqueness

Question

Calculate the Uniqueness

ekelvin opened this issue 4 years ago · comments

It is really great to use shorter unique id, however as we all know this comes with a price.
It would be good to be able to know based on the options what is the probability to have the same id again.
This is really important as will increase easily the usage number of this library once the probability of your usage is easily known.
example for UUID V4 is: ...Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion (ref: https://en.wikipedia.org/wiki/Universally_unique_identifier )

Jean Lescure · Answer 1 · Mon Apr 27 2020 20:26:12 GMT+0800 (China Standard Time)

Hi @ekelvin! Really glad you find the library to be useful :)

I had actually included a task in issue #11 to document why 6 was chosen as the default character length. But now that you mention it, I might as well add a function that returns a probability that a collision may be encountered, what you called uniqueness.

There are two values needed to calculate this, which are:

the total number of possible UUIDs (Hashes) in relation to the given dictionary (let's call this number H)
and the expected number of values we have to choose before finding the first collision (let's call this quantity Q(H))

H is simply the number n of unique characters in the dictionary to the power of the UUID length l:

$H=n^l$

And Q(H) can be approximated as:

$Q(H)\approx\sqrt{\frac{\pi}{2}H}$

(source)

So uniqueness (in the case of this library) in a scale from 0 to 1 would be defined as:

$1-\frac{Q(H)}{H}$

@ekelvin, I shall add this task to our v3 proposals (#11) and will keep this issue open until we merge to master (max. by May 14th, although it seems we might be ready to release this week 🤞 ).

Cheers!

Jean Lescure · Answer 2 · Mon Apr 27 2020 22:45:34 GMT+0800 (China Standard Time)

I've started implementing this feature.

Just wanted to note that the aforementioned uniqueness value assumes that one will perform H rounds of UUID generation. In other words as stated the uniqueness value would be the answer to the problem:

I have a set of n characters and am allowed to create "words" of length l. If I iterate $n^{l}$ times and on each iteration I generate a new "word" by selecting random characters from the set, what is the additive inverse of the probability of generating a "word" I had previously generated (a duplicate) at any given iteration?

As such, the value is useful in of itself as a score of sorts, but I'm sure people will be more inclined to ask:

If I use this lib and expect to perform at most r rounds of UUID generations, what is the probability p that I will hit a duplicate UUID?

The answer would be approximately:

$p(r; H)\approx\frac{\sqrt{\frac{\pi}{2}r}}{H}$

So we should probably implement a function that receives a number of rounds as input and generates the aforementioned probability 🤓

Kelvin · Answer 3 · Tue Apr 28 2020 16:30:55 GMT+0800 (China Standard Time)

That is fantastic :)