Checking your math...
bwbug opened this issue · comments
In The Similar-Words Problem in your Readme, you wrote:
If we assumed a hypothetical 18,000 word list that was just 9,000 words and their plurals, I think the odds of getting at least one "awkward double" in a 4-word passphrase is (1/18000) * (2/18000) * (3/18000), which is a really small number. But check my math!
Although your conclusion is correct ("the odds...is a really small number"), the odds of this happening is over 600 million times more probable than what you have estimated.
The correct probability is 1/9000 + 2/9000 + 3/9000 - 11/9000**2 + 6/9000**3
.
To prove this for a word list containing N words and their plurals (2 N words total), if P1 is the probability of getting at least one "awkward double", and if P0 is the probability of getting no awkward doubles, then
P1 =1 - P0
The probability if getting no awkward doubles (P0) is the number of passphrases containing only unique stems (i.e., once a word has been selected, it cannot be reselected itself, and neither can its conjugate -- the plural or singular form, whichever was not picked in the previous selection), divided by the total number of possible passphrases. For a passphrase consisting of k words, the total number of passphrases is
Ntotal = (2 N)k
To compute the number of passphrases containing only unique stems, the size of the word pool is reduced by 2 each time a word is selected (because the word itself is eliminated from further consideration, as is the plural/singular form of that word):
Nunique = (2 N)(2 N - 2)(2 N - 4) ... (2 N - 2 k +2)
Therefore, the probability of getting only unique stems is
P0 = Nunique/Ntotal = (1 - (1/N))(1 - (2/N))...(1 - (k-1)/N)
Therefore, the general solution for the probability of getting at least one "awkward double" is
P1 = 1 - (1 - (1/N))(1 - (2/N))...(1 - (k-1)/N)
For k=4, the math works out to the following result:
P1 = 6/N - 11/N2 + 6/N3
If, in the general solution, one neglects higher-order terms (N-2, N-3, etc.), the following approximate solution is obtained:
P1 ≈ 1/N + 2/N + ... + (k - 1)/N = (k(k - 1))/(2 N)
I can't quite follow this all the way through right now, but I trust your work! README has been updated. Thank you so much!
Thanks! Let me know if you want me to try to clarify any particular steps of the derivation -- I did skip over some algebra steps in a few of the equations above.