sts10 / generated-wordlists

A repository for word lists I've generated

Home Page: https://sts10.github.io/2020/09/30/making-a-word-list.html

Checking your math...

bwbug opened this issue

commented

In The Similar-Words Problem in your Readme, you wrote:

> If we assumed a hypothetical 18,000 word list that was just 9,000 words and their plurals, I think the odds of getting at least one "awkward double" in a 4-word passphrase is (1/18000) * (2/18000) * (3/18000), which is a really small number. But check my math!

Although your conclusion is correct (the odds are "a really small number"), an awkward double is over 600 million times more probable than your estimate.

The correct probability is $1/9000 + 2/9000 + 3/9000 - 11/9000^2 + 6/9000^3 \approx 6.67 \times 10^{-4}$, or roughly 1 in 1,500.
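As a quick sanity check on the "600 million times" figure, here is a minimal Python sketch that evaluates both expressions with exact rational arithmetic. The variable names (`readme_estimate`, `corrected`, `ratio`) are illustrative and are not taken from the README.

```python
from fractions import Fraction

N = 9000  # number of stems; the hypothetical list holds 2 * N = 18,000 words

# Estimate quoted from the README.
readme_estimate = Fraction(1, 18000) * Fraction(2, 18000) * Fraction(3, 18000)

# Corrected probability of at least one awkward double in a 4-word passphrase.
corrected = Fraction(6, N) - Fraction(11, N**2) + Fraction(6, N**3)

# How many times more likely the event is than the README estimate suggests.
ratio = corrected / readme_estimate

print(f"README estimate: {float(readme_estimate):.3e}")
print(f"Corrected:       {float(corrected):.3e}")
print(f"Ratio:           {float(ratio):.3e}")
```

Using `Fraction` avoids any floating-point rounding in the comparison. The values are converted to floats only for printing.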

To prove this, consider a word list containing $N$ words and their plurals ($2N$ words total). If $P_1$ is the probability of getting at least one "awkward double" and $P_0$ is the probability of getting none, then

$$P_1 = 1 - P_0$$

The probability of getting no awkward doubles ($P_0$) is the number of passphrases containing only unique stems, divided by the total number of possible passphrases. "Unique stems" means that once a word has been selected, neither that word nor its counterpart (the plural or singular form, whichever was not picked) can be selected again. For a passphrase consisting of $k$ words drawn independently from the full list, the total number of passphrases is

$$N_{\text{total}} = (2N)^k$$

To count the passphrases containing only unique stems, note that the pool of allowed words shrinks by 2 each time a word is selected (the word itself is eliminated from further consideration, as is its plural/singular form):

$$N_{\text{unique}} = (2N)(2N - 2)(2N - 4) \cdots (2N - 2k + 2)$$

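Dividing each of the $k$ factors $(2N - 2j)$ in the unique-stem count by $2N$ gives $(1 - j/N)$, for $j = 0, 1, \ldots, k-1$. Therefore, the probability of getting only unique stems is

$$P_0 = \frac{N_{\text{unique}}}{N_{\text{total}}} = \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right) \cdots \left(1 - \frac{k-1}{N}\right)$$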
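The counting argument can be checked by brute force on a tiny word list. The sketch below is only an illustration under stated assumptions: words are modeled as hypothetical `(stem, form)` pairs, the function names are invented for this example, and an exact repeat of the same word is counted as a double, as in the derivation above.

```python
from fractions import Fraction
from itertools import product


def p0_brute_force(n_stems: int, k: int) -> Fraction:
    """Fraction of k-word passphrases in which no stem appears twice."""
    # Model each word as (stem, form): form 0 is singular, form 1 is plural.
    words = [(stem, form) for stem in range(n_stems) for form in (0, 1)]
    total = 0
    unique = 0
    for phrase in product(words, repeat=k):
        total += 1
        stems = {stem for stem, _ in phrase}
        if len(stems) == k:  # every word came from a different stem
            unique += 1
    return Fraction(unique, total)


def p0_formula(n_stems: int, k: int) -> Fraction:
    """Product (1 - 1/N)(1 - 2/N)...(1 - (k-1)/N)."""
    result = Fraction(1)
    for j in range(1, k):
        result *= 1 - Fraction(j, n_stems)
    return result


# Enumeration is only feasible for very small lists, so check a few tiny cases.
for n_stems in (4, 5, 6):
    assert p0_brute_force(n_stems, 4) == p0_formula(n_stems, 4)
```

Enumerating all $(2N)^k$ passphrases is hopeless for a real 18,000-word list, which is why the closed-form product is useful.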

Therefore, the general solution for the probability of getting at least one "awkward double" is

$$P_1 = 1 - \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right) \cdots \left(1 - \frac{k-1}{N}\right)$$

For k=4, the math works out to the following result:

$$P_1 = \frac{6}{N} - \frac{11}{N^2} + \frac{6}{N^3}$$
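For readers who want the algebra that was skipped for the case of four words, one way to expand the product is to multiply the first two factors and then bring in the third:

```latex
\begin{aligned}
\left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)
  &= 1 - \frac{3}{N} + \frac{2}{N^2} \\
\left(1 - \frac{3}{N} + \frac{2}{N^2}\right)\left(1 - \frac{3}{N}\right)
  &= 1 - \frac{6}{N} + \frac{11}{N^2} - \frac{6}{N^3} \\
P_1 = 1 - \left(1 - \frac{6}{N} + \frac{11}{N^2} - \frac{6}{N^3}\right)
  &= \frac{6}{N} - \frac{11}{N^2} + \frac{6}{N^3}
\end{aligned}
```

The coefficient 11 comes from collecting the second-order terms: $2 + 9 = 11$.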

If, in the general solution, one neglects higher-order terms ($N^{-2}$, $N^{-3}$, etc.), the following approximate solution is obtained:

$$P_1 \approx \frac{1}{N} + \frac{2}{N} + \cdots + \frac{k-1}{N} = \frac{k(k-1)}{2N}$$
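To get a feel for how little the neglected terms matter at this list size, the following sketch compares the exact product with the first-order approximation for a few passphrase lengths. The function names `p1_exact` and `p1_approx` are hypothetical, and $N = 9000$ is the value from the README's example.

```python
from fractions import Fraction


def p1_exact(n_stems: int, k: int) -> Fraction:
    """Exact probability of at least one awkward double in k words."""
    p0 = Fraction(1)
    for j in range(1, k):
        p0 *= 1 - Fraction(j, n_stems)
    return 1 - p0


def p1_approx(n_stems: int, k: int) -> Fraction:
    """First-order approximation k(k - 1) / (2N)."""
    return Fraction(k * (k - 1), 2 * n_stems)


for k in (4, 5, 6):
    exact = p1_exact(9000, k)
    approx = p1_approx(9000, k)
    rel_err = float((approx - exact) / exact)
    print(f"k={k}: exact={float(exact):.6e} "
          f"approx={float(approx):.6e} rel_err={rel_err:.2e}")
```

The approximation slightly overestimates the exact value here, because the first neglected term is subtracted in the exact expression.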

I can't quite follow this all the way through right now, but I trust your work! README has been updated. Thank you so much!

commented

Thanks! Let me know if you want me to try to clarify any particular steps of the derivation -- I did skip over some algebra steps in a few of the equations above.