Regularize the dictionary sizes for each language

Question

bknowles opened this issue 9 years ago

So, I cloned the repo and checked out the code.

787853 lines for PT.pm? Seriously? Your program fails to do its job if the dictionary it chooses from is so large that people can't immediately recognize and understand most of the words. Most people have a working vocabulary of about five to ten thousand words, so a dictionary much larger than ten thousand words is already stretching it a bit, though not excessively.

But three quarters of a million words?!? Even "huge" dictionaries only have on the order of ninety to a hundred thousand words. I can't imagine a dictionary that would have 750,000 words.

Bart Busscots · Answer 1 · Thu Aug 13 2015 05:45:28 GMT+0800 (China Standard Time)

This is something I need help with from native speakers of the various languages.

I used the best (or is that least-bad) free and open source dictionary files I found online.

I'm not sure if it would be easier to start over with a different dictionary for each of the non-English languages, or if a native speaker could trim these existing dictionaries down to a more sane size.

Bottom line - this is very much on my radar, but, not something I can do without help from the community.

6mot, Tom · Answer 2 · Wed Oct 07 2015 10:48:38 GMT+0800 (China Standard Time)

I’ve started working on a German word list, based mainly on the frami Hunspell dictionary and maybe with some additions from the WinEdit dictionary (the one that comes with HSXKPasswd). I’m aiming at something between 20,000 and 80,000 words.

Just one question: As far as I can tell the minimum word size of 4 chars is hardcoded. Are there plans to make this user-configurable in the future? If not, I’ll discard the shorter words.

Tom

Bart Busscots · Answer 3 · Wed Oct 07 2015 17:35:40 GMT+0800 (China Standard Time)

@tflo fantastic - thanks!

There are no plans to allow words shorter than 4 letters, so you can safely ignore them.

6mot, Tom · Answer 4 · Thu Oct 29 2015 07:53:37 GMT+0800 (China Standard Time)

I couldn’t spare too much time recently but I already filtered the four-, five- and six-letter words from the Hunspell dictionary. I’ll continue this way up to eight-letter words and I’ll add a reasonable amount of longer words, too. (Up to 12 letters or a bit more.) I will also add words from the WinEdit dictionary.

So far the results are not too shabby. (“Not too shabby” = easy to memorize.)

For example, what I just got with my 6-word (diceware-like) setting:

Urin:hupen:beste:Putin:Bombe:Toxin

I like that one ;-)

You can download my —draft— lists from this directory.
They still contain the commented-out words. The current list has 7057 active words (only 4-, 5- and 6-letter words so far).

Senya · Answer 5 · Mon Feb 27 2017 08:25:44 GMT+0800 (China Standard Time)

Isn't your English dictionary too small? It has about 1000 words. Is that enough to give a number of combinations comparable to an 8-character password made of Latin letters, digits and special symbols?
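One rough way to check this is to compare entropy in bits. The sketch below assumes words are picked uniformly and independently, and it assumes a 95-symbol printable ASCII alphabet for the 8-character password; neither figure comes from this thread. It also ignores any separators, digits or padding the generator may add, so it understates the strength of a real generated password. The function names are made up for illustration.

```python
import math


def word_entropy_bits(dict_size: int, num_words: int) -> float:
    """Bits of entropy for num_words words drawn uniformly from dict_size words."""
    return num_words * math.log2(dict_size)


def char_entropy_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy for a random string of the given length."""
    return length * math.log2(alphabet_size)


def words_needed(dict_size: int, target_bits: float) -> int:
    """Smallest number of words whose entropy reaches target_bits."""
    return math.ceil(target_bits / math.log2(dict_size))


# Assumed baseline: 8 random characters from 95 printable ASCII symbols.
target = char_entropy_bits(95, 8)
print(f"8-char baseline: {target:.1f} bits")

for size in (1000, 10000):
    n = words_needed(size, target)
    print(f"{size}-word dictionary: {n} words "
          f"({word_entropy_bits(size, n):.1f} bits)")
```

Under these assumptions, a 1,000-word dictionary needs 6 words to match the baseline, while a 10,000-word dictionary needs 4.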

Bart Busscots · Answer 6 · Mon Feb 27 2017 08:40:21 GMT+0800 (China Standard Time)

@cmrd-senya it could definitely do with being bigger. I'd be delighted to accept a pull request with a bigger one (preferably free of 'naughty' words of course).

Michael Shulman · Answer 7 · Sun May 21 2017 07:06:28 GMT+0800 (China Standard Time)

Maybe this? http://gcide.gnu.org.ua/download

But I glanced at the resulting dictionary, and it will take some work to clean this up to be a word list.

Or this: https://github.com/first20hours/google-10000-english
It's the 10k most common English words. Compared with a 1,000-word list, that adds log2(10) ≈ 3.3 bits of entropy per word (if my math is right). And this list removes swear words:
https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-no-swears.txt

That list does include 1, 2 and 3 letter words, but if you remove them, there are still 8,229 words. I'll extract that list and send you a PR.
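The extraction step could look something like the sketch below. It assumes the input is one word per line and keeps only purely alphabetic words of at least 4 letters, matching the minimum word length mentioned earlier in the thread. The function name and the sample data are made up for illustration, and this is not necessarily how the actual PR was produced.

```python
from typing import Iterable, List


def filter_words(lines: Iterable[str], min_len: int = 4) -> List[str]:
    """Keep unique, purely alphabetic words with at least min_len letters."""
    seen = set()
    kept = []
    for line in lines:
        word = line.strip().lower()
        # Drop short words and anything with digits, hyphens or other symbols.
        if len(word) < min_len or not word.isalpha():
            continue
        # Drop duplicates while preserving the original (frequency) order.
        if word in seen:
            continue
        seen.add(word)
        kept.append(word)
    return kept


# Tiny in-memory sample standing in for the real word list file.
sample = ["the", "of", "search", "free", "Home", "home", "e-mail", ""]
print(filter_words(sample))  # ['search', 'free', 'home']
```

To run it against a real list, pass an open file object instead of `sample`, since iterating over a file yields one line at a time.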