bbusschots / hsxkpasswd

A Perl module and terminal command for generating secure memorable passwords inspired by the fabulous XKCD web comic and Steve Gibson's Password Hay Stacks. This is the library that powers www.xkpasswd.net

Home Page: http://www.bartb.ie/xkpasswd

Regularize the dictionary sizes for each language

bknowles opened this issue · comments

So, I cloned the repo and checked out the code.

787853 lines for PT.pm? Seriously? Your program fails to do its job if the dictionary you're choosing from is too large for the humans to be able to immediately recognize and understand most of the words. Most people have a working vocabulary of about five to ten thousand words, so having a dictionary that is much more than ten thousand words is already stretching it a bit, but not too excessively much.

But three quarters of a million words?!? Even "huge" dictionaries only have on the order of ninety to a hundred thousand words. I can't imagine a dictionary that would have 750,000 words.

This is something I need help with from native speakers of the various languages.

I used the best (or is that least-bad) free and open source dictionary files I found online.

I'm not sure if it would be easier to start over with a different dictionary for each of the non-English languages, or if a native speaker could trim these existing dictionaries down to a more sane size.

Bottom line - this is very much on my radar, but, not something I can do without help from the community.

I’ve started working on a German word list, based mainly on the frami Hunspell dictionary and maybe with some additions from the WinEdit dictionary (the one that comes with HSXKPasswd). I’m aiming at something between 20,000 and 80,000 words.

Just one question: As far as I can tell the minimum word size of 4 chars is hardcoded. Are there plans to make this user-configurable in the future? If not, I’ll discard the shorter words.

Tom

@tflo fantastic - thanks!

There are no plans to allow words shorter than 4 letters, so you can safely ignore them.

I couldn’t spare too much time recently but I already filtered the four-, five- and six-letter words from the Hunspell dictionary. I’ll continue this way up to eight-letter words and I’ll add a reasonable amount of longer words, too. (Up to 12 letters or a bit more.) I will also add words from the WinEdit dictionary.

So far the results are not too shabby. (“Not too shabby” = easy to memorize.)

For example, what I just got with my 6-word (diceware-like) setting:

Urin:hupen:beste:Putin:Bombe:Toxin

I like that one ;-)

You can download my —draft— lists from this directory.
They still contain the commented-out words. The current list has 7057 active words (only 4-, 5- and 6-letter words so far).
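
The length-based filtering described above could be sketched roughly like this. It is only an illustration, not the script actually used for these lists: it assumes Hunspell-style `.dic` entries of the form `word/FLAGS`, and both the `#` marker for commented-out words and the function name `active_words` are hypothetical.

```python
def active_words(lines, min_len=4, max_len=6, comment_prefix="#"):
    """Return unique words whose length lies within [min_len, max_len].

    Assumptions (not taken from the thread): entries look like Hunspell
    .dic lines ("word/FLAGS"), and commented-out words start with "#".
    """
    seen = set()
    result = []
    for raw in lines:
        entry = raw.strip()
        # Skip blank lines, commented-out words, and the numeric
        # word-count line that a Hunspell .dic file starts with.
        if not entry or entry.startswith(comment_prefix) or entry.isdigit():
            continue
        word = entry.split("/", 1)[0]  # drop the affix flags, if any
        if min_len <= len(word) <= max_len and word not in seen:
            seen.add(word)
            result.append(word)
    return result


sample = ["5", "Bombe/N", "#Urin", "hupen/IXY", "Ei", "Toxin/EPS", "Bombe/N"]
print(active_words(sample))  # ['Bombe', 'hupen', 'Toxin']
```

Raising `max_len` step by step (to 8, then 12) would follow the same plan of extending the list one word length at a time.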

Isn't your English dictionary too small? It has only about 1000 words. Is that enough to give a number of combinations comparable to an 8-character password made of Latin letters, digits and special symbols?

@cmrd-senya it could definitely do with being bigger. I'd be delighted to accept a pull request with a bigger one (preferably free of 'naughty' words of course).
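
The comparison raised above can be estimated with a few lines of arithmetic. This is a rough sketch under stated assumptions: the 8-character password is drawn uniformly from the 95 printable ASCII characters, words are picked uniformly and independently, and any extra entropy from separators, digits or case changes is ignored. The function names are made up for illustration.

```python
import math


def bits_per_symbol(pool_size):
    """Entropy in bits of one uniform random pick from pool_size options."""
    return math.log2(pool_size)


def words_needed(dictionary_size, target_bits):
    """Smallest number of random words that reaches target_bits of entropy."""
    return math.ceil(target_bits / bits_per_symbol(dictionary_size))


# Assumption: 95 printable ASCII characters, 8 of them chosen at random.
char_password_bits = 8 * bits_per_symbol(95)

print(round(char_password_bits, 1))             # 52.6
print(words_needed(1000, char_password_bits))   # 6
print(words_needed(10000, char_password_bits))  # 4
```

Under these assumptions a 1000-word list needs six words to match such a password, while a 10,000-word list needs four.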

Maybe this? http://gcide.gnu.org.ua/download

But I glanced at the resulting dictionary, and it will take some work to clean this up to be a word list.

Or this: https://github.com/first20hours/google-10000-english
It's the 10k most common English words. Going from about 1,000 to 10,000 words adds log2(10) ≈ 3.3 bits of entropy per word (if my math is right). And this list removes swear words:
https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-no-swears.txt

That list does include 1, 2 and 3 letter words, but if you remove them, there are still 8,229 words. I'll extract that list and send you a PR.
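
As a minimal sketch of that extraction step and of its effect on entropy: the helper names below are hypothetical, the 1000-word baseline is the approximate figure mentioned earlier in the thread, and words are assumed to be picked uniformly at random.

```python
import math


def drop_short_words(words, min_len=4):
    """Keep only words with at least min_len characters."""
    return [w for w in words if len(w) >= min_len]


def entropy_gain_bits(old_size, new_size):
    """Extra bits of entropy per word when the list grows from old_size to new_size."""
    return math.log2(new_size / old_size)


print(drop_short_words(["a", "the", "that", "with", "of", "password"]))
# ['that', 'with', 'password']

print(round(entropy_gain_bits(1000, 10000), 2))  # 3.32
print(round(entropy_gain_bits(1000, 8229), 2))   # 3.04
```

The gain is per word, so a six-word password built from the 8,229-word list would carry roughly 18 more bits than one built from a 1000-word list.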