sts10 / tidy

Combine and clean word lists

Home Page: https://sts10.github.io/2021/12/09/tidy-0-2-0.html

Minimum edit distance and shared prefix length make some potentially problematic assumptions

sts10 opened this issue

Both --minimum-edit-distance and --shared-prefix-length always prefer the shorter word when choosing between two words.

When I was implementing these two features, I figured that was a fine assumption. But now that I think about it, if the inputted word list is sorted by desirability, that desirability information is thrown away when Tidy carries out either --minimum-edit-distance or --shared-prefix-length.

I could re-write these two functions to prefer the word that is higher on the input list, BUT for an alphabetically sorted input list, this output might be weirdly skewed toward the front of the alphabet.
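To make the trade-off concrete, here is a minimal sketch of a shared-prefix filter with both tie-break policies. It is not Tidy's actual code: the function name `dedup_by_prefix`, the `prefer_earlier` parameter, and the rule that equal-length ties go to the first word seen are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: keep one word per shared prefix of `prefix_len` characters.
/// If `prefer_earlier` is true, the first word seen for a prefix always wins.
/// Otherwise the shorter word wins (equal-length ties keep the first word seen).
fn dedup_by_prefix(words: &[&str], prefix_len: usize, prefer_earlier: bool) -> Vec<String> {
    // Maps each prefix to the index of its current winner in `kept`.
    let mut slot: HashMap<String, usize> = HashMap::new();
    let mut kept: Vec<String> = Vec::new();

    for &w in words {
        // Words shorter than `prefix_len` use the whole word as their prefix.
        let prefix: String = w.chars().take(prefix_len).collect();
        match slot.get(&prefix).copied() {
            None => {
                slot.insert(prefix, kept.len());
                kept.push(w.to_string());
            }
            Some(i) => {
                // Only the "prefer shorter" policy ever replaces an earlier word.
                if !prefer_earlier && w.chars().count() < kept[i].chars().count() {
                    kept[i] = w.to_string();
                }
            }
        }
    }
    kept
}
```

With the alphabetically sorted input `["abandon", "abate", "abs", "zebra", "zed"]` and a prefix length of 2, the "prefer earlier" policy keeps `abandon` and `zebra` (the alphabetically first word of each group), while the "prefer shorter" policy keeps `abs` and `zed`.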

This could be a reason to implement a --is-sorted boolean flag, so users could tell Tidy whether the inputted list is sorted by some desirable metric.

commented

As a work-around to avoid skewing alphabetically sorted lists, a user could specify --take-rand with an argument equal to the full list size. Or you could automatically randomize the list before running the --minimum-edit-distance and --shared-prefix-length filters, unless the user has specified -O.

Some variation of an --is-sorted boolean may turn out to be a good solution, but I wanted to throw out the above alternatives for consideration as well. At this time, I'm not making a case for one alternative over another.

"desirability" is going to make everything more complicated. Do you feel that it is worth it? I do see value in making more familiar words more likely to be kept, but assuming the user knows the target size they are trying to achieve, they could truncate their input lists. So if you are aiming for 7776, just use an input of the 10,000 most common words without actually rating desirability among them.

"desirability" is going to make everything more complicated.

I think I can keep it relatively uncomplicated by using the order of the inputted word list as a proxy for desirability. This would help me avoid using something like struct Word { s: String, desirability: u32 } throughout the code base.

In practice, I'd add a --is-sorted boolean flag. If that was set to true by the user, whenever Tidy executes a filter that requires an arbitrary choice between two words (I think minimum edit distance and shared prefix length are the only two so far, hence this issue), it would prefer whichever word was first in the given input order. Otherwise, it could continue preferring the shorter word, as a loose stand-in for desirability.
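A rough sketch of how that choice might look, using each word's position in the input list as its rank instead of a dedicated struct. The function name `pick_word`, its signature, and the handling of equal-length ties are assumptions for illustration, not Tidy's implementation.

```rust
/// Hypothetical tie-break between two conflicting words.
/// Each argument is (position in the input list, word).
/// `is_sorted` mirrors the proposed --is-sorted flag.
fn pick_word<'a>(a: (usize, &'a str), b: (usize, &'a str), is_sorted: bool) -> &'a str {
    if is_sorted {
        // The input order is meaningful: keep whichever word came first.
        if a.0 <= b.0 { a.1 } else { b.1 }
    } else {
        // Fall back to preferring the shorter word, measured in characters.
        // Assumption: equal-length ties go to the earlier word.
        let (len_a, len_b) = (a.1.chars().count(), b.1.chars().count());
        if len_a < len_b || (len_a == len_b && a.0 <= b.0) { a.1 } else { b.1 }
    }
}
```

For example, given `(0, "zebra")` and `(1, "zeb")`, the sorted mode keeps `zebra` because it appears first, while the default mode keeps `zeb` because it is shorter.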

but under the assumption that the user knows the target size they are trying to achieve, they could truncate their input lists.

Yeah, I'm coming around to this view... (And Tidy has the --take-first and --whittle-to options to make this truncation process easier.)
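The truncation idea itself is simple enough to sketch in a few lines. The helper below is hypothetical and only illustrates cutting a frequency-sorted list before any filtering. It is not meant to describe how Tidy's own options work internally.

```rust
/// Hypothetical helper: keep only the first `n` words of a list that is
/// already sorted from most to least desirable (for example, by frequency).
/// If the list has fewer than `n` words, all of them are returned.
fn take_first<'a>(words: &[&'a str], n: usize) -> Vec<&'a str> {
    words.iter().take(n).copied().collect()
}
```

Truncating first means that every later arbitrary choice, such as preferring the shorter word, only ever picks among words the user already considers acceptable.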

I made Tidy while working with large word lists that were sorted by frequency, whether from Google Books or Wikipedia. These lists are long and toward the end we get strange words like "aude" and "paniculate". I think I got a little caught up with the idea of automating everything, such that I wouldn't even have to arbitrarily cut the input list down before proceeding. But in my experience actually making word lists, there are plenty of "human"/"arbitrary" choices that need to be made to make a good one.