Minimum edit distance and shared prefix length make some potentially problematic assumptions
sts10 opened this issue
Both --minimum-edit-distance and --shared-prefix-length always prefer shorter words when choosing between two words.
When I was implementing these two features, I figured that was a fine assumption. But now that I think about it, if the inputted word list is sorted by desirability, this desirability information is thrown away when Tidy carries out either --minimum-edit-distance or --shared-prefix-length.
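To make the concern concrete, here is a minimal sketch of a "prefer the shorter word" prefix filter. It is not Tidy's actual code: the function name `keep_shortest_per_prefix` and its grouping strategy are assumptions made only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical sketch, NOT Tidy's real implementation: among words that
/// share the same first `prefix_len` characters, keep only the shortest one.
/// Words of equal length keep whichever came first in the input.
fn keep_shortest_per_prefix(words: &[&str], prefix_len: usize) -> Vec<String> {
    // Maps each prefix to the position in `kept` of the word retained for it.
    let mut slot_for_prefix: HashMap<String, usize> = HashMap::new();
    let mut kept: Vec<String> = Vec::new();

    for word in words {
        let prefix: String = word.chars().take(prefix_len).collect();
        let existing = slot_for_prefix.get(&prefix).copied();
        match existing {
            Some(slot) => {
                // Length is the only criterion, so the input order
                // (e.g. frequency rank) plays no part in this choice.
                if word.chars().count() < kept[slot].chars().count() {
                    kept[slot] = word.to_string();
                }
            }
            None => {
                slot_for_prefix.insert(prefix, kept.len());
                kept.push(word.to_string());
            }
        }
    }
    kept
}
```

With an input of "people", "peon", "water" (most common first) and a prefix length of 3, this sketch keeps "peon" and drops "people", even though "people" ranked higher in the input.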
I could re-write these two functions to prefer the word that is higher on the input list, BUT for an alphabetically sorted input list, this output might be weirdly skewed toward the front of the alphabet.
This could be a reason to implement a --is-sorted boolean flag, so users could tell Tidy whether the inputted list is sorted by some desirable metric.
As a work-around to avoid skewing alphabetically sorted lists, a user could specify --take-rand with an argument equal to the full list size. Or you could automatically randomize the list before running --minimum-edit-distance and --shared-prefix-length, unless the user has specified -O.
Some variation of an --is-sorted boolean may turn out to be a good solution, but I wanted to throw out the above alternatives for consideration as well. At this time, I'm not making a case for one alternative over another.
"desirability" is going to make everything more complicated. Do you feel that it is worth it? I do see value in making more familiar words more likely, but under the assumption that the user knows the target size they are trying to achieve, they could truncate their input lists. So if you are aiming for 7776, just use an input of the 10,000 most common words without actually rating desirability among them.
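The truncation idea amounts to very little code. A minimal sketch, assuming the input is already sorted most common first (the helper name `truncate_to_most_common` is hypothetical and not part of Tidy):

```rust
/// Hypothetical helper: keep only the first `n` entries of a list that is
/// assumed to be sorted with the most common words first.
fn truncate_to_most_common<'a>(words: &[&'a str], n: usize) -> Vec<&'a str> {
    // `take` simply stops early if the list has fewer than `n` entries.
    words.iter().take(n).copied().collect()
}
```

Every word that survives the cut is then treated as equally acceptable, so later filters are free to pick between words on any basis, such as length.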
"desirability" is going to make everything more complicated.
I think I can keep it relatively uncomplicated by using the order of the inputted word list as a proxy for desirability. This would help me avoid using something like struct Word { s: String, desirability: u32 } throughout the code base.
In practice, I'd add a --is-sorted boolean flag. If the user set it to true, whenever Tidy executes a filter that requires an arbitrary choice between two words (I think minimum edit distance and shared prefix length are the only two so far, hence this issue), it would prefer whichever word came first in the given input order. Otherwise, it could continue preferring the shorter word, as a loose stand-in for desirability.
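As a rough sketch of that tie-breaking rule, each candidate could carry its position in the input list instead of a dedicated desirability field. The function name `prefer`, the tuple layout, and the fallback to input order when lengths are equal are all assumptions for illustration, not existing Tidy code:

```rust
/// Hypothetical tie-breaker between two conflicting words.
/// Each candidate is (index in the input list, word).
/// `is_sorted` mirrors the proposed --is-sorted flag.
fn prefer<'a>(a: (usize, &'a str), b: (usize, &'a str), is_sorted: bool) -> &'a str {
    if is_sorted {
        // Input order is the proxy for desirability: the earlier word wins.
        if a.0 <= b.0 { a.1 } else { b.1 }
    } else {
        // Current behavior as described in this issue: the shorter word wins.
        // Assumption: equal lengths fall back to input order.
        let (len_a, len_b) = (a.1.chars().count(), b.1.chars().count());
        if len_a < len_b || (len_a == len_b && a.0 <= b.0) {
            a.1
        } else {
            b.1
        }
    }
}
```

For example, `prefer((0, "people"), (1, "peon"), true)` picks "people", while the same call with `false` picks "peon".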
but under the assumption that the user knows the target size they are trying to achieve, they could truncate their input lists.
Yeah, I'm coming around to this view... (And Tidy has the --take-first and --whittle-to options to make this truncation process easier.)
I made Tidy while working with large word lists that were sorted by frequency, whether from Google Books or Wikipedia. These lists are long and toward the end we get strange words like "aude" and "paniculate". I think I got a little caught up with the idea of automating everything, such that I wouldn't even have to arbitrarily cut the input list down before proceeding. But in my experience actually making word lists, there are plenty of "human"/"arbitrary" choices that need to be made to make a good one.