sts10 / tidy

Combine and clean word lists

Home Page: https://sts10.github.io/2021/12/09/tidy-0-2-0.html

(Nontrivial) use-cases for --take-rand?

sts10 opened this issue · comments

Just wondering what a (non-trivial) use of the --take-rand option is, especially now that we have a well-thought-out --print-rand option, which gives users a more "accurate" handle on how long the resulting list will be (since it makes its cuts later in the process).

If we can't come up with a non-trivial use for it, I'm not necessarily in favor of removing the option -- I think the symmetry with --take-first and --print-rand is nice to have. But that's a natural follow-up...

Found one: Let's say you have an alphabetically sorted 100,000-word list (or multiple lists that total 100,000 words) as your input. You want to make a 7,776-word list. And let's say you want these 7,776 words to be Schlinkert pruned, a time-intensive process.

tidy -m 3 -K --print-rand 7776 100k.txt will take hours and hours to run, because Tidy is running the Schlinkert prune on almost all 100,000 words (mercifully, words under 3 characters won't be passed on to the Schlinkert prune process).

Running tidy --take-rand 12000 -m 3 -K --print-rand 7776 100k.txt is much faster, since --take-rand cuts the 100k input words down to 12k before any further processing. On my laptop, it finished in under 5 minutes.
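To make the early-cut vs. late-cut difference concrete, here is a rough Python sketch. It does not call tidy and does not implement the Schlinkert prune: `expensive_prune` is a made-up stand-in that is deliberately quadratic, and `pipeline`, `take_rand`, `print_rand` and `min_len` are hypothetical names that only mirror the order of operations described above.

```python
import random

def expensive_prune(words):
    """Stand-in for a costly step (NOT the Schlinkert prune).

    Drops any word that is a proper prefix of another word, using a
    deliberately quadratic scan. Returns (kept_words, comparisons_made).
    """
    comparisons = 0
    kept = []
    for w in words:
        is_prefix = False
        for other in words:
            comparisons += 1
            if other != w and other.startswith(w):
                is_prefix = True
                break
        if not is_prefix:
            kept.append(w)
    return kept, comparisons

def pipeline(words, take_rand=None, print_rand=None, min_len=3, seed=0):
    """Hypothetical pipeline: early random cut, length filter,
    expensive prune, then late random cut."""
    rng = random.Random(seed)
    if take_rand is not None and take_rand < len(words):
        words = rng.sample(words, take_rand)    # cut BEFORE the costly step
    words = [w for w in words if len(w) >= min_len]
    words, cost = expensive_prune(words)
    if print_rand is not None and print_rand < len(words):
        words = rng.sample(words, print_rand)   # cut AFTER the costly step
    return words, cost

# Synthetic input: 1,000 unique, equal-length words (so nothing gets pruned).
words = [f"w{i:05d}" for i in range(1000)]

late_only, cost_late = pipeline(words, print_rand=100)
early_cut, cost_early = pipeline(words, take_rand=200, print_rand=100)

print(cost_late, cost_early)  # prints: 1000000 40000
```

Both runs end with 100 words, but the run with the early cut does a small fraction of the comparisons. The exact ratio depends entirely on how the costly step scales; the quadratic stand-in here is only an assumption for illustration.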

commented

--take-rand could be used repeatedly, if one wants to check how sensitive the outcome (say, some attribute) of some wordlist-generating method (tidy configuration) is to differences in input lists. For example, to check how variable the number of words cut by the Schlinkert pruning algorithm is, you can repeatedly run

tidy -KILA -ds --take-rand 1000 --dry-run enwiki-20210820-words-frequency.txt
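The same idea can be sketched without tidy at all. The Python below is only an illustration under stated assumptions: `remove_prefix_words` is a simple stand-in (it is not the Schlinkert prune), `cut_counts` is a hypothetical helper, and the word list is synthetic. It draws repeated random samples, applies the stand-in prune to each, and records how many words were cut, so the spread across runs can be inspected.

```python
import itertools
import random
import statistics

def remove_prefix_words(words):
    """Stand-in prune (NOT tidy's Schlinkert prune): drop every word
    that is a proper prefix of another word in the list."""
    word_set = set(words)
    prefixes = set()
    for w in word_set:
        for i in range(1, len(w)):
            if w[:i] in word_set:
                prefixes.add(w[:i])
    return [w for w in words if w not in prefixes]

def cut_counts(words, sample_size, runs, seed=0):
    """Repeatedly sample `sample_size` words and count how many the
    stand-in prune removes from each sample."""
    rng = random.Random(seed)
    counts = []
    for _ in range(runs):
        sample = rng.sample(words, sample_size)
        counts.append(sample_size - len(remove_prefix_words(sample)))
    return counts

# Synthetic word list: every string over "abc" of length 1 to 4 (120 words).
words = [
    "".join(chars)
    for n in range(1, 5)
    for chars in itertools.product("abc", repeat=n)
]

counts = cut_counts(words, sample_size=40, runs=20)
print("mean cut:", statistics.mean(counts))
print("std dev: ", statistics.stdev(counts))
```

A small standard deviation relative to the mean would suggest the number of words cut is not very sensitive to which input words happen to be sampled; a large one would suggest the opposite.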