sts10 / tidy

Combine and clean word lists

Home Page: https://sts10.github.io/2021/12/09/tidy-0-2-0.html

Feature Request: Retain word metadata or output in original sort order

bwbug opened this issue · comments

commented

First of all, great work! I am just getting started with passphrases and word lists, and expect this tool to be a tremendously useful resource.

Let me try to explain my request. Word list creation may include additional metadata about each word. For example, EFF considered both "prevalence" and "concreteness" when designing their Long Wordlist. In general, one might want to create a word list that contains the first N words sorted according to some metadata quantity (e.g., a weighted score of "prevalence" and "concreteness"). However, the default behavior of returning the new word list sorted alphabetically makes this difficult.

Thus, a simple enhancement would be to include an option to return the new word list in its original sort order, instead of being sorted alphabetically. A more involved functionality enhancement would be to accept input word lists that contain one or more comma-separated values following each word, and preserve these values unchanged in the final output (this would allow the user to manually sort or filter the resulting list after it has been processed by Tidy).

Thanks for the compliment! Hope you find the tool useful.

... a simple enhancement would be to include an option to return the new word list in its original sort order, instead of being sorted alphabetically.

Love this idea. I decided to keep the alphabetical sort by default, but allow users to override it with a new option:

    -O, --no-alpha
            Do NOT sort outputted list alphabetically. Preserves original list order. Note that
            duplicate lines and blank lines will still be removed

See 9c4f904.
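For illustration, the behavior described in that help text (keep the original order, but still drop blank lines and duplicates) could look roughly like the sketch below. This is not the code from the commit; the function name `dedup_preserving_order` is hypothetical.

```rust
use std::collections::HashSet;

/// Remove blank lines and duplicates while keeping first-seen order.
/// (Hypothetical helper, not Tidy's actual implementation.)
fn dedup_preserving_order(lines: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &line in lines {
        let word = line.trim();
        if word.is_empty() {
            continue; // blank lines are always dropped
        }
        // `insert` returns false when the value was already present.
        if seen.insert(word.to_string()) {
            out.push(word.to_string());
        }
    }
    out
}

fn main() {
    let input = ["the", "of", "", "the", "in"];
    println!("{:?}", dedup_preserving_order(&input));
}
```

The first occurrence of each word wins, so a list that is already ranked by some score stays ranked.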

A more involved functionality enhancement would be to accept input word lists that contain one or more comma-separated values following each word, and preserve these values unchanged in the final output

I'm a bit confused by this. Currently, a list like:

and,1000
toucan,4
needless,50

would keep the commas and integers if you ran it through Tidy without any options.

I'd say that being able to sort words like the above by the after-comma integer is a bit outside of the scope of Tidy, but I maybe could be convinced...

commented

I have to confess that I haven't yet experimented directly with Tidy (because I don't yet have Rust installed, and I need to figure out what kind of footprint that would leave on my laptop before I do -- any chance of you distributing an executable for Windows?), so I wasn't aware that the current behavior was to preserve comma-separated data. In this case, no additional changes would be necessary.

I agree with your comment that sorting on the numerical "column" would be outside the scope of Tidy. This can easily be done with other tools, since the numerical values are preserved in the output.

Totally understand the hesitancy to install Rust just to use Tidy.

any chance of you distributing an executable for Windows?

Unfortunately I don't have much experience with Rust binaries, but as an experiment I created a release with a binary here: https://github.com/sts10/tidy/releases/tag/v0.2.30

I give it about a 25% chance it works for Windows/you, but figured it was worth a shot. I'll keep poking around for a "How to make a Rust binary for Windows" blog post...

commented

Thank you! I will try the binary* after I back up. I found this write-up on how to distribute binaries created from Rust, maybe it will be helpful:

https://rust-cli.github.io/book/tutorial/packaging.html#distributing-binaries

*Update: Either it didn't work, or I wasn't able to figure out how to get it to work (maybe I need to manually add a .exe extension to the tidy file after downloading?). No worries, I have a Windows computer at another location where I could install Rust and then use the instructions at the link above to create a portable binary.

Ah, sorry.

Let me know if you get a binary working and I'll gladly host it on this repo!

commented

Hopefully you saw the binary that I posted in Issue #10.

Now that I've had a chance to experiment a bit with using tidy, I think that this feature request (#8) could be re-opened (to address the part about retaining metadata).

Yes it is true that tidy currently retains any trailing comma-separated values when processing a file. However, it apparently considers these characters to be part of the word that appears at the start of each row, which is not what I was looking for. What I meant is that all trailing data (following and including the first comma -- alternative separator characters could also be considered) should be ignored when doing all analysis and modification of the word list, and then appended unchanged to the corresponding entry in the new word list.

For example, suppose we have the following original word list words.txt, which includes rank and frequency metadata about each word:

the,1,175029308
of,2,83135650
in,3,71869293

Currently running tidy -in words.txt will produce the following:

in
of
the

However, what I would want is instead the following:

in,3,71869293
of,2,83135650
the,1,175029308

Similarly, something like tidy -Om 3 words.txt should output

the,1,175029308

(whereas currently, no words are removed from the list).
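To make the requested behavior concrete, here is a minimal sketch in Rust. It assumes the word is everything before the first comma and that the rest of the row travels along untouched. The helper names (`split_word`, `sort_by_word`, `filter_min_length`) are hypothetical and are not part of Tidy.

```rust
/// Split a row at the first `delimiter` into (word, metadata).
/// The metadata keeps its leading delimiter so the row can be rebuilt unchanged.
fn split_word(line: &str, delimiter: char) -> (&str, &str) {
    match line.find(delimiter) {
        Some(i) => (&line[..i], &line[i..]),
        None => (line, ""),
    }
}

/// Sort rows alphabetically by their word part only.
fn sort_by_word(rows: &mut [String], delimiter: char) {
    rows.sort_by(|a, b| split_word(a, delimiter).0.cmp(split_word(b, delimiter).0));
}

/// Keep rows whose word part has at least `min_len` characters.
fn filter_min_length(rows: &[&str], delimiter: char, min_len: usize) -> Vec<String> {
    let mut out = Vec::new();
    for &row in rows {
        let (word, _metadata) = split_word(row, delimiter);
        if word.chars().count() >= min_len {
            out.push(row.to_string());
        }
    }
    out
}

fn main() {
    let rows = ["the,1,175029308", "of,2,83135650", "in,3,71869293"];
    let mut sorted: Vec<String> = rows.iter().map(|r| r.to_string()).collect();
    sort_by_word(&mut sorted, ',');
    println!("{:?}", sorted);
    println!("{:?}", filter_min_length(&rows, ',', 3));
}
```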

This request is also somewhat related to the P.S. at the end of Issue #11 (in which I mention the problem of stripping information that trails each word). Perhaps the two issues could be dealt with together, because both involve terminating the reading of each word at the first appearance of some delimiter (e.g., comma, tab, space, or even something user-defined) -- in some cases we want to ignore but preserve the trailing information, but in other cases we want to discard the trailing information.

What do you think?

Ah, yes, now I understand, thanks to the tidy -Om 3 words.txt example. It'd be like an --ignore-but-preserve-after-first-comma option, haha.

Given the way I laid out the code, this would be difficult to implement without making a big mess in the code. But now that I understand the request I can mull it over a bit. Maybe there is a creative solution.

commented

I see a duality between this feature ("ignore from 1st separator") and the new "delete from 1st separator" options. Not sure if that helps with implementation. It may help with UI design, though (e.g., the -C -S -T flags could delete from the corresponding separator unless there is also a -K flag present, which would signify "keep").

I know nothing about Rust programming paradigms, but I would envision maybe first creating an index vector (1, 2, 3, ..., N) with the same number of entries as the original list, and then splitting the rows into two string vectors (which you should be able to do the same way that you do -s -S -t -T, etc.). The tricky part would be to keep track of every word removal, and ensure that you also remove the corresponding index value from the index vector. If you can do that, then the rest "should" be simple -- just use the index vector to prune the "metadata" vector, and then join the "word" and "metadata" vectors together, element-by-element.
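A rough Rust sketch of that index-vector idea follows, under the assumption that every removal rule can be expressed as a yes/no test on the word alone. All function names are hypothetical; this only illustrates the bookkeeping, not how Tidy is organized.

```rust
/// Split each row at the first `delimiter` into a word and its trailing metadata.
/// The metadata keeps the delimiter so rows can be rejoined unchanged.
fn split_rows(rows: &[&str], delimiter: char) -> (Vec<String>, Vec<String>) {
    let mut words = Vec::new();
    let mut metadata = Vec::new();
    for &row in rows {
        match row.find(delimiter) {
            Some(i) => {
                words.push(row[..i].to_string());
                metadata.push(row[i..].to_string());
            }
            None => {
                words.push(row.to_string());
                metadata.push(String::new());
            }
        }
    }
    (words, metadata)
}

/// Return the indices of the words that survive a removal rule.
fn surviving_indices(words: &[String], keep: impl Fn(&str) -> bool) -> Vec<usize> {
    (0..words.len())
        .filter(|&i| keep(words[i].as_str()))
        .collect()
}

/// Rejoin word and metadata element-by-element, for surviving indices only.
fn rejoin(words: &[String], metadata: &[String], kept: &[usize]) -> Vec<String> {
    kept.iter()
        .map(|&i| format!("{}{}", words[i], metadata[i]))
        .collect()
}

fn main() {
    let rows = ["the,1,175029308", "of,2,83135650", "in,3,71869293"];
    let (words, metadata) = split_rows(&rows, ',');
    // Example removal rule: minimum word length of 3.
    let kept = surviving_indices(&words, |w| w.chars().count() >= 3);
    println!("{:?}", rejoin(&words, &metadata, &kept));
}
```

Because the index list is the only thing that shrinks, the metadata vector never needs to be touched until the final join.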

A second idea* would be using an approach where every word removal action does not actually remove the word, but instead replaces it with some dummy word that could never be mistaken for a valid word, and which would not be altered by any of the filters that delete or replace characters (e.g., aAaAaAaAaAaAaAaAaAa, or xxxTxxxIxxxDxxxYxxx) -- you could even check the list at the start to make sure it doesn't contain the dummy word. Then you could just go through at the end (after rejoining the two vectors) and delete any rows that start with the dummy word.

I don't know if either of the above suggestions are feasible, but thought I would share my thoughts in case it gives you some ideas.

*Edited to Add: That second idea may not be so simple after all -- It could be made to work, I think, but many of the filtering actions would have to be modified to include logic that checks for and ignores the dummy word.

Pushed a basic --ignore-metadata option (Think "ignore" is a more descriptive verb than "keep" in this case).

1.) I found that it would be tricky to have it work with options that modify words, like "to lowercase". This is because of the duplication problem.

The,976213
the,230947

It's unclear which metadata to keep if user chooses to lowercase all words. This is true of all word modification options.

2.) I didn't figure out how to make keep-metadata work with options like remove prefix words and remove suffix words. Will have to think about it some more.

3.) List attributes (-A) should respect the metadata delimiter!

commented

Wow, thanks for making this happen so quick! I agree that "ignore" is a better descriptor, but you are running out of letters for flags, and -K was available... Will you be using my idea of letting the -T, -S and -C flags do double duty to specify which delimiter to use for the --ignore-metadata option?

It's unclear which metadata to keep if user chooses to lowercase all words. This is true of all word modification options.

My opinion is that for these types of cases (deduplication when duplication was caused by word modification), the user should either accept that the result will be undefined (i.e., which of the rows will be kept cannot be predicted), or you should always keep the entry that appears earliest on the original list. The latter would be preferable, but the former would be acceptable if it makes implementation easier.
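The "keep the earliest entry" rule can be sketched as below. This is only an illustration of the suggestion, assuming the word is everything before the first delimiter; `lowercase_keep_first` is a hypothetical name and Tidy does not necessarily behave this way.

```rust
use std::collections::HashSet;

/// Lowercase the word part of each row, then keep only the earliest row
/// for each resulting word. Metadata of later duplicates is discarded.
fn lowercase_keep_first(rows: &[&str], delimiter: char) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &row in rows {
        let (word, metadata) = match row.find(delimiter) {
            Some(i) => (&row[..i], &row[i..]),
            None => (row, ""),
        };
        let lowered = word.to_lowercase();
        // Only the first row that produces this lowercase word is kept.
        if seen.insert(lowered.clone()) {
            out.push(format!("{}{}", lowered, metadata));
        }
    }
    out
}

fn main() {
    let rows = ["The,976213", "the,230947"];
    println!("{:?}", lowercase_keep_first(&rows, ','));
}
```

With the two-row example from earlier in the thread, the row that started as "The" is the one whose metadata is retained.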

Will you be using my idea of letting the -T, -S and -C flags do double duty to specify which delimiter to use for the --ignore-metadata option?

Ha, I thought of this! -g actually accepts a string, allowing the user to define any delimiter they want. So for example, for the Wiki list you'd run tidy -O -m 3 -g ' ' enwiki-20210820-words-frequency.txt

I guess I should overhaul the -t -s -c flags to just be one that takes a delimiting string like -g does.

commented

So, where the help says -g, --ignore-metadata <IGNORE_METADATA>, should <IGNORE_METADATA> actually be <DELIMITER>?

I think some symmetry/redundancy/synergy is starting to emerge in the UI. You may or may not wish to do the extra retooling to make the following paradigm work, but there are actually three independent dimensions to the options that involve delimiters:

  1. Delimiting Character. You could keep separate flags for comma, space, tab (like you currently do with -C, -S, -T, etc.), or accept a user-defined string (like you currently do with -g); but a consistent method of specifying the delimiter would be best from a UX perspective.
  2. Action. Either discard (delete) all metadata on the non-word side of the delimiter, or ignore (and preserve) all metadata on the non-word side of the delimiter.
  3. Direction/Orientation. The wordlist word is either to the left of the first delimiter (current options -C, -S, -T, -g), or to the right of the first delimiter (current options -s, -t, and --delete-through-comma). This means that for completeness, you would also have to provide the option to ignore metadata values that appear before the first delimiter (or should it really be before the last delimiter, to achieve complete symmetry in the options?), to complement the current -s, -t, and --delete-through-comma options that always delete.

This means that you can have a single user option to specify the delimiting character, and then have optional flags to modulate the behavior. To reduce the number of option flags, you could consider implementing default behaviors, with flags to reverse the default. For example, including something like --delimit <DELIMITER> would by default delete everything except for what follows the first (or should it be last?) delimiter character. However, if the user also specifies a flag or option like --ignore-metadata, then the action will be ignore instead of delete; and if the user also specifies a flag or option like --word-first, then the direction will be reversed, so that everything except for what precedes the first delimiter will be treated as metadata. Alternatively, you can also provide options for the opposite choices (--delete-metadata and --word-last, respectively). You can probably come up with a more felicitous choice of names for these different flags/options, but the idea is to take advantage of the independent dimensions to have only 3-5 flags/options instead of 12.
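One way to picture the three independent dimensions is as a small data model. The sketch below is purely illustrative: the type and field names (`Action`, `WordSide`, `DelimiterRule`) are invented here and do not correspond to Tidy's options or internals.

```rust
/// What to do with the non-word side of the delimiter.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Delete,
    Ignore, // ignore during analysis, but preserve in the output
}

/// Which side of the first delimiter holds the word.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WordSide {
    BeforeDelimiter,
    AfterDelimiter,
}

struct DelimiterRule {
    delimiter: char,
    action: Action,
    word_side: WordSide,
}

/// Apply a rule to one line.
/// Returns (word used for analysis, line written to the output).
fn apply(rule: &DelimiterRule, line: &str) -> (String, String) {
    let word = match (line.split_once(rule.delimiter), rule.word_side) {
        (Some((before, _)), WordSide::BeforeDelimiter) => before,
        (Some((_, after)), WordSide::AfterDelimiter) => after,
        (None, _) => line, // no delimiter: the whole line is the word
    };
    let output = match rule.action {
        Action::Delete => word.to_string(),
        Action::Ignore => line.to_string(),
    };
    (word.to_string(), output)
}

fn main() {
    let rule = DelimiterRule {
        delimiter: ',',
        action: Action::Ignore,
        word_side: WordSide::BeforeDelimiter,
    };
    println!("{:?}", apply(&rule, "the,1,175029308"));
}
```

Each combination of delimiter, action, and side is then one value of `DelimiterRule`, rather than a separate flag.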

Yep, I was thinking the same thing today.

accept a user-defined string (like you currently do with -g)

Yep, I just consolidated all of those delete options to just two:

    -d, --delete-after <DELETE_AFTER_DELIMITER>
            Delete all characters after given delimiter (including the delimiter). Maximum of one
            character. Use 't' for tab and 's' for space

    -D, --delete-through <DELETE_THROUGH_DELIMITER>
            Delete all characters up to and including given delimiter. Maximum of one character. Use
            't' for tab and 's' for space

I like lowercase for forward, uppercase for backwards, as that's how some commands in Vim work. Note that I did introduce a new limit of one character max for delimiters. This makes the Rust easier to work with, and faster too. I think that's fine? When would there be a multi-character delimiter? Also, I like 's' and 't' as shortcuts for space and tab... I find it difficult to enter whitespace on the command line sometimes, especially a tab.
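As an illustration of the one-character limit and the 's'/'t' shortcuts, argument handling along these lines would be enough. This is a sketch under those stated rules, not Tidy's actual parsing code; `parse_delimiter` is a hypothetical name.

```rust
/// Translate a user-supplied delimiter argument into a single char.
/// 's' means space and 't' means tab; anything longer than one character is rejected.
fn parse_delimiter(arg: &str) -> Result<char, String> {
    let mut chars = arg.chars();
    match (chars.next(), chars.next()) {
        (Some('s'), None) => Ok(' '),
        (Some('t'), None) => Ok('\t'),
        (Some(c), None) => Ok(c),
        (None, _) => Err("delimiter must not be empty".to_string()),
        (Some(_), Some(_)) => Err(format!(
            "delimiter '{}' is longer than one character",
            arg
        )),
    }
}

fn main() {
    for arg in [",", "s", "t", "//"] {
        println!("{:?} -> {:?}", arg, parse_delimiter(arg));
    }
}
```

One side effect of this scheme, at least in this sketch, is that a literal 's' or 't' cannot itself be used as a delimiter.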

Next, I'll work on a -G flag that ignores metadata before the word. (EDIT: Meant -G here.)

So, where the help says -g, --ignore-metadata <IGNORE_METADATA>, should <IGNORE_METADATA> actually be <DELIMITER>?

Whoops -- fixed.

commented

Yes, I like what you have so far (-d vs. -D for the two versions of delete). So will the two versions of ignore be -k and -K (or -g and -G)?

I think a single-character delimiter would be fine in general. Would not be able to deal with something like // or /* for word lists that include comments (such word lists may not even exist). However, what happens currently if there are repeats of the single-character delimiter? For example, I found this wordlist, which includes two spaces between the dice roll and the word. Do you treat repeated delimiters as if it was a single delimiter, or would the extra space character become part of the word (when using -D 's')? If you don't treat repeated delimiters as if they were a single delimiter, then this would be even more reason to re-define the backwards actions (-D and -K) to delete/ignore up to and including the last appearance of the delimiter (a notion I'm partial to anyway, to create symmetry with the forward actions -d and -k).

So will the two versions of ignore be -k and -K (or -g and -G)?

Yes. And I'm switching to -g and -G for "ignore". That -K reference above was a mistake on my part.

However, what happens currently if there are repeats of the single-character delimiter?

A bit by chance, but Tidy splits on the first space, then trims any leading whitespace on the word, which eliminates the second space in this case. However, we wouldn't be so lucky if the delimiter were two commas or two of some other non-whitespace character, given the current set up.
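The difference between the two cases can be reproduced with a short sketch that follows the description above: split on the first delimiter, then trim leading whitespace. The function name and the sample rows are made up for illustration; this is not Tidy's source.

```rust
/// Delete everything through the first `delimiter`, then trim leading whitespace
/// from what remains.
fn delete_through_first(line: &str, delimiter: char) -> &str {
    match line.split_once(delimiter) {
        Some((_, rest)) => rest.trim_start(),
        None => line,
    }
}

fn main() {
    // Two spaces between the dice roll and the word: the extra space is trimmed away.
    println!("{:?}", delete_through_first("11111  abacus", ' '));
    // Two commas: the second comma is not whitespace, so it stays attached to the word.
    println!("{:?}", delete_through_first("11111,,abacus", ','));
}
```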

If you don't treat repeated delimiters as if they were a single delimiter, then this would be even more reason to re-define the backwards actions (-D and -K) to delete/ignore up to and including the last appearance of the delimiter (a notion I'm partial to anyway, to create symmetry with the forward actions -d and -k).

Oof, I'll have to think more on this some other time.

Phew, think I got -G working!

Here's my quick attempt at the help text. Not sure if starting/ending is the way to go...

    -g, --ignore-ending-metadata <IGNORE_ENDING_METADATA_DELIMITER>
            Ignore metadata after first appearance of given delimiter. Accepts delimiter character
            like ','. Maximum of one character. Use 't' for tab and 's' for space. Treats anything
            before first appearance of delimiter as the "word". Only works with word removals, not
            word modifications (like to lowercase)

    -G, --ignore-starting-metadata <IGNORE_STARTING_METADATA_DELIMITER>
            Ignore metadata before first appearance of given delimiter. Accepts delimiter character
            like ','. Maximum of one character. Use 't' for tab and 's' for space. Treats anything
            after first appearance of delimiter as the "word". Only works with word removals, not
            word modifications (like to lowercase)

Need to do some more testing, maybe tomorrow.

commented

Oof, I'll have to think more on this some other time.

So, currently, repeated whitespace delimiters are in effect treated as a single delimiter, but repeated non-whitespace delimiters are not, correct? In that case, I think it would be OK just to stick with your current paradigm of splitting on the first delimiter (whether working forwards or backwards). Making a symmetric implementation (delete/ignore before first vs. after last) is less essential if repeated whitespace is automatically trimmed; the multiple-commas scenario is an edge case.

Not sure if starting/ending is the way to go

My suggestion for the option names and help text would be as follows:

    -d, --delete-from <DELETE_FROM_DELIMITER>
            Delete all characters from the first instance of the specified delimiter until the end of line
            (including the delimiter). Delimiter must be a single character (e.g., ','). Use 't' for tab and
            's' for space. May not be used together with -D, -g or -G options.

    -D, --delete-through <DELETE_THROUGH_DELIMITER>
            Delete all characters up to and including the first instance of the specified delimiter. Delimiter
            must be a single character (e.g., ','). Use 't' for tab and 's' for space. May not be used 
            together with -d, -g or -G options.

    -g, --ignore-from <IGNORE_FROM_DELIMITER>
            Ignore metadata from the first instance of the specified delimiter until the end of line, treating
            anything before the delimiter as a word. Delimiter must be a single character (e.g., ','). Use 't'
            for tab and 's' for space. Works with attribute analysis and word removals, but not with word
            modifications (like to lowercase). May not be used together with -d, -D or -G options.

    -G, --ignore-through <IGNORE_THROUGH_DELIMITER>
            Ignore metadata up to and including the first instance of the specified delimiter, treating
            anything after the delimiter as a word. Delimiter must be a single character (e.g., ','). Use 't'
            for tab and 's' for space. Works with attribute analysis and word removals, but not with word
            modifications (like to lowercase). May not be used together with -d, -D or -g options.

What happens when these options are used together with word modification options? Are those incompatible options just ignored, or does it produce incorrect results (or worse, crash)?

Let me know when you think the latest version is sufficiently stable that I should update the Windows binary.

So, currently, repeated whitespace delimiters are in effect treated as a single delimiter, but repeated non-whitespace delimiters are not, correct? In that case, I think it would be OK just to stick with your current paradigm of splitting on the first delimiter (whether working forwards or backwards). Making a symmetric implementation (delete/ignore before first vs. after last) is less essential if repeated whitespace is automatically trimmed; the multiple-commas scenario is an edge case.

Agreed. Keeping it where it is for now.

My suggestion for the option names and help text would be as follows:...

This looks good. I've made these changes with only a minor tweak. I don't love the from/through language, but I don't have an alternative that I like better at the moment.

What happens when these options are used together with word modification options? Are those incompatible options just ignored, or does it produce incorrect results (or worse, crash)?

Technically it crashes? It's a panic! call, so Rust ensures that there's no real computer issues. Plus, Rust's panic! macro allows you to write a message, which I did. Thus, here's what happens if you try to use -G and -g:

❯ tidy -G , -g s --samples l/delimit_test.txt 
thread 'main' panicked at 'Can't ignore metadata on both sides currently', src/lib.rs:169:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

A similar message is displayed if the user tries to use one of the -g commands with an incompatible modification option:

❯ tidy -G , -l l/delimit_test.txt 
thread 'main' panicked at '--ignore-starting-metadata option does not work with one of the other options you selected. Please reconsider. Exiting', src/main.rs:281:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I can do some work to make the message display a bit cleaner/prettier and/or re-write it, but I think it's fine for now. Nothing catastrophic happens.

I will note that users, thankfully, can use -D and -d together. For example:

mA1,word1 mB1
mA2,word2 mB2
mA3,word3 mB3

We can run tidy -D , -d s l/delimit_test.txt and we get:

word1
word2
word3

I think this is a feasible use-case, so glad it works as expected.

Let me know when you think the latest version is sufficiently stable that I should update the Windows binary.

I'm not sure! I think we've got the -G and -g options in a good place, but this was a lot of changes in a short period of time, so we'll likely catch things in the next few days. Maybe hold for a bit?

commented

I don't love the from/through language, but I don't have an alternative that I like better at the moment.

"before/after"? "left/right"? "leading/trailing"?

We can run tidy -D , -d s l/delimit_test.txt and we get:

Interesting! What would happen for data of the form A,B or A,B,C if you try tidy -D , -d ,?

Maybe hold for a bit?

Sure. I might make one for my own use in the meantime.

"before/after"?

Yeah, I think that makes more sense to me. I've made the (tedious) change.

Interesting! What would happen for data of the form A,B or A,B,C if you try tidy -D , -d ,?

Huh, this is tricky. With that command, A,B,C becomes B (I fixed the order of delete calls to ensure this logical outcome). A,B becomes B -- not sure if that's good or bad!
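The effect of the call order can be checked with a small sketch. It assumes both deletions act on the first appearance of their delimiter, as discussed above; the function names are hypothetical and this is not the code in Tidy.

```rust
/// Delete everything up to and including the first `delimiter`.
fn delete_through(line: &str, delimiter: char) -> &str {
    match line.split_once(delimiter) {
        Some((_, after)) => after,
        None => line,
    }
}

/// Delete the first `delimiter` and everything after it.
fn delete_after(line: &str, delimiter: char) -> &str {
    match line.split_once(delimiter) {
        Some((before, _)) => before,
        None => line,
    }
}

/// Apply the "through" deletion first, then the "after" deletion.
fn delete_both(line: &str, through: char, after: char) -> &str {
    delete_after(delete_through(line, through), after)
}

fn main() {
    println!("{}", delete_both("A,B,C", ',', ','));         // through first, then after
    println!("{}", delete_both("A,B", ',', ','));
    println!("{}", delete_both("mA1,word1 mB1", ',', ' '));
    // Reversed order, for comparison: the "after" deletion runs first.
    println!("{}", delete_through(delete_after("A,B,C", ','), ','));
}
```

In this sketch, running the "after" deletion first would leave only A for the input A,B,C, because no comma remains for the second step to act on.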

FYI I fixed two bugs this morning. The important one relates to the fact that I realized that we have to do word modifications BEFORE we do word removals. This is because it's possible for a word to be modified such that, for example, it dips below a user-requested word-length minimum. If we did removals and then modifications, as I had in the last few versions, a word below the requested word-length minimum could make it into the final list.
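A toy example of why the order matters is sketched below. It uses a made-up modification (stripping ASCII digits) and a minimum-length removal; the function names are hypothetical and the real option set in Tidy is larger.

```rust
/// Example modification: strip ASCII digits from a word.
fn strip_digits(word: &str) -> String {
    word.chars().filter(|c| !c.is_ascii_digit()).collect()
}

/// Correct order: modify every word first, then apply the length filter.
fn modify_then_remove(words: &[&str], min_len: usize) -> Vec<String> {
    words
        .iter()
        .map(|w| strip_digits(w))
        .filter(|w| w.chars().count() >= min_len)
        .collect()
}

/// Buggy order: the length filter sees the unmodified word.
fn remove_then_modify(words: &[&str], min_len: usize) -> Vec<String> {
    words
        .iter()
        .filter(|w| w.chars().count() >= min_len)
        .map(|w| strip_digits(w))
        .collect()
}

fn main() {
    let words = ["ab12", "word7", "hello"];
    println!("{:?}", modify_then_remove(&words, 3));
    println!("{:?}", remove_then_modify(&words, 3));
}
```

In the second ordering, "ab12" passes the length check with four characters and then shrinks to "ab", which is below the requested minimum of three.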

The second bug was a dumb coding mistake with one of the -g flags.

Both of these bugs are (hopefully) fixed in version 0.2.51 (you can always check the version number in line 3 of the Cargo.toml file). Would obviously recommend you not create a Windows binary of versions earlier than 0.2.51.

commented

Thanks for the update.

A,B becomes B -- not sure if that's good or bad!

I think that's fine. There is only one delimiter, so the onus is on the user not to request two delimiter-based deletions at once, or to accept that only a single deletion will be made. Given that left-deletions must be done before right-deletions (to handle the A,B,C case), it is logical that only B would remain if the data contains A,B.

Would obviously recommend you not create a Windows binary of versions earlier than 0.2.51.

Makes sense. Like I said, I may compile something for my own use, but if you let me know when you think you have something sufficiently stable for public (pre-)release, I am happy to upload another Windows binary that you can host here.

I think we can close this one? Great work!

commented

Yes, a tour de force on your part, thanks for putting so much work into these ideas!