devongovett / regexgen

Generate regular expressions that match a set of strings

Home Page: https://runkit.com/npm/regexgen

Different output result with different word order

templarundead opened this issue

Example 1
input: ghost frost pos
output: ghost|frost|pos
expected output: (?:(?:fr|gh)ost|pos)

Example 2
input: pos ghost frost
output: (?:gh|fr)ost|pos
expected output: (?:(?:fr|gh)ost|pos)
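
The outputs above differ in length but should still accept the same strings. A minimal sketch for spot-checking that, assuming each pattern is meant to match a whole word; `fullMatch` and `agreeOnSamples` are hypothetical helpers written for this illustration and are not part of regexgen. It is a sample-based check, not a proof of equivalence.

```javascript
// Anchor the pattern so the entire string has to match, not just a substring.
function fullMatch(source, text) {
  return new RegExp(`^(?:${source})$`).test(text);
}

// True when every pattern gives the same answer for every sample string.
function agreeOnSamples(patterns, samples) {
  return samples.every((text) => {
    const results = patterns.map((p) => fullMatch(p, text));
    return results.every((r) => r === results[0]);
  });
}

// The three patterns quoted in the examples.
const patterns = [
  "ghost|frost|pos",
  "(?:gh|fr)ost|pos",
  "(?:(?:fr|gh)ost|pos)",
];

// The input words plus a few near misses.
const samples = ["ghost", "frost", "pos", "post", "host", "ghostfrost", ""];

console.log(agreeOnSamples(patterns, samples)); // true
```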

Hmm… So let’s sort the input words by length ([...word].length) and then lexicographically?
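
A rough sketch of that pre-sort, so that any input order produces the same word list before it is handed to the generator. `sortWords` is a hypothetical helper name; the tie-break compares UTF-16 code units, which is an assumption about what "lexicographically" should mean here.

```javascript
// Sort by length in code points ([...word].length), then lexicographically.
// Returns a new array; the input is left untouched.
function sortWords(words) {
  return [...words].sort((a, b) => {
    const byLength = [...a].length - [...b].length;
    if (byLength !== 0) return byLength;
    return a < b ? -1 : a > b ? 1 : 0;
  });
}

// Both input orders from the examples normalize to the same list.
console.log(sortWords(["ghost", "frost", "pos"]).join(" ")); // pos frost ghost
console.log(sortWords(["pos", "ghost", "frost"]).join(" ")); // pos frost ghost
```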

That happens to work for ghost frost pos, but with my set of 100+ junk words, sorting from longest to shortest and then Z-A happens to generate the shorter regex, though not by much: 594 chars vs. 608 with ascending length and A-Z. I also tried reversing the characters of each word when sorting A-Z and Z-A, but this made no difference.
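
Since no single ordering seems to win for every word set, one option is to try a few and keep the shortest result. A minimal sketch under stated assumptions: `generate` is a hypothetical callback that takes an array of words and returns the pattern source as a string, and `orderings` and `shortestPattern` are names made up for this example. This is a heuristic and does not guarantee the shortest possible pattern.

```javascript
const codePoints = (s) => [...s].length;
const lex = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// The two orderings compared in the thread.
const orderings = {
  "length asc, A-Z": (a, b) => codePoints(a) - codePoints(b) || lex(a, b),
  "length desc, Z-A": (a, b) => codePoints(b) - codePoints(a) || lex(b, a),
};

// Run `generate` once per ordering and keep the shortest pattern.
// On a tie, the ordering listed first wins.
function shortestPattern(words, generate) {
  let best = null;
  for (const [name, compare] of Object.entries(orderings)) {
    const pattern = generate([...words].sort(compare));
    if (best === null || pattern.length < best.pattern.length) {
      best = { name, pattern };
    }
  }
  return best;
}

// Stand-in generator for demonstration only: a plain alternation.
// In practice this would be a wrapper around the real generator.
const naiveAlternation = (words) => words.join("|");

console.log(shortestPattern(["ghost", "frost", "pos"], naiveAlternation));
```

With the plain alternation every ordering yields the same length, so the first ordering is reported; the comparison only becomes interesting once `generate` is order-sensitive.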

An interesting optimization problem. It would help if I actually looked at what regexgen was doing under the hood 😛