devongovett / regexgen

Generate regular expressions that match a set of strings

Home Page: https://runkit.com/npm/regexgen

Sub-optimal regex containing unnecessary repetitions

waldyrious opened this issue

While building a regex for the various possible formats of Creative Commons' Public Domain Mark (to assist in spdx/license-list-XML#988), I noticed that regexgen produces a more complex regex than the input requires.

Here's what I provided:

> regexgen([
  "This work is free of known copyright restrictions.",
  "This work (WWW) is free of known copyright restrictions.",
  "This work (by AAA) is free of known copyright restrictions.",
  "This work, identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA) is free of known copyright restrictions.",
  "This work (WWW), identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA), identified by CCC, is free of known copyright restrictions.",
  "This work (by AAA), identified by CCC, is free of known copyright restrictions."
]);

The result was:

/This work(?: (?:\((?:WWW(?:, by AAA)?\)(?:, identified by CCC,)?|by AAA\)(?:, identified by CCC,)?) )?|, identified by CCC, )is free of known copyright restrictions\./

[Image: Debuggex screenshot of the generated regex]

A regex written by hand to match the same inputs shows that this can be simplified, since the optional ", identified by CCC," group only needs to appear once:

/This work(?: \((?:WWW(?:, by AAA)?|by AAA)\))?(?:, identified by CCC,)? is free of known copyright restrictions\./

[Image: Debuggex diagram of the hand-written regex]
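
For anyone who wants to confirm that the two patterns accept the same eight inputs, here is a minimal sketch in plain JavaScript. It does not call regexgen; both regexes are copied from this issue, and the helper names `anchored` and `misses` are made up for illustration. It only checks the listed inputs, so it is a sanity check rather than a proof that the two patterns are equivalent.

```javascript
// The eight inputs from the report above.
const inputs = [
  "This work is free of known copyright restrictions.",
  "This work (WWW) is free of known copyright restrictions.",
  "This work (by AAA) is free of known copyright restrictions.",
  "This work, identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA) is free of known copyright restrictions.",
  "This work (WWW), identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA), identified by CCC, is free of known copyright restrictions.",
  "This work (by AAA), identified by CCC, is free of known copyright restrictions."
];

// The regex reported as regexgen's output (copied from this issue).
const generated = /This work(?: (?:\((?:WWW(?:, by AAA)?\)(?:, identified by CCC,)?|by AAA\)(?:, identified by CCC,)?) )?|, identified by CCC, )is free of known copyright restrictions\./;

// The hand-written alternative (copied from this issue).
const handwritten = /This work(?: \((?:WWW(?:, by AAA)?|by AAA)\))?(?:, identified by CCC,)? is free of known copyright restrictions\./;

// Hypothetical helper: wrap a regex so it must match the whole string.
function anchored(re) {
  return new RegExp(`^(?:${re.source})$`, re.flags);
}

// Hypothetical helper: list the strings a regex fails to match in full.
function misses(re, strings) {
  const full = anchored(re);
  return strings.filter((s) => !full.test(s));
}

console.log("generated misses:", misses(generated, inputs));
console.log("handwritten misses:", misses(handwritten, inputs));
console.log("pattern lengths:", generated.source.length, handwritten.source.length);
```

An empty "misses" list for both patterns means each one matches every input in full; the length comparison gives a rough measure of how much the repeated group costs.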