Sub-optimal regex containing unnecessary repetitions
waldyrious opened this issue
Waldir Pimenta commented
While building a regex for the various possible formats of Creative Commons' Public Domain Mark (to assist in spdx/license-list-XML#988), I noticed that regexgen produces a more complex regex than the input requires.
Here's what I provided:
> regexgen([
"This work is free of known copyright restrictions.",
"This work (WWW) is free of known copyright restrictions.",
"This work (by AAA) is free of known copyright restrictions.",
"This work, identified by CCC, is free of known copyright restrictions.",
"This work (WWW, by AAA) is free of known copyright restrictions.",
"This work (WWW), identified by CCC, is free of known copyright restrictions.",
"This work (WWW, by AAA), identified by CCC, is free of known copyright restrictions.",
"This work (by AAA), identified by CCC, is free of known copyright restrictions."
]);
The result was:
/This work(?: (?:\((?:WWW(?:, by AAA)?\)(?:, identified by CCC,)?|by AAA\)(?:, identified by CCC,)?) )?|, identified by CCC, )is free of known copyright restrictions\./
[Debuggex screenshot of the generated regex]
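A regex written by hand to match the same inputs shows that this could be simplified. The generated regex contains the ", identified by CCC," fragment three times, while the hand-written one needs it only once: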
/This work(?: \((?:WWW(?:, by AAA)?|by AAA)\))?(?:, identified by CCC,)? is free of known copyright restrictions\./
[Debuggex diagram of the hand-written regex]
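For anyone who wants to confirm that the two regexes accept the same inputs, here is a minimal sketch in plain JavaScript. It does not call regexgen; both regexes are copied verbatim from this report. The matchesFully helper and the extra negative sample are assumptions added purely for illustration.

```javascript
// The eight input strings from the report.
const inputs = [
  "This work is free of known copyright restrictions.",
  "This work (WWW) is free of known copyright restrictions.",
  "This work (by AAA) is free of known copyright restrictions.",
  "This work, identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA) is free of known copyright restrictions.",
  "This work (WWW), identified by CCC, is free of known copyright restrictions.",
  "This work (WWW, by AAA), identified by CCC, is free of known copyright restrictions.",
  "This work (by AAA), identified by CCC, is free of known copyright restrictions."
];

// Regex returned by regexgen, copied from the report.
const generated = /This work(?: (?:\((?:WWW(?:, by AAA)?\)(?:, identified by CCC,)?|by AAA\)(?:, identified by CCC,)?) )?|, identified by CCC, )is free of known copyright restrictions\./;

// Hand-written regex, copied from the report.
const handWritten = /This work(?: \((?:WWW(?:, by AAA)?|by AAA)\))?(?:, identified by CCC,)? is free of known copyright restrictions\./;

// Hypothetical helper (not part of regexgen): true only if the regex
// matches the entire string rather than just a substring of it.
function matchesFully(regex, text) {
  return new RegExp(`^(?:${regex.source})$`).test(text);
}

// Hypothetical negative sample: a string that is not in the input set.
const notInSet = "This work (CCC) is free of known copyright restrictions.";

// Print, for each sample, whether each regex accepts it.
for (const text of [...inputs, notInSet]) {
  console.log(
    matchesFully(generated, text),
    matchesFully(handWritten, text),
    JSON.stringify(text)
  );
}
```

Agreement on these samples only shows that the two regexes behave alike on the strings tested. It is a sanity check, not a proof that they describe the same language.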