prefer-predefined-assertion does not recognize `(\.(?!$)|$))`

Question

prefer-predefined-assertion does not recognize `(\.(?!$)|$))`

kurtextrem opened this issue 9 months ago · comments

Jacob Groß commented 9 months ago

Information:
I used the playground to test (so I assume all latest?)

Description
Coming from this SO post: stackoverflow.com/a/36760050
The regex

^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(\.(?!$)|$)){4}$

got optimized (by hand) to:

^((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}$

so, partbefore(\.(?!$)|$)) can be expressed as (partbefore\.?\b), avoiding the negative lookahead, making the regex both smaller and faster: esbench.com/bench/65326a147ff73700a4debb7d.

I assume prefer-predefined-assertion is meant to capture this, but it does not:
https://ota-meshi.github.io/eslint-plugin-regexp/playground/#eJwVydEKgjAUxvFXOR28cNBail60CB+hiy6dgbRjDHSOaVI5370FH/x/8K34GDWhRCHg6mYzmC9BazXczOB6032UVXZpPXiCC4C4p5WMy8v6yMsm/BlVNCFTSoc646eGVZEsPkod0mqXsJAwthZbIpYz7tG/eppQrujpSW8nnKeOPI/R1BlLmrfTRH42o0WZb9sP02szMA==

(by the way, if you turn on all rules, it errors for that regex: )

Yosuke Ota · Answer 1 · Fri Oct 20 2023 21:01:42 GMT+0800 (China Standard Time)

Thank you for posting this issue!

However, I don't think partbefore(\.(?!$)|$) and (partbefore\.?\b) are equivalent.
The following script returns a different result.

const s = 'partbefore.'
console.log(/partbefore(\.(?!$)|$)/.test(s), /(partbefore\.?\b)/.test(s)) // false true

I think there probably needs to be a $ after the quantifier, but I haven't yet figured out how that can be optimized.

Jacob Groß · Answer 2 · Fri Oct 20 2023 21:13:02 GMT+0800 (China Standard Time)

Yeah, looks like this is equivalent:
console.log(/^partbefore(\.(?!$)|$)$/.test(s), /^(partbefore\.?\b)$/.test(s))

-> partbefore(\.(?!$)|$))$ can be expressed as partbefore\.?\b$

Thank you for taking care already regarding the issue!

Michael Schmidt · Answer 3 · Sat Oct 21 2023 00:31:16 GMT+0800 (China Standard Time)

partbefore(\.(?!$)|$)$ can be expressed as partbefore\.?\b$

Those 2 equivalent, but probably not in the way you think they are. Here's how we can simplify partbefore(\.(?!$)|$)$ and /^(partbefore\.?\b)$/:

  /^partbefore(\.(?!$)|$)$/;
= /^partbefore(\.(?!$)$|$$)/; // move the last $ inside the group
= /^partbefore(\.(?!$)$|$)/; // $$ == $
= /^partbefore($)/; // (?!$)$ will always reject, so we can remove the whole branch
= /^partbefore$/; // remove the group

  /^(partbefore\.?\b)$/;
= /^partbefore\.?\b$/; // remove the group
= /^partbefore(\.|)\b$/; // replace \.? with an equivalent group
= /^partbefore(\.\b|\b)$/; // move \b inside the group
= /^partbefore(\.(?=\w)|\b)$/; // Since \. is a non-word character, `\b` is will enforce a word character after it
= /^partbefore(\.(?=\w)|(?!\w))$/; // Since "e" is a word character, `\b` is will enforce a non-word character after it
= /^partbefore(\.(?=\w)$|(?!\w)$)/; // Move $ inside the group
= /^partbefore(\.(?=\w)$|$)/; // (?!\w)$ == $
= /^partbefore($)/; // (?=\w)$ will always reject
= /^partbefore$/; // remove the group

So both e(\.(?!$)|$)$ and e\.\b$ are equivalent to e$. So they are equivalent, but the neither includes the \..

The only place where (\.(?!$)|$) -> \.?\b is correct is if the expression is in a quantifier, is preceded by a word character, and then followed by $. The simplest example for this would be (a(\.(?!$)|$))*$ -> (a\.?\b)*$. Those are very narrow circumstances, so I don't know whether we should include this.

Don't get me wrong. Adding this to the rule would be relatively easy, but I don't know whether this makes the regex more understandable/maintainable. Since \.?\b relies upon the last $ for correctness, removing the last $ or adding anything before would break the regex.

so, partbefore(\.(?!$)|$)) can be expressed as (partbefore\.?\b), avoiding the negative lookahead, making the regex both smaller and faster: esbench.com/bench/65326a147ff73700a4debb7d.

partbefore(\.(?!$)|$)) is faster on my machine.

Also, if you want the best performance, then consider inlining the quantifier to get rid of the assertions completely: /^(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)(?:\.(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)){3}$/

Michael Schmidt · Answer 4 · Sat Oct 21 2023 00:43:40 GMT+0800 (China Standard Time)

The craziest thing I've seen all day is that eslint-plugin-regex can actually transform /^partbefore\.?\b$/ into /^partbefore$/. Kinda. It first detects that \.? can be removed:

And after following the suggestion, it detects that \b can be removed:

It uses different reasoning, but I did not expect it to get it at all.

Jacob Groß · Answer 5 · Sat Oct 21 2023 16:08:41 GMT+0800 (China Standard Time)

@RunDevelopment I get the same result now too:

after updating to Chrome 120 (V8 12.0.114)

Pretty curious, thank you for the hint - and I highly appreciate the in-depth explanation for the regex deduction!!

So both e(\.(?!$)|$)$ and e\.\b$ are equivalent to e$

So just checking if I could follow your explanation - the only reason why \.b\b$ or the negative lookahead is needed, is because of the quantifier ({4}), which you've removed subsequently to increase the performance of the regex?

Jacob Groß · Answer 6 · Sat Oct 21 2023 16:12:15 GMT+0800 (China Standard Time)

That all being said, thank you @ota-meshi @RunDevelopment for fixing the two things quickly. From my side, we can close this thread too. Maintain-/readability and higher performance are good reasons for me to avoid optimizing the regex so to a smaller one.

One last thing: I wonder if it'd make sense to create a rule that aims for highest runtime perf of the regex, so that e.g.

^(?:(?:25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(?:\.(?!$)|$)){4}$ 
->
/^(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)(?:\.(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)){3}$/

in that case it makes the regex more readable (I guess that is subjective?) and faster too.

Michael Schmidt · Answer 7 · Sat Oct 21 2023 22:04:30 GMT+0800 (China Standard Time)

One last thing: I wonder if it'd make sense to create a rule that aims for highest runtime perf of the regex, so that e.g.
^(?:(?:25[0-5]|(2[0-4]|1\d|[1-9]|)\d)(?:\.(?!$)|$)){4}$ 
->
/^(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)(?:\.(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)){3}$/

This will be difficult.

I can say this with some certainty because I've only recently added this transformation (and similar ones) to refa (my regex, NFA, and DFA library). However, refa uses its own simplified AST format, which makes implementing these transformations a lot easier. If you're interested, here's the code.

One of the main difficulties for this will be regexpp's AST (regexpp = the regex parser we use). Its AST is essentially readonly for the purposes of this project. So we would most likely have to invent our AST format that allows for easy mutations. Mutations are necessary, because it is practically impossible to do these optimizations in a single step. E.g. here are the steps required to optimize the IP regex:

// I will use "byte" instead of the long "(?:25[0-5]|(?:2[0-4]|1\d|[1-9]|)\d)"
/^(byte(\.(?!$)|$)){4}$/;
// rewrite a quantfier "(AB){n}" as "A(BA){n-1}B"
/^byte((\.(?!$)|$)byte){3}(\.(?!$)|$)$/; 
// remove the "|$" that will always reject
/^byte(\.(?!$)byte){3}(\.(?!$)|$)$/; 
// remove the "\.(?!$)|" that will always reject
/^byte(\.(?!$)byte){3}$$/; 
// remove the double "$"
/^byte(\.(?!$)byte){3}$/; 
// remove the "(?!$)" that will always accept
/^byte(\.byte){3}$/;

It's possible to do all of this programmatically, but it will require a lot of work since we would have to write an optimizer from scratch.

in that case it makes the regex more readable (I guess that is subjective?) and faster too.

In this case. Unfortunately, there are a few scenarios where the regex blows up when we try to remove assertions. This mostly happens for regexes of the form (a|b|c(?=d))+ where d overlaps with a|b|c. It's also difficult to predict which regexes will get simpler after applying assertions. E.g. in the above example, we first need to grow the regex (making it more complex) to simplify it. But it is not guaranteed that we can effectively simply a regex after growing it. E.g. in some of my tests for refa, I've seen regexes grow to 8x their original size to remove a few assertions. This is good for what refa is doing (NFAs can't handle assertions, so removing assertion is very important), but it's not useful here.

the only reason why \.b\b$ or the negative lookahead is needed, is because of the quantifier ({4}), which you've removed subsequently to increase the performance of the regex?

Yes. See above for more details. Generally, v8's regex engine seems to be pretty slow when it some to assertions (likely because assertions prevent certain optimizations), so rewriting regexes to avoid them often helps performance. Same for backreferences.

From my side, we can close this thread too.

👍