optimal-quantifier-concatenation: False positive for /^(a*).*/u

iliubinskii opened this issue 2 years ago · comments

Information:

ESLint version: 8.20.0
eslint-plugin-regexp version: 1.7.0

Description

I think this rule should report /^a*.*/u, but not /^(a*).*/u.
If I replace /^(a*).*/u with /^.*/u, I will no longer be able to use the capture group.
At the very least, there could be an option to suppress such warnings.
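To make the complaint concrete, here is a minimal sketch (the variable names are illustrative, not from the report) comparing the reported regex with the simplified form. Both match the same text, but only the first one exposes the leading run of "a"s through a capture group.

```javascript
// The group in /^(a*).*/u records the leading run of "a" characters.
const withGroup = /^(a*).*/u.exec("aab");

// The simplified regex matches the same text but has no group to read.
const withoutGroup = /^.*/u.exec("aab");

console.log(withGroup[0]);        // whole match
console.log(withGroup[1]);        // captured prefix of "a"s
console.log(withoutGroup[0]);     // same whole match
console.log(withoutGroup.length); // only index 0 exists, so there is no group 1
```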

Ilia Liubinskii commented 2 years ago

Thx

Michael Schmidt · Answer 1 · Sun Jul 24 2022 03:51:00 GMT+0800 (China Standard Time)

Thanks for the report!

I'm not sure what the best solution is. We already detect whether capturing groups are involved, so it would be easy to add an option to suppress these kinds of warnings, but I'm not sure whether we should.

When I ran this rule on a large code base of regexes (PrismJS), I found several bugs related to capturing groups and only 2 false positives IIRC. In a lot of cases, this kind of pattern is a bug. E.g. the similar pattern /^(a*).+/u probably doesn't behave correctly.

Alternatively, we could add an exception for this kind of pattern. If a regex ends with (A{n,m})B* where A ⊂ B, then we probably don't have to report it. This pattern doesn't have the usual problems (quadratic backtracking, dead code, unclear intentions).

@ilyub Did this rule produce false positives for other regexes too? If so, could you please provide a few examples?

Ilia Liubinskii · Answer 2 · Sun Jul 24 2022 04:53:55 GMT+0800 (China Standard Time)

Thx for response.

> Did this rule produce false positives for other regexes too?

So far, I have a false positive in one place, which can be simplified to something like this:

const firstWord = str.replace(/^(\w+).*/u, "$1");
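A small sketch of how that replacement behaves, wrapped in a hypothetical helper (the function name and sample inputs are assumptions added for illustration). The capture group keeps the first word and the trailing .* consumes the rest of the line so it can be dropped.

```javascript
// Hypothetical wrapper around the replacement from the report.
function firstWordOf(str) {
  // (\w+) captures the leading word; .* swallows the rest of the line.
  return str.replace(/^(\w+).*/u, "$1");
}

console.log(firstWordOf("hello world"));   // keeps only the leading word
console.log(firstWordOf("single"));        // a lone word is returned as-is
console.log(firstWordOf("  padded text")); // no word at position 0, so nothing is replaced
```

Removing the group, as the rule's suggestion would imply, leaves nothing for "$1" to refer to, which is why the report is a false positive here.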

> I found several bugs related to capturing groups and only 2 false positives IIRC

The regexp plugin has a "no-unused-capturing-group" rule. So, if the bugs are related to unnecessary capturing groups, they should first be reported by "no-unused-capturing-group", and after the unnecessary capture is removed they can be reported by "optimal-quantifier-concatenation". Would this be helpful?

Michael Schmidt · Answer 3 · Sun Jul 24 2022 23:00:51 GMT+0800 (China Standard Time)

> So, if bugs are related to unnecessary capturing groups

They weren't. The capturing groups were necessary, but they probably didn't work correctly. E.g. the intention for /^(a*).+/u was most likely "capture all a's at the start of a non-empty line," but that is not what this regex does (the edge case is the input string a). This type of error is somewhat common. As I said, I found several such bugs. (I'm sorry that I can't provide concrete examples for this. I made PRs for these bugs years ago, but I just can't find them right now.)
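The edge case can be reproduced with a short sketch. The helper name and the lookahead variant are assumptions added for illustration; the lookahead is only one possible way to express "non-empty line" without taking characters away from the group, not a fix proposed in this thread.

```javascript
// Returns the text of group 1, or null when the regex does not match.
function capturedPrefix(regex, input) {
  const match = regex.exec(input);
  return match ? match[1] : null;
}

// .+ needs at least one character, so a* must give one back when the
// input consists only of "a"s.
const greedyTail = /^(a*).+/u;
console.log(capturedPrefix(greedyTail, "a"));   // the group loses the only "a"
console.log(capturedPrefix(greedyTail, "aa"));  // the group loses the last "a"
console.log(capturedPrefix(greedyTail, "aab")); // "b" satisfies .+, so nothing is lost

// Illustrative alternative: assert non-emptiness with a lookahead instead.
const lookahead = /^(?=.)(a*)/u;
console.log(capturedPrefix(lookahead, "a"));
```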

So the optimal-quantifier-concatenation rule reporting concatenations with capturing groups can be very helpful, because it finds this type of error.

However, this doesn't apply to your specific regex. /^(\w+).*/u is perfectly correct. That's why I'm so unsure about the best solution for this issue. This rule reporting concatenations with capturing groups is very useful, but I also don't want users to turn off the rule because it produces too many false positives.

Ilia Liubinskii · Answer 4 · Mon Jul 25 2022 00:07:47 GMT+0800 (China Standard Time)

You are right that ".+" is likely to be a mistake in the regular expression /^(a*).+/u (edge case "a").
But it is also likely to be a mistake in the regular expression /^(abc|xyz).+/u (edge case "abc").

The second regular expression is not reported.
So the problem you found is not directly related to inoptimal concatenation.
It is related to the ".+" tail.

In fact, you found this problem because a false positive forced you to revisit and recheck old code.
Following this logic, you would need to produce a false positive for /^(abc|xyz).+/u as well (the more false positives, the more places will be revisited).

IMHO:
I would not mix two different purposes in one rule.
I.e. I would leave "optimal-quantifier-concatenation" only for what it says (inoptimal concatenation) and would probably write a separate rule to detect potential errors (like a ".+" tail).

In any case, thx for the useful plugin.

Michael Schmidt · Answer 5 · Wed Jul 27 2022 02:13:11 GMT+0800 (China Standard Time)

I'm sorry for the delay.

> So the problem you found is not directly related to inoptimal concatenation.

No, the problem is caused by concatenation. The reason /^(abc|xyz).+/u is okay is that (abc|xyz) "has no choice". It can't exchange characters with .+, because the capturing group will always capture the first 3 characters (if the input string is accepted).
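A brief sketch of the "has no choice" point (the helper name and inputs are illustrative assumptions). Whenever the regex accepts the input, the group holds exactly the first three characters; when .+ has nothing to match, the regex rejects the input outright instead of shrinking the group.

```javascript
// Returns the text of group 1, or null when the regex does not match.
function fixedPrefix(input) {
  const match = /^(abc|xyz).+/u.exec(input);
  return match ? match[1] : null;
}

console.log(fixedPrefix("abcd"));   // accepted: group is the first 3 characters
console.log(fixedPrefix("xyzabc")); // accepted: group is the first 3 characters
console.log(fixedPrefix("abc"));    // rejected: .+ has nothing left to match
```

Compare this with /^(a*).+/u, where the group can silently give up a character so that the overall match succeeds.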

However, I do agree that finding these issues (= likely incorrect capturing groups) is not the purpose of this rule. I'll add an option to ignore warnings that cannot be fixed because of capturing groups.
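For reference, a configuration along these lines might look like the sketch below. The option name capturingGroups and its value "ignore" are assumptions based on later versions of the plugin and are not stated in this thread, so check the rule's documentation for the version you have installed.

```javascript
// .eslintrc.js (sketch). ASSUMPTION: the option is called `capturingGroups`
// and accepts "ignore"; verify against the rule's documentation.
const config = {
  plugins: ["regexp"],
  rules: {
    "regexp/optimal-quantifier-concatenation": [
      "error",
      { capturingGroups: "ignore" },
    ],
  },
};

module.exports = config;
```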
