False positive match for certain unicode characters

kkmuffme opened this issue 2 months ago · comments

kkmuffme commented 2 months ago

Bug Description

Reproduction steps

PHP >= 7.3

Enter the regex:

a[–]b

for text:

a–b

Expected Outcome

No match.
It should match only when the "u" flag is selected.

See php/php-src#14306
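
The expected behaviour can be approximated outside PHP. The following is a rough sketch in Python, not PHP itself: a `bytes` pattern from Python's `re` module is used as a stand-in for preg without the "u" flag, and a `str` pattern as a stand-in for preg with it. The variable names are made up for illustration.

```python
import re

subject = "a–b"   # U+2013 EN DASH, 3 bytes in UTF-8
pattern = "a[–]b"

# Stand-in for preg WITHOUT "u": pattern and subject are raw UTF-8 bytes,
# so [–] is a class of three separate bytes and matches only ONE of them.
byte_match = re.search(pattern.encode("utf-8"), subject.encode("utf-8"))

# Stand-in for preg WITH "u": pattern and subject are sequences of characters,
# so [–] matches the whole en dash.
char_match = re.search(pattern, subject)

# In byte mode the dash is only matched when the class is repeated three times.
three_byte_match = re.search("a[–]{3}b".encode("utf-8"), subject.encode("utf-8"))

print(byte_match is not None)        # False
print(char_match is not None)        # True
print(three_byte_match is not None)  # True
```

The last line is the giveaway that byte mode really treats the dash as three independent units.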

Browser

Chrome 124

OS

Win 11

Damian Wadley · Answer 1 · Fri May 24 2024 06:34:51 GMT+0800 (China Standard Time)

Not a bug, see my reply in the php/php-src issue.

kkmuffme · Answer 2 · Fri May 24 2024 12:04:23 GMT+0800 (China Standard Time)

I think you misunderstood the issue. It's NOT a bug in PHP (which is why I had already closed the issue in php-src before your reply, once I realized that was the case).

But it is a bug in regex101, because regex101 shows a match even WITHOUT the "u" flag (while PHP does not). This difference in behavior is the bug.

Damian Wadley · Answer 3 · Fri May 24 2024 12:41:59 GMT+0800 (China Standard Time)

Ah, I see what you mean...
I'm guessing perhaps the WASM version implicitly supports non-ASCII strings? But I'm not sure what flavor library is involved here, or if it's a custom build specifically for the site.

Sorry @working-name, does seem there is a problem here after all 😓

Firas Dib · Answer 4 · Fri May 24 2024 14:35:39 GMT+0800 (China Standard Time)

Could this be because the website uses UTF-16 while PHP uses UTF-8?

kkmuffme · Answer 5 · Fri May 24 2024 19:09:45 GMT+0800 (China Standard Time)

Possibly.
I guess the solution is similar to what already happens when you use a[💩]b: when the "u" flag is not set I see two "?" boxes, and when I select "u" the emoji is shown correctly.
That emoji is 4 bytes in UTF-8, while – is 3 bytes.

Please reopen the issue.

kkmuffme · Answer 6 · Mon May 27 2024 21:15:33 GMT+0800 (China Standard Time)

Just tested: this bug exists for ALL multi-byte characters, even basic 2-byte UTF-8 ones.
E.g. /a[ä]b/ for the text aäb: regex101 shows a match, PHP does not.

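REASON WHY:
PHP preg works on single bytes unless the "u" flag is provided.
In UTF-8, ä is 2 bytes, – is 3 bytes and 💩 is 4 bytes. In UTF-16, ä and – are each a single 16-bit code unit, while 💩 needs two code units (a surrogate pair).

Since 💩 does not fit in a single UTF-16 code unit, it behaves correctly. But – and ä are each a single unit in UTF-16, so if regex101 matches on UTF-16 code units the character class matches them as one character, which leads to these false positives.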
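
To make the sizes concrete, here is a small Python check of how many UTF-8 bytes and UTF-16 code units each of the three test characters takes. It only measures encodings and says nothing about how regex101 is implemented internally.

```python
# Count the storage units each test character needs in UTF-8 and UTF-16.
chars = {"ä": "U+00E4", "–": "U+2013", "💩": "U+1F4A9"}

sizes = {}
for ch, codepoint in chars.items():
    utf8_bytes = len(ch.encode("utf-8"))
    # utf-16-le has no BOM, so byte length / 2 is the number of code units
    utf16_units = len(ch.encode("utf-16-le")) // 2
    sizes[ch] = (utf8_bytes, utf16_units)
    print(f"{codepoint}: {utf8_bytes} UTF-8 bytes, {utf16_units} UTF-16 code unit(s)")

# U+00E4: 2 UTF-8 bytes, 1 UTF-16 code unit(s)
# U+2013: 3 UTF-8 bytes, 1 UTF-16 code unit(s)
# U+1F4A9: 4 UTF-8 bytes, 2 UTF-16 code unit(s)
```

Only the emoji spans more than one unit in UTF-16, which fits the observation that it is the only one of the three that does not produce a false positive.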

From the PHP manual, on the "u" modifier: "Pattern and subject strings are treated as UTF-8"

https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Possible solution: without the "u" flag, treat the pattern and text as single-byte data (e.g. ISO-8859-1, one character per byte), and with the "u" flag treat them as UTF-8?
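
A minimal sketch of that idea, written in Python rather than in the site's actual code: `preg_like_search` is a hypothetical helper, and Python's `re` is not PCRE, so this only illustrates the byte-per-character reinterpretation. Without the flag, the UTF-8 bytes of pattern and subject are re-read as ISO-8859-1 so that every byte becomes exactly one character.

```python
import re

def preg_like_search(pattern: str, subject: str, u_flag: bool):
    """Hypothetical helper approximating preg with or without the "u" flag."""
    if not u_flag:
        # Re-read the UTF-8 bytes as ISO-8859-1: one character per byte.
        pattern = pattern.encode("utf-8").decode("latin-1")
        subject = subject.encode("utf-8").decode("latin-1")
    return re.search(pattern, subject)

# (matched without "u", matched with "u") for each test character
results = {
    ch: (
        preg_like_search(f"a[{ch}]b", f"a{ch}b", u_flag=False) is not None,
        preg_like_search(f"a[{ch}]b", f"a{ch}b", u_flag=True) is not None,
    )
    for ch in ["ä", "–", "💩"]
}

for ch, (without_u, with_u) in results.items():
    print(f"U+{ord(ch):04X}: without u = {without_u}, with u = {with_u}")
```

With this reinterpretation all three characters fail to match without the flag and match with it. One consequence to keep in mind is that match offsets in the non-"u" branch are byte offsets, so they would need mapping back before highlighting the original text.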